flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

flux-exec: Error: rank 0: cat: Value too large for defined data type #4572

Open garlick opened 1 year ago

garlick commented 1 year ago

Redirecting input via flux exec works in the shell, but when launched inside CTI, I'm getting the following errors:

2022-09-15T14:06:03.224712Z broker.err[0]: channel buffer error: rank = 0 pid = 13014, stream = stdin, len = 1048576: Success
flux-exec: Error: rank 0: cat: Value too large for defined data type
2022-09-15T14:06:03.255490Z broker.err[0]: server_write_cb: lookup_pid: No such file or directory

I'm using Flux 0.40.0-15. It happens with cat, sed, and a minimal C program that redirects input. I haven't seen this before in CTI when launching other programs, but it could be something with the input redirection.

Originally posted by @ardangelo in https://github.com/flux-framework/flux-core/issues/3631#issuecomment-1248153865

garlick commented 1 year ago

Based on a cursory look - we have a 4MB buffer in the broker and we treat filling it up as a fatal error. This probably wants end to end flow control but for now I wonder if we can handle the "buffer full" write error by backing off and retrying?
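To make the "back off and retry" idea concrete, here is a minimal Python sketch. It is not flux-core code: `BoundedBuffer`, `write_with_backoff`, and the `on_full` callback are hypothetical names for a toy fixed-capacity buffer, and the sketch assumes the consumer can drain the buffer between attempts.

```python
import time
from collections import deque


class BoundedBuffer:
    """Toy fixed-capacity buffer (hypothetical stand-in, not the broker's buffer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = deque()
        self.used = 0

    def try_write(self, data):
        """Accept data only if it fits entirely; return True on success."""
        if self.used + len(data) > self.capacity:
            return False
        self.chunks.append(data)
        self.used += len(data)
        return True

    def drain(self, nbytes):
        """Simulate the consumer removing whole chunks, up to nbytes in total."""
        drained = 0
        while self.chunks and drained + len(self.chunks[0]) <= nbytes:
            drained += len(self.chunks.popleft())
        self.used -= drained
        return drained


def write_with_backoff(buf, data, on_full, max_retries=5, delay=0.001):
    """Treat "buffer full" as retryable: back off exponentially, then retry."""
    for attempt in range(max_retries + 1):
        if buf.try_write(data):
            return attempt  # number of retries that were needed
        if attempt == max_retries:
            break
        on_full()          # give the consumer a chance to make room
        time.sleep(delay)  # wait before the next attempt
        delay *= 2         # double the wait each time
    raise BufferError("buffer still full after %d retries" % max_retries)
```

In this sketch a full buffer only becomes a hard error once the retries are exhausted. A retry loop like this only masks the problem when the consumer is slow, which is why end-to-end flow control would still be the more complete fix.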

ardangelo commented 1 year ago

Update on this: I have been trying this approach again to ship files. For small files it usually completes successfully. However, with larger files (100+ MB), the first invocation will often fail, but an immediate retry will work.

The error is different in this case:

cat /tmp/cti-adangelo/cti_daemonrW8WUO1.tar | flux exec -r 0 sed -n 'w /tmp/flux-71aFeA/jobtmp-0-ƒ9HCP58r2B/cti_daemonrW8WUO1.tar'
May 11 14:32:40.908869 broker.err[0]: Error writing 65536 bytes to subprocess pid 17760 stdin
May 11 14:32:40.911301 broker.err[0]: Error writing 65536 bytes to subprocess pid 17760 stdin: unknown pid
May 11 14:32:40.913095 broker.err[0]: Error writing 65536 bytes to subprocess pid 17760 stdin: unknown pid
(Repeated)

grondo commented 1 year ago

Unrelated to the actual bug discussed in this issue, I'll note that @garlick developed a better method for shipping files via flux-filemap(1).

This is integrated into a stage-in job shell option if that works in your use case. See the flux-shell(1) manpage for a description of the options.

Edit: though I didn't find any examples in the documentation of the steps required to use the stage-in plugin. We may want to add that. For now, feel free to ask questions where things are not self-explanatory!

Edit2: There are some examples in the flux-filemap(1) manpage, but they do not include use of the stage-in job shell option.

ardangelo commented 1 year ago

We're using flux-filemap to ship files from the broker node to the non-broker nodes, but we still needed a way to get the file from the frontend where we're running our debugger tools to the broker node.

Currently, though, we only support running our tools from inside the flux start shell. Could we add files to the filemap directly in that case, without worrying that the broker might be running somewhere else?

grondo commented 1 year ago

I've posted your question as the beginning of a Discussion thread here: #5168

I'm pretty confident flux-filemap(1) will handle your use case, but since it isn't clear, we can use the discussion in the Q&A thread to perhaps improve documentation or add a FAQ. It might help to give more specifics of how you're trying to use flux-filemap over in that discussion. Thanks!