fermitools / jobsub_lite

jobsub_lite is a wrapper for HTCondor job submission
Apache License 2.0

(non-)creation of .empty_file in wrapper can mask underlying errors #560

Open retzkek opened 7 months ago

retzkek commented 7 months ago

In a recent incident, thousands of jobs went held with errors like `reading from file /storage/local/data1/condor/execute/dir_1221925/.empty_file: (errno 2) No such file or directory`. The underlying cause was actually an apptainer problem on the worker node, which meant the entire payload, including the wrapper script, could not run.

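The .empty_file is currently created in the wrapper script at https://github.com/fermitools/jobsub_lite/blob/2d2b350d9c0b389a910ffc770e999a7292690551/templates/simple/simple.sh#L193, so when the wrapper never runs the file never exists, and the failed output transfer is what gets reported instead of the real error.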

We should find a way to unmask errors like this, assuming we would then get something more useful. I don't know whether we would, and I don't immediately see how to test it without a known-bad worker node. Maybe the job wouldn't even go held and would just get re-queued?

  1. Ideally we wouldn't have to transfer back a dummy file at all, but currently that only seems possible by setting transfer_output = False, which for some reason applies only to the Grid universe. We could ask the Condor team about that, or look into using the Grid universe, though that's a big solution to a little problem.
  2. Maybe if we created .empty_file at submission time and added it to transfer_input_files, it would always be available? (Barring some error transferring input, which would be a problem regardless.)
  3. Something else?
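
For option 2, a minimal sketch of what the submission-side change might look like. This is only an illustration, not jobsub_lite's actual code: the helper name `add_empty_file`, the `submit_dir` argument, and the assumption that transfer_input_files is held as a comma-separated string at that point are all hypothetical.

```python
import os
import tempfile


def add_empty_file(submit_dir: str, transfer_input_files: str) -> str:
    """Create .empty_file in submit_dir and append it to the input list.

    Hypothetical helper: assumes transfer_input_files is a comma-separated
    string, as it appears in an HTCondor submit description.
    """
    path = os.path.join(submit_dir, ".empty_file")
    # Opening in append mode creates the file if it is missing and
    # leaves an existing one untouched.
    with open(path, "a"):
        pass

    # Split the existing list, dropping blanks from an empty or
    # trailing-comma value.
    files = [f.strip() for f in transfer_input_files.split(",") if f.strip()]
    if path not in files:
        files.append(path)
    return ",".join(files)


if __name__ == "__main__":
    # Example: a temporary directory stands in for the real submission dir.
    with tempfile.TemporaryDirectory() as submit_dir:
        value = add_empty_file(submit_dir, "payload.tar,setup.sh")
        print(f"transfer_input_files = {value}")
```

With something like this in place the file would arrive in the job sandbox with the other inputs, so it would exist for output transfer whether or not the wrapper ever starts. Whether that actually surfaces a more useful error when apptainer fails is still the open question above.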