In a recent incident, thousands of jobs went held with errors like "reading from file /storage/local/data1/condor/execute/dir_1221925/.empty_file: (errno 2) No such file or directory". The underlying cause was actually a problem with apptainer on the worker node, which meant the entire payload, including the wrapper script, could not run. Because the wrapper script never ran, .empty_file was never created, so the hold reason masked the real failure.
We should find a way to unmask errors like that, assuming we would then get something more useful (I don't know whether we would, and don't immediately know how to test it without a known-bad worker node). Maybe the job wouldn't even go held, and would just get re-queued?
Ideally we wouldn't have to transfer back a dummy file at all, but currently that only seems possible by setting transfer_output = False, which for some reason applies only to the Grid universe. We could ask the Condor team about that, or look into using the Grid universe, though that's a big solution to a little problem.
Maybe if we created .empty_file at submission time, and added it to transfer_input_files, that would make it always available? (barring some error transferring input, which would be a problem regardless)
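A rough sketch of what the submission-time approach could look like, in Python since that is what jobsub_lite is written in. The function names create_empty_file and add_to_transfer_input_files are hypothetical and not part of jobsub_lite; the only HTCondor behaviour assumed is that transfer_input_files takes a comma-separated list of paths. Whether this actually changes the hold reason on a bad worker node is still untested.

```python
import tempfile
from pathlib import Path

# Name of the dummy file the job is expected to transfer back.
EMPTY_FILE_NAME = ".empty_file"


def create_empty_file(submit_dir: str) -> str:
    """Create a zero-length .empty_file in submit_dir (hypothetical helper).

    Returns the absolute path so it can be listed in transfer_input_files.
    """
    path = Path(submit_dir).resolve() / EMPTY_FILE_NAME
    path.touch(exist_ok=True)  # no error if the file is already there
    return str(path)


def add_to_transfer_input_files(existing: str, new_path: str) -> str:
    """Append new_path to a comma-separated transfer_input_files value.

    Hypothetical helper: skips blank entries and avoids adding a duplicate.
    """
    entries = [e.strip() for e in existing.split(",") if e.strip()]
    if new_path not in entries:
        entries.append(new_path)
    return ",".join(entries)


if __name__ == "__main__":
    # Stand-in for the real submission directory.
    with tempfile.TemporaryDirectory() as submit_dir:
        empty = create_empty_file(submit_dir)
        value = add_to_transfer_input_files("wrapper.sh", empty)
        print("transfer_input_files = " + value)
```

With this in place the file would arrive in the job's scratch directory with the rest of the input sandbox, so it should exist even if the wrapper script never runs. The wrapper could keep its own creation step as a harmless fallback.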
For reference, .empty_file is currently created in the wrapper script at https://github.com/fermitools/jobsub_lite/blob/2d2b350d9c0b389a910ffc770e999a7292690551/templates/simple/simple.sh#L193