fermitools / jobsub_lite

jobsub_lite is a wrapper for HTCondor job submission
Apache License 2.0
1 stars 7 forks source link

If a disk quota check errors out, we shouldn't fail the submission #539

Closed shreyb closed 6 months ago

shreyb commented 6 months ago

In 1.6, we added a disk space check to jobsub_lite. Right now, if that check fails (that is, there's any error in CHECKING the quota), jobsub_lite errors out (see ServiceNow incident INC000001168697). This was the traceback from that incident:

Traceback (most recent call last):
  File "/opt/jobsub_lite/bin/jobsub_submit", line 224, in <module>
    main()
  File "/opt/jobsub_lite/lib/tracing.py", line 172, in wrapper
    res = func(*args, **kwargs)
  File "/opt/jobsub_lite/bin/jobsub_submit", line 189, in main
    set_extras_n_fix_units(varg, schedd_name, cred_set)
  File "/opt/jobsub_lite/lib/utils.py", line 261, in set_extras_n_fix_units
    if not check_space(args["submitdir"], need_blocks):
  File "/opt/jobsub_lite/lib/utils.py", line 512, in check_space
    fs, d_ok = check_space_df(path, min_kblocks, min_files, verbose)
  File "/opt/jobsub_lite/lib/utils.py", line 544, in check_space_df
    availcol = headers.index("Available")
ValueError: 'Available' is not in list

I think this is a bug in the behavior, as the quota check could fail due to a number of reasons (like the user submitting it from the wrong dir) that might not actually fail in the jobs getting submitted.

I propose that if the running of the quota check errors out in this way (NOT if the check itself runs properly and returns that there is insufficient space), we should catch that error and print a warning that there was an error running the check, but that we'll try to submit anyway.

shreyb commented 6 months ago

@marcmengel What do you think about this idea? If you agree to it, I don't think we need to bring it to the whole working group, since IMO it's really a bug and not a new feature request.

marcmengel commented 6 months ago

This is clearly a bug, or possibly 2 or 3 convolved :-(. If I'm parsing the output, I should be setting LC_ALL=C or some such and I'm not, second, even so there could be some reason we don't get any output from the command, and we should just admit we can't check here and ignore it...

retzkek commented 6 months ago

So I want to back up here. Why was the decision made to solve #506 by checking disk space and quota instead of improving the error handling of os errors? We should be looking to simplify whenever possible, not add more complexity. It’s supposed to be “lite” right?

marcmengel commented 6 months ago

Checking disk space first is better, because filling up the disk and then noticing you've done so also breaks other things -- it's a shared resource. Of course, we should do both -- handle the os errors and check first...

retzkek commented 6 months ago

Filesystems have quotas to protect shared resources, and the error handling should be able to unwind and clean up, leaving the system in the same state it was before. That's where I'd like to see effort spent, instead of piling on complexity to try to foresee every thing, where the complexity itself causes more issues.

shreyb commented 6 months ago

So @retzkek you'd rather us catch the OSError when trying to write files here, for example:

https://github.com/fermitools/jobsub_lite/blob/fcdb2725d27d8be801bff29bbfe435c2fae9c208/lib/render_files.py#L77

And print out a helpful error message rather do any checks there at all? Am I understanding correctly?

retzkek commented 6 months ago

Yes, and also go back through and clean up the files that were created before the error occurred (maybe that already happens higher up the stack?).

And I'm not arguing against any checks, only against complex checks that substantially increase the scope, surface area, or potential side-effects of what the code is supposed to be doing.