jonzarling closed this issue 3 years ago
This would actually be more complex and more work. If a file isn't found, is it because it doesn't exist or because it isn't accessible? Okay, so now we need to check all locations. But what if, because of some networking or internal issue, the submit host can't access all locations? Do we try again? Do we just give up? I've thought a lot about it, and this is a Pandora's box.
This is not laborious, or rather it should not be. A simple query does it: SELECT DISTINCT RunNumber FROM Jobs WHERE IsActive=0; For more detail (e.g. version set), I can come up with a different query.
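To make the query above concrete, here is a minimal sketch that runs it against a toy in-memory SQLite table. Only the Jobs table and the RunNumber and IsActive columns come from the query itself; the SQLite backend, the ID column, and the run numbers are assumptions made purely for illustration, and the real database may look different.

```python
import sqlite3


def inactive_runs(conn):
    """Return the sorted run numbers that have at least one deactivated job."""
    rows = conn.execute(
        "SELECT DISTINCT RunNumber FROM Jobs WHERE IsActive = 0 ORDER BY RunNumber"
    ).fetchall()
    return [row[0] for row in rows]


# Toy stand-in for the real database (assumed schema, made-up run numbers).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Jobs (ID INTEGER PRIMARY KEY, RunNumber INTEGER, IsActive INTEGER)"
)
conn.executemany(
    "INSERT INTO Jobs (RunNumber, IsActive) VALUES (?, ?)",
    [(30274, 1), (30275, 0), (30275, 0), (30276, 1), (30280, 0)],
)

print(inactive_runs(conn))
```

The ORDER BY is an addition so that the list of affected runs comes back in a stable order; DISTINCT already collapses runs that have several deactivated jobs.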
If you click on your project and sort the isActive column so the 0's come first, you will see a list of all the jobs that failed due to missing random triggers. In fact, the system will confirm it doesn't have the files before truly deactivating a job; otherwise it returns exit code 232 (I think it is 232). Before resubmitting any job whose last attempt failed with 232, the system checks whether the file exists.
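The check described above can be sketched roughly as follows. This is not the actual system code: the function name is hypothetical, and the exit code value 232 is the one quoted above with an "I think", so treat it as unconfirmed. The sketch only covers the 232 case and says nothing about how other failures are handled.

```python
# Exit code quoted in this thread for a missing random trigger file;
# the value is unconfirmed ("I think it is 232").
MISSING_RANDOM_TRIGGER_EXIT = 232


def should_deactivate(last_exit_code, trigger_file_exists):
    """Deactivate only when the last attempt failed with the missing-trigger
    code AND a fresh check confirms that the file really is not there."""
    if last_exit_code != MISSING_RANDOM_TRIGGER_EXIT:
        return False  # some other outcome; not decided by this check
    return not trigger_file_exists


print(should_deactivate(232, trigger_file_exists=False))
print(should_deactivate(232, trigger_file_exists=True))
```

A job that failed with 232 but whose file turns up on the re-check stays active and can be resubmitted, which is why deactivated jobs can be read as confirmed missing random triggers.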
Having these, I would argue, makes it easy to find the missing random trigger files, as opposed to people just not running some run numbers sometimes.
We had a verbal discussion, which I'll summarize here. Basically, if one is submitting to OSG, the back end will handle these sorts of issues, so the scope of this issue is limited to other batch systems.
The easiest approach, at least for MCWrapper to handle, is adding a new table to the RCDB. One could then add a query condition to skip over runs without random triggers, as is already done for the "is_production" flag. This is an external change that would need to be discussed elsewhere.
For what it's worth, we are in the process of regenerating the random trigger files, so there shouldn't be any runs without random triggers.
So maybe it's easier to wait for that to be finished.
Ok, I'll close this under the assumption that we shouldn't have run holes in the future.
If one is on a system that can check whether random triggers exist for a given run, I think we need to add a default behavior of not submitting jobs without random trigger files (assuming one is trying to fold these in). It becomes rather laborious to check whether jobs failed due to missing random triggers or due to some other error; not submitting in the first place seems better.
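As a rough illustration of the proposed default, the sketch below splits a list of runs into those that have a random trigger file and those that don't, before anything is submitted. The file naming pattern, the function name, and the run numbers are all hypothetical; the real file names and locations are not given in this thread, and this is not MCWrapper code.

```python
import tempfile
from pathlib import Path

# Hypothetical naming scheme, used only for this example.
TRIGGER_NAME = "run{run:06d}_random.hddm"


def split_runs(run_numbers, trigger_dir):
    """Split runs into (submit, skipped) by whether a trigger file is present."""
    submit, skipped = [], []
    for run in run_numbers:
        path = Path(trigger_dir) / TRIGGER_NAME.format(run=run)
        (submit if path.is_file() else skipped).append(run)
    return submit, skipped


# Demo against a temporary directory holding files for two of three runs.
with tempfile.TemporaryDirectory() as tmp:
    for run in (30274, 30276):
        (Path(tmp) / TRIGGER_NAME.format(run=run)).touch()
    submit, skipped = split_runs([30274, 30275, 30276], tmp)

print("submit:", submit)
print("skipped:", skipped)
```

Note that a plain existence check like this cannot tell a file that is truly missing from one that is temporarily unreachable from the submit host, so a skipped run is not proof that the file does not exist.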