Open todor-ivanov opened 10 months ago
@todor-ivanov Todor, I suspect this is actually a buggy job report with 'locations': set() instead of an incomplete SQL statement. If there is something to fix, I would say we need to ensure that job reports provide a non-empty location for the output files.
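To illustrate the kind of check meant here, a minimal sketch follows. It assumes each output file in a job report is a plain dict with an 'lfn' and a 'locations' set; the function name, the LFNs and the site name are hypothetical and not taken from WMCore.

```python
def files_missing_location(job_report_files):
    """Return the LFNs of output files whose 'locations' entry is empty or absent."""
    return [
        f.get("lfn", "<unknown>")
        for f in job_report_files
        if not f.get("locations")  # catches set(), None and a missing key
    ]


# Hypothetical job report content, for illustration only.
report_files = [
    {"lfn": "/store/a.root", "locations": {"T1_US_FNAL_Disk"}},
    {"lfn": "/store/b.root", "locations": set()},
]

print(files_missing_location(report_files))
```

A report for which this list is non-empty could be flagged before any database insert is attempted.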
hi @amaltaro
Yes, I completely agree with you. The current cause is a missing WMBS file location, which I already mentioned. We even know why the location was left empty for this workflow: it is not related to the current bug, but was a consequence of separate development and a test in the context of the migration to storage.json. I created the current issue because the whole set of DAO calls cited in the GH description would inevitably lead to a complete crash of the component, triggered by a single workflow (a file set with a missing location this time, or any other error of the sort in the future). To me, we should protect the component from being interrupted in such a scenario.
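One way to picture this protection is the sketch below. It is only an illustration under assumptions: records are grouped per workflow in a dict, and `process_in_isolation` and `demo_handler` are hypothetical names, not the actual JobAccountant code or DAO calls.

```python
import logging


def process_in_isolation(records_by_workflow, handler):
    """Run handler once per workflow so one broken workflow cannot stop the rest.

    Returns a dict mapping each failed workflow name to its error message.
    """
    failed = {}
    for workflow, records in records_by_workflow.items():
        try:
            handler(records)
        except Exception as exc:  # deliberately broad: isolate any per-workflow failure
            logging.exception("Skipping workflow %s", workflow)
            failed[workflow] = str(exc)
    return failed


def demo_handler(records):
    """Stand-in for the database step: rejects file records without a location."""
    for record in records:
        if not record["locations"]:
            raise ValueError("file %s has no location" % record["lfn"])


# Hypothetical input, for illustration only.
batch = {
    "wf_good": [{"lfn": "/store/a.root", "locations": {"T1_US_FNAL_Disk"}}],
    "wf_bad": [{"lfn": "/store/b.root", "locations": set()}],
}
failed = process_in_isolation(batch, demo_handler)
```

With this shape, the healthy workflow is still processed in the same cycle, and the failed one is reported back to the caller instead of bringing the whole component down.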
Yes, the approach we have been taking with the WMAgent components is that some errors are supposed to be soft, while others are hard errors where a crash of the component is necessary to catch the developers' attention and get the issue properly investigated. I agree that not having components crash is the ultimate goal, but I fear that we won't pay enough attention to problems if we treat everything as a soft error and leave developers with nothing but a notification of the error.
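The soft versus hard distinction could be sketched roughly as below. The exception classes and `run_step` are hypothetical and only illustrate the policy; they are not the exception hierarchy used by WMCore.

```python
import logging


class SoftError(Exception):
    """Recoverable problem: log it and let the component carry on."""


class HardError(Exception):
    """Problem that must stop the component so that someone investigates."""


def run_step(step, *args):
    """Run one step of a cycle; swallow soft errors, let everything else propagate."""
    try:
        return step(*args)
    except SoftError as exc:
        logging.warning("Soft error in %s: %s", step.__name__, exc)
        return None
    # HardError and any unexpected exception are intentionally not caught here.


def soft_step():
    raise SoftError("file record without location")


def hard_step():
    raise HardError("database schema mismatch")
```

Here `run_step(soft_step)` returns None after logging a warning, while `run_step(hard_step)` still raises, so the crash that draws attention is preserved for the cases that need it.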
For this specific "no locations" issue, we have had the same problem for many years and, given the failure to properly identify and fix it at the root, we made this workaround: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobAccountant/AccountantWorker.py#L504-L508
which I expected to catch this case as well.
In addition to the solution you provided, it's not clear to me what you suggest should be done with the potentially ill job. Would we retry it until someone takes care of it?
Impact of the bug WMAgent - JobAccountant
Describe the bug
During the migration to storage.json for stage in and stage out, we stumbled on a failure [1] of the JobAccountant component, caused by a broken SQL query in one of the steps of the component's cycle. This was first reported here: https://github.com/dmwm/WMCore/pull/11790#issuecomment-1855343294
Even though the origin of the error is a change that left a WMBS file record without location, the effect is definitely undesired: the component failed completely. Moreover, even after the workflow itself had been aborted, the records remained in WMBS and the component continued crashing. The place in JobAccountant where this exception is thrown is here: https://github.com/dmwm/WMCore/blob/87ea437f508308b5e1d544a234a80e2a9911d5c1/src/python/WMComponent/JobAccountant/AccountantWorker.py#L836-L867
How to reproduce it By breaking a file record in WMBS
Expected behavior The component should not break due to a single broken record in the database. Instead:
Additional context and error message [1]