Open bbockelm opened 7 years ago
Do we have FWJR from these jobs?
Nah - unfortunately, they get tossed by the jobs. Hence, the suggestion above is only theoretical -- based on a reading of the code.
In parallel, Chris Jones found & fixed a race condition / bug in the CMSSW_9_X series that might be causing some fields to be unexpectedly 0. That could be another source of the problem.
I thought from the e-mail discussion that these jobs were being treated as failed. But I don't see anything here except that some data doesn't get filled in FWJR and a log message is issued. Does the failure come further on or did I misunderstand? (I.e. I don't see anything a priori wrong with this code.)
Nah - the agent infrastructure treats them as successful; they just looked like failures because storage information is missing (WMRuntime zeros out all data).
There are at least two problems here:
Both are irritating, but the latter is more concerning given that it is an important use case.
I noticed that a few workflows had nearly 100% of their jobs reporting the following at certain sites:
(but other sites were performing just fine).
Looks like the message comes from here: https://github.com/dmwm/WMCore/blob/bfe084e94ec7351e71059c8bb1eb4b0cc2dfe5c9/src/python/WMCore/FwkJobReport/XMLParser.py#L489 . Unfortunately, the exception message doesn't include enough information to know what's wrong -- but certainly the jobs were successful and had storage statistics enabled.
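To illustrate what a more informative message could look like, here is a minimal sketch. It is not the WMCore code: `parse_storage_stats`, the logger name, and the input layout (a flat mapping of statistic name to string value) are all hypothetical, chosen only to show the idea of logging the offending name and value along with the exception.

```python
import logging

logger = logging.getLogger("fjr_parser_sketch")  # hypothetical logger name


def parse_storage_stats(raw_stats):
    """Convert raw string statistics to floats.

    Hypothetical stand-in for the real handler: the input is assumed to be
    a flat dict of {statistic name: string value}.
    """
    parsed = {}
    for name, value in raw_stats.items():
        try:
            parsed[name] = float(value)
        except (TypeError, ValueError) as ex:
            # Log the statistic name and the value that failed to convert,
            # so the log alone is enough to see what went wrong.
            logger.error(
                "Could not parse storage statistic %r with value %r: %s",
                name, value, ex,
            )
    return parsed
```

With something along these lines, a single bad entry is skipped and named in the log, while the remaining statistics are kept.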
I think there's some invalid assumption in the code that parses the FJR.
The most likely candidate is the fact that it assumes there is only one protocol for reading. In the case of fallback, it's likely that an onsite protocol (such as POSIX) and an offsite protocol (such as xrootd) are used.
It appears that the generated job report ends up reporting only whichever protocol was returned last in the hash.
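A minimal sketch of the suspected failure mode, under assumptions: the protocol names (`file`, `root`), the metric names, and both functions are hypothetical and are not taken from WMCore or CMSSW. The first function mimics a parser that assumes a single read protocol, so each protocol overwrites the previous one; the second sums across protocols instead.

```python
from collections import defaultdict

# Hypothetical per-protocol read statistics for a job that read some data
# onsite (POSIX, "file") and fell back to xrootd ("root") for the rest.
stats = {
    "file": {"readTotalMB": 120.0, "readNumOps": 300},
    "root": {"readTotalMB": 80.0, "readNumOps": 150},
}


def last_protocol_wins(per_protocol):
    """Mimic the single-protocol assumption: later entries overwrite earlier ones."""
    report = {}
    for metrics in per_protocol.values():
        report.update(metrics)
    return report


def sum_over_protocols(per_protocol):
    """Aggregate each metric across all protocols that were used."""
    report = defaultdict(float)
    for metrics in per_protocol.values():
        for name, value in metrics.items():
            report[name] += value
    return dict(report)
```

In this sketch, `last_protocol_wins(stats)` keeps only the `root` numbers because that protocol happens to come last, whereas `sum_over_protocols(stats)` accounts for both. Whether summing is the right aggregation for every metric is a separate question.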