dmwm / WMCore

WMCore FJR parsing throws exception; no storage statistics #8031

Open bbockelm opened 7 years ago

bbockelm commented 7 years ago

I noticed that, for a few workflows, nearly 100% of jobs at certain sites were reporting the following:

ERROR:root:Tried to divide by zero doing storage statistics report parsing

(but other sites were performing just fine).

Looks like the message comes from here: https://github.com/dmwm/WMCore/blob/bfe084e94ec7351e71059c8bb1eb4b0cc2dfe5c9/src/python/WMCore/FwkJobReport/XMLParser.py#L489 . Unfortunately, the exception message doesn't include enough information to know what's wrong -- but certainly the jobs were successful and had storage statistics enabled.

I think there's some invalid assumption in the code that parses the FJR.

The most likely candidate is the fact that it assumes there is only one protocol for reading. In the case of fallback, it's likely that an onsite protocol (such as POSIX) and an offsite protocol (such as xrootd) are used.

It appears that the generated job report includes only whichever protocol happened to be returned last from the hash.
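To make the hypothesis concrete, here is a minimal sketch of a read summary that sums over every protocol rather than keeping only the last one seen, and that logs the offending values when the denominator is zero. This is not the WMCore parser: the function name `summarize_reads`, the metric-name layout `Timing-<protocol>-read-<field>`, and the output keys are all assumptions made for illustration.

```python
import logging
import re

# Hypothetical metric-name layout, assumed for illustration only:
#   Timing-<protocol>-read-totalMegabytes
#   Timing-<protocol>-read-totalMsecs
READ_METRIC = re.compile(
    r"^Timing-(?P<protocol>[^-]+)-read-(?P<field>totalMegabytes|totalMsecs)$"
)


def summarize_reads(metrics):
    """Sum read volume and read time over every protocol in `metrics`.

    `metrics` maps metric names to values (strings or numbers).
    Returns a dict; the rate is None when no read time was recorded.
    """
    per_protocol = {}
    for name, value in metrics.items():
        match = READ_METRIC.match(name)
        if match is None:
            continue  # not a read metric in the assumed layout
        stats = per_protocol.setdefault(
            match.group("protocol"),
            {"totalMegabytes": 0.0, "totalMsecs": 0.0},
        )
        stats[match.group("field")] += float(value)

    total_mb = sum(s["totalMegabytes"] for s in per_protocol.values())
    total_secs = sum(s["totalMsecs"] for s in per_protocol.values()) / 1000.0

    if total_secs == 0:
        # Say which protocols and values were seen, so the log is actionable.
        logging.error(
            "Zero total read time in storage statistics; protocols=%s values=%s",
            sorted(per_protocol), per_protocol,
        )
        rate = None
    else:
        rate = total_mb / total_secs

    return {
        "protocols": sorted(per_protocol),
        "readTotalMB": total_mb,
        "readTotalSecs": total_secs,
        "readMBPerSec": rate,
    }
```

With this shape, a job that read through both an onsite and an offsite protocol still gets a rate even if one protocol reports zero time, and the all-zero case leaves enough in the log to tell which fields were empty.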

ericvaandering commented 7 years ago

Do we have the FWJR from these jobs?

bbockelm commented 7 years ago

Nah - unfortunately, they get tossed by the jobs. Hence, the suggestion above is only theoretical, based on a reading of the code.

In parallel, Chris Jones found & fixed a race condition / bug in the CMSSW_9_X series that might be causing some fields to be unexpectedly 0. That could be another source of the problem.

ericvaandering commented 7 years ago

I thought from the e-mail discussion that these jobs were being treated as failed. But I don't see anything here except that some data doesn't get filled in the FWJR and a log message is issued. Does the failure come further on, or did I misunderstand? (I.e., I don't see anything a priori wrong with this code.)

bbockelm commented 7 years ago

Nah - the agent infrastructure treats them as successful; they just looked like failures because storage information is missing (WMRuntime zeros out all data).

There are at least two problems here:

Both are irritating, but the latter is more concerning given that it is an important use case.