dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

JobAccountant incorrectly failing jobs that lack file location #12092

Closed amaltaro closed 2 months ago

amaltaro commented 2 months ago

Impact of the bug: WMAgent

Describe the bug: As described in these two comments, https://github.com/dmwm/WMCore/issues/11956#issuecomment-2329503741 and https://github.com/dmwm/WMCore/issues/11956#issuecomment-2329515605, we can conclude that the way JobAccountant handles jobs lacking a location for one or more of their output files is insufficient. The component properly fails those jobs, but somehow their output files are still considered for upcoming jobs (e.g. merge jobs), which is a terrible mistake!

How to reproduce it: It was never understood how output files end up being reported without any location, so it is hard to say how this issue can be reproduced.

Expected behavior: First, a job should never report an output file without any location. However, that is outside the scope of this issue.

Second, if JobAccountant identifies such jobs, then in addition to marking them as failed it should also disregard every output file they generated. Those output files must not be considered for subsequent jobs further down the task dependency chain.
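To make the expected behavior concrete, here is a minimal sketch in Python. It is not the actual JobAccountant code: `OutputFile`, `JobReport` and `account_jobs` are hypothetical names made up for this illustration, and the file paths and site names are example values. The only point it shows is that a job with at least one location-less output file gets failed and that all of its outputs, including the ones that do have a location, are kept out of the pool used for downstream jobs.

```python
from dataclasses import dataclass, field


@dataclass
class OutputFile:
    """Hypothetical stand-in for an output file reported by a job."""
    lfn: str
    locations: set = field(default_factory=set)


@dataclass
class JobReport:
    """Hypothetical stand-in for what a finished job reports."""
    job_id: int
    output_files: list


def account_jobs(reports):
    """Return (failed_job_ids, usable_files).

    A job is failed if any of its output files has no location.
    None of the outputs of a failed job are returned as usable,
    so they can never feed a downstream (e.g. merge) job.
    """
    failed_job_ids = []
    usable_files = []
    for report in reports:
        if any(not out.locations for out in report.output_files):
            # Fail the job and drop ALL of its outputs, not only
            # the files that are missing a location.
            failed_job_ids.append(report.job_id)
            continue
        usable_files.extend(report.output_files)
    return failed_job_ids, usable_files


# Example input: job 2 reports one file without any location.
reports = [
    JobReport(1, [OutputFile("/store/a.root", {"T1_US_FNAL_Disk"})]),
    JobReport(2, [OutputFile("/store/b.root", {"T2_CH_CERN"}),
                  OutputFile("/store/c.root")]),
]
failed, usable = account_jobs(reports)
print(failed)                     # [2]
print([f.lfn for f in usable])    # ['/store/a.root']
```

Note that `/store/b.root` is discarded even though it has a location, because it belongs to a failed job. That is the part the current behavior appears to be missing.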

Additional context and error message: None

amaltaro commented 1 month ago

Hi @hassan11196, I compiled a list of potential datasets/workflows with duplicate lumis and placed it in this text file: https://amaltaro.web.cern.ch/forWMCore/Issue_12092/potential_dups.txt

Please let me know if you prefer it in a different format.

This list is based on: