dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

T0 JobAccountant crashes due to multiple jobs generating same file #10870

Closed by germanfgv 2 years ago

germanfgv commented 2 years ago

Impact of the bug: T0Agent

Describe the bug: Two or more jobs create output files with the same name. JobAccountant tries to add them to the DB and fails with an ORA-00001: unique constraint (WMBS_FILDETAILS_UNIQUE) violated error.

We have seen the issue affecting Express jobs, but it may be affecting Repack and PromptReco too.

How to reproduce it: Deploy a Tier0 replay with a significant number of jobs. As soon as the first batches of Express jobs finish, the JobAccountant will crash.

Expected behavior: Each job should create files with unique names.

Additional context and error message: The full error message of the component is attached as ComponentLog.txt

Full JobAccountant logs can be found here: /afs/cern.ch/user/c/cmst0/public/JobAccountant/JobAccountantLogs

Logs of a set of jobs generating files with the same name: /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs

amaltaro commented 2 years ago

Possibly related to (if not the same root cause): https://github.com/dmwm/WMCore/issues/9633

germanfgv commented 2 years ago

@amaltaro from what I understand, in #9633 duplicate files are created with a default name and that causes the JobAccountant to crash. In our case, however, that doesn't seem to be happening. For example, for jobs in /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g1/, the duplicate file is:

/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/86A2833A-2C31-11EC-AF9E-D0C08E80BEEF.root

Could this still have the same root cause?

todor-ivanov commented 2 years ago

Hi @germanfgv,

It could still be related to #9633, even though the duplicate filename is not due to some defaults as in the issue above. I hope we can get back to it during this week. Do you still have some logs on a machine which is experiencing it? I am not sure how easily reproducible the error is.

germanfgv commented 2 years ago

It is fairly easy to reproduce. It appears in every replay we run. Just tell me what you need; I'm sure I can provide it for you.

amaltaro commented 2 years ago

In order to fix it, I think we need to start making the uuid a function of:

It should be possible to adapt it with the uuid5 algorithm. More info at: https://docs.python.org/3/library/uuid.html
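
For illustration, a minimal sketch of deriving a deterministic filename with uuid5. The inputs below (job name, output module, lfnBase) are hypothetical placeholders, since the comment above does not spell out which fields the uuid should be a function of:

```python
import uuid

# Hypothetical inputs for illustration only; the actual fields the uuid
# should depend on are not specified in this issue.
job_name = "Express_Run343082_StreamCalibration-job_1169"
module_name = "write_StreamCalibration_DQMIO"
lfn_base = "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547"

# uuid5 is deterministic: the same (namespace, name) pair always yields the
# same uuid, while distinct job/module combinations yield distinct names.
seed = "%s/%s/%s" % (lfn_base, job_name, module_name)
file_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seed)

lfn = "%s/%s.root" % (lfn_base, file_uuid)
print(lfn)
```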

amaltaro commented 2 years ago

@khurtado this one is tightly coupled to https://github.com/dmwm/WMCore/issues/9011 , and I think the fix you will propose will actually close both this one and #9011. I assigned this one to you and moved it to Work in progress as well.

khurtado commented 2 years ago

Thanks Alan!

khurtado commented 2 years ago

Here are my findings:

At runtime, we make a PSet tweak to change the name of the output files based on the output modules:

        lfn = "%s/%s/%s.root" % (lfnBase, lfnGroup(job), modName)
        result.addParameter("process.%s.logicalFileName" % modName, lfn)

https://github.com/dmwm/WMCore/blob/baf7ae586483d52f0b87850b411225f17bb918ed/src/python/PSetTweaks/WMTweak.py#L522-L524

So, the output files have a pattern like:

/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/RAWSIMoutput.root

After cmsRun is done, we change the LFN of the file, and with it the filename used at stage-out. The file then looks like:

/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/2AE85F14-94A1-EC11-BBF5-FA163EC7AA59.root

But here is the thing: we don't set the uuid for the filename ourselves. We basically grab the GUID from the generated Framework XML job report here:

https://github.com/dmwm/WMCore/blob/6cab2cbec356c63f9c175ac21995dc199ea0ad5d/src/python/WMCore/FwkJobReport/FileInfo.py#L110-L111

The GUID is generated by FWCore: https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/Guid.cc#L18-L28

And we can't just change the file LFN to use our own uuid for the filename, since we also enforce and check the GUID in the filename using this utility:

https://github.com/cms-sw/cmssw-wm-tools/blob/master/bin/cmssw_enforce_guid_in_filename.py
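
Roughly, the consistency being enforced looks like the following sketch (an illustration only, not the actual cmssw-wm-tools script): the filename stem must match the GUID recorded in the framework job report.

```python
import os

def guid_matches_filename(pfn, report_guid):
    """Illustrative check only; the real cmssw_enforce_guid_in_filename.py
    may differ. Returns True when the file's basename (without extension)
    equals the GUID recorded in the framework job report."""
    stem = os.path.splitext(os.path.basename(pfn))[0]
    return stem == report_guid

# Example with values taken from this thread:
pfn = ("/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/"
       "Express-v2110131547/000/343/082/00001/86A2833A-2C31-11EC-AF9E-D0C08E80BEEF.root")
print(guid_matches_filename(pfn, "86A2833A-2C31-11EC-AF9E-D0C08E80BEEF"))  # True
```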

So it seems the best solution here is to report the issue to cmssw and have it fixed on that end.

khurtado commented 2 years ago

@germanfgv Do you happen to have or know anyone with direct access to the T2_CH_CERN worker nodes? Specifically the nodes from here:

[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "condor*out" -exec grep -H Hostname {} \; | sort
./g1/job_1038/condor.41731.37.out:Hostname:   b7g18p9798.cern.ch
./g1/job_1048/condor.41731.47.out:Hostname:   b7g18p9798.cern.ch
./g2/job_98/condor.41726.97.out:Hostname:   b7g18p7310.cern.ch
./g2/job_99/condor.41726.98.out:Hostname:   b7g18p7310.cern.ch
./g3/job_1137/condor.41731.136.out:Hostname:   b7g17p4406.cern.ch
./g3/job_1138/condor.41731.137.out:Hostname:   b7g17p4406.cern.ch
./g4/job_1169/condor.41731.168.out:Hostname:   b7g18p3673.cern.ch
./g4/job_1170/condor.41731.169.out:Hostname:   b7g18p3673.cern.ch
./g5/job_527/condor.41729.133.out:Hostname:   b7g10p4995.cern.ch
./g5/job_530/condor.41729.136.out:Hostname:   b7g10p4995.cern.ch
./g5/job_531/condor.41729.137.out:Hostname:   b7g10p4995.cern.ch
./g6/job_4623/condor.41780.47.out:Hostname:   b7g17p1733.cern.ch
./g6/job_4635/condor.41780.59.out:Hostname:   b7g17p1733.cern.ch

CMSSW is asking us to check the availability of /dev/urandom and the contents of /proc/sys/kernel/random/entropy_avail. I still don't know whether these differ inside containers, but it should be trivial to check after invoking singularity.
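
A minimal sketch of such a check (assuming a Linux worker node, run inside the singularity container):

```python
import os

# Check that the kernel's non-blocking random source is present and readable.
print("/dev/urandom exists:", os.path.exists("/dev/urandom"))
with open("/dev/urandom", "rb") as fd:
    fd.read(16)  # raises an error if the device is not readable

# Report the kernel's estimate of available entropy (in bits).
with open("/proc/sys/kernel/random/entropy_avail") as fd:
    print("entropy_avail:", fd.read().strip())
```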

khurtado commented 2 years ago

@germanfgv From your logfiles on:

/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs

Would you be able to get a pair of job logfiles with outputs other than DQMIO?

For example, your g3 or g4 directories do have RAW, ALCARECO in the list of duplicated files in the json, but only the DQMIO job logfiles are present.

The cmssw folks are requesting this on: https://github.com/cms-sw/cmssw/issues/37240

To summarize: It looks like the DQMIO issue is understood, but they need more information for the ALCARECO, RAW, etc.

germanfgv commented 2 years ago

@khurtado please check the logs that you can find here: /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4/job_1106

I'll check if we have more logs. If not, we can try and generate more examples.

khurtado commented 2 years ago

Hi @germanfgv. So, for g4, I can see the duplicated DQMIO file from jobs 1169 and 1170, but not 1106:

[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep DQMIO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root

I couldn't find duplicated names for ALCARECO and RAW, though. Am I perhaps looking at this the wrong way?

# ALCARECO
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep ALCARECO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/98e014a9-8224-49b9-b5f9-5a77fca89a16.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/b0e5793e-1d8e-4a04-a5c4-f68ea475692e.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root

# RAW
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep RAW
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/40ec5397-771e-4478-91d0-45e7c63aec5d.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/b22d070b-8f19-4811-a9a3-81d202072653.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$

germanfgv commented 2 years ago

According to file g4/dupPickles.json, line 11157, these are the dup LFNs:

    "dup_lfns": [
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar
.gz",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar
.gz"
    ]

Back then, I also couldn't find them directly; that's why I only copied the logs for the DQMIO jobs. Maybe @amaltaro can clarify what this "dup_lfns" list means exactly.

khurtado commented 2 years ago

@germanfgv That's a good point. Looking at all the guids in the json for g4, I could only find duplicated guids for 2 DQMIO files.

[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep guid dupPickles.json | awk '{print $2}' | sort | uniq -c 2>&1| grep -v " 1"
      2 "CFACDE90-2C31-11EC-A88A-C5C08E80BEEF",
      2 "D372A94C-2C31-11EC-AA52-D0C08E80BEEF",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep CFACDE90-2C31-11EC-A88A-C5C08E80BEEF dupPickles.json | grep PFN
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep D372A94C-2C31-11EC-AA52-D0C08E80BEEF dupPickles.json | grep PFN
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$

Maybe what happens is that whenever there is a WMBS SQL error, dup_lfns lists all the LFNs present in the job parameters? E.g. for the SQL error below, it would list all LFNs from the bound parameters. @amaltaro?

[SQL: INSERT INTO wmbs_file_details (id, lfn, filesize, events,
                                            first_event, merged)
             VALUES (wmbs_file_details_SEQ.nextval, :lfn, :filesize, :events,
                     :first_event, :merged)]
[parameters: [{'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/92b873a5-603b-446c-b50e-4aebb8441650.root', 'filesize': 1436612, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/D960EA12-2C2C-11EC-B296-53878E80BEEF.root', 'filesize': 135147, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/0059b267-4a6a-4ce3-9313-1a008b695744.root', 'filesize': 382715936, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fa0ff51e-0a73-4568-85e9-01ca1ccb896c-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/be5d2ccc-6f8e-476d-921a-4aa925743197.root', 'filesize': 1479816, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/E4637F92-2C2C-11EC-A75F-040011ACBEEF.root', 'filesize': 135151, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/c4b38c53-0fc0-440d-a387-abba8c6d1017.root', 'filesize': 394714912, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fcb3597a-8a81-4f02-85e9-ffba73866d56-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}  ... displaying 10 of 64 total bound parameter sets ...  {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/44820b14-9c79-405e-9188-47ba12f864c5.root', 'filesize': 395093657, 'events': 2308, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-4b84d92e-b659-4339-a913-09fdc47dc356-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}]]

By the way, if that is the case, it would be good news: they already know how to fix DQMIO, which uses an old uuid algorithm, but they were surprised the others were also wrong, since those use a newer algorithm.
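
For reference, the duplicated DQMIO GUIDs quoted in this thread (e.g. CFACDE90-2C31-11EC-...) are version-1, time/node-based UUIDs, while the ALCARECO/RAW names (e.g. 98e014a9-8224-49b9-...) are version-4, random UUIDs, which is consistent with the "old vs new algorithm" distinction above. A short illustration of how to tell them apart:

```python
import uuid

# GUIDs taken from the job logs quoted earlier in this thread.
dqmio_guid = uuid.UUID("CFACDE90-2C31-11EC-A88A-C5C08E80BEEF")
alcareco_guid = uuid.UUID("98e014a9-8224-49b9-b5f9-5a77fca89a16")

# Version-1 uuids are derived from a timestamp and the node id;
# version-4 uuids are random.
print(dqmio_guid.version)     # 1
print(alcareco_guid.version)  # 4
```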

amaltaro commented 2 years ago

If I remember correctly (and the docstring is correct), this script loads the last X pickle reports referenced in the tail of the component log, lists the output files in those reports, and compares them against the files known to the WMBS tables. At some point I think I also added a check (from memory) for multiple job reports with the same output LFN.

Just in case, here is the source code: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py

If you spot any mistake, I would be glad to follow up and get it fixed :-D

khurtado commented 2 years ago

@amaltaro Here: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py#L48-L49

I think it assumes logFiles will yield unique pkl paths. However, if I look at a Tier0 ComponentLog example, I do see some pkls appearing more than once. E.g.:

[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ grep '/data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl'  ComponentLog
2022-03-12 03:15:56,422:139997933635328:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
2022-03-12 05:08:43,681:140397941163776:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl

Or:

[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ tail -n 500000  ComponentLog | grep 'install\/tier0\/JobCreator\/JobCache' | awk '{print  $3}' |  sort | uniq  -c | grep -v '1 '
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_11/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_12/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_13/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_15/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_19/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_20/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_22/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_23/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_24/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_25/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_27/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_28/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_29/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_30/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_31/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_33/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_34/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_372/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_373/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_375/Report.0.pkl

So I think that will sometimes make lfn2PklDict have one LFN with more than one pkl path that is in reality the same pkl path. And that is why, in such cases, all output files from the pkl path are shown (ALCARECO, DQMIO, RAW, log tarball).
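
A minimal sketch of the kind of guard discussed here (not necessarily what the actual commit does): deduplicate the report paths parsed from ComponentLog before building the LFN-to-pickle map, so a Report.0.pkl handled twice is not flagged as a duplicate LFN. The `lfns_per_pkl` callable is a hypothetical stand-in for reading the output LFNs out of a pickle report.

```python
from collections import defaultdict

def build_duplicate_lfn_map(pkl_paths, lfns_per_pkl):
    """Map each output LFN to the set of distinct pickle reports declaring it.
    `lfns_per_pkl` is a hypothetical callable returning the LFNs in a report."""
    lfn2pkl = defaultdict(set)
    for path in set(pkl_paths):          # drop repeated ComponentLog entries
        for lfn in lfns_per_pkl(path):
            lfn2pkl[lfn].add(path)
    # An LFN is truly duplicated only if it appears in more than one report.
    return {lfn: paths for lfn, paths in lfn2pkl.items() if len(paths) > 1}
```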

amaltaro commented 2 years ago

Thanks for this investigation, Kenyi.

I think you are right, and I have just pushed a commit to fix this issue: https://github.com/amaltaro/ProductionTools/commit/af153ac26bea0d6437e24227719338247826007c

@germanfgv next time you need to run it, please make sure to fetch the latest master/head version.

khurtado commented 2 years ago

Thank you @amaltaro ! @germanfgv @jhonatanamado Could you please submit another REPLAY to reproduce this error and check it with the new change in removeDupJobAccountant.py that Alan just pushed to GitHub?

This is to confirm whether we are only seeing issues with DQMIO or with more than that. If only DQMIO is affected, this should be an easy fix for cmssw, as they just need to make DQMIO use the new guid algorithm that the other modules are already using.

khurtado commented 2 years ago

@germanfgv @jhonatanamado Just wondering if you got the chance to try another replay. Let me know if you need any additional info on this matter.

jhonatanamado commented 2 years ago

Hello @khurtado. I'm deploying a new replay and will give you the new results asap.

jhonatanamado commented 2 years ago

Hello @khurtado, Kenyi, you will find two logs here: /afs/cern.ch/user/j/jamadova/public/WMCore/JobAccountant, namely the ComponentLog and the log of removeDupJobAccountant.py run with the changes proposed by Alan. Let me know if you need more info.

khurtado commented 2 years ago

Hi @jhonatanamado, thanks! So, if I understand correctly:

Found 406 unique pickle files to parse with a total of 319 output files and 1 duplicated files to process among them.
Duplicate files are:
['/store/unmerged/data/Tier0_REPLAY_2022/StreamCalibration/DQMIO/Express-v5/000/345/755/00000/2AA061C6-AB32-11EC-ADE7-B9C08E80BEEF.root']
See dupPickles.json for further details ...
Can we automatically delete those pickle files? Y/N
Y
Deleting /data/tier0/srv/wmagent/3.0.3/install/tier0/JobCreator/JobCache/Express_Run345755_StreamCalibration_Tier0_REPLAY_2022_ID220324053350_v5_220324_0535/Express/JobCollection_1_0/job_906/Report.0.pkl ...
  Done!

Now loading all LFNs from wmbs_file_details ...
Retrieved 60594 lfns from wmbs_file_details

Only 1 DQMIO file was found, right? @amaltaro Do you think we need more tests? EDIT: For the record, after talking to Alan, we are considering DQMIO the only issue now. If we spot issues with other modules in the future, we can open another issue with cmssw.

jhonatanamado commented 2 years ago

Hi @khurtado, yes, JobAccountant hit this issue with that file. I only deployed the replay and let it hit this first exception. The replay could find more duplicate files, as we are used to seeing. I only posted the first exception because we run a cronjob on all the machines (including the production agent) that restarts this component periodically. Do you want a full replay, to check which other files are affected after the duplicate file is deleted and the component restarted?

khurtado commented 2 years ago

@jhonatanamado I have already asked cmssw to fix the DQMIO issue. As things stand with the current tests, it seems that is the only affected output module, so let's wait for that fix, and if you spot more duplicated LFNs from other modules in the future, let us know.

germanfgv commented 2 years ago

@khurtado So far I have not been able to find examples of duplicate files other than DQMIO, neither in replays nor in production. I have records of around 20 JobAccountant duplicate-file errors, and all of them were DQMIO files.

drkovalskyi commented 2 years ago

Hi guys, we need to fix this asap. It affects Tier0 operations and detector commissioning. @khurtado, could you please point me to an issue that can be tracked with the CMSSW release managers, where you requested the problem to be fixed? If it was private communication, whom did you contact and what is the expectation for when it will be fixed?

khurtado commented 2 years ago

@drkovalskyi Yes, here it is: https://github.com/cms-sw/cmssw/issues/37240

drkovalskyi commented 2 years ago

Thanks Kenyi.

khurtado commented 2 years ago

@germanfgv @drkovalskyi : Was this fixed with https://github.com/cms-sw/cmssw/issues/37240 ? Can this issue be closed or is there anything needed from WMCore?

EDIT: It was reported during the WMChat meeting that no new occurrences have been seen since the fix, so closing this ticket.