Closed germanfgv closed 2 years ago
Possibly related to (if not the same root cause): https://github.com/dmwm/WMCore/issues/9633
@amaltaro From what I understand, in #9633 duplicate files are created with a default name and that causes the JobAccountant to crash. In our case, however, that doesn't seem to be happening. For example, for jobs in /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g1/, the duplicate file is:
/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/86A2833A-2C31-11EC-AF9E-D0C08E80BEEF.root
Could this still have the same root cause?
Hi @german,
It could still be related to #9633 , even though the duplicate filename is not due to some defaults as in the issue above. I hope we can get back to it during this week. Do you still have some logs in a machine which is experiencing it? I am not sure how easily reproducible the error is.
It is fairly easy to reproduce; it appears in every replay we run. Just tell me what you need, and I'm sure I can provide it.
In order to fix it, I think we need to start making the uuid a function of:
It should be possible to adapt it with the uuid5 algorithm. More info at: https://docs.python.org/3/library/uuid.html
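A minimal sketch of that idea, assuming hypothetical job attributes (the namespace constant and the attribute names below are illustrative, not actual WMCore API):

```python
import uuid

# Hypothetical fixed namespace for WM output files; any constant UUID works.
WM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "wmcore.output-file")

def make_file_guid(workflow, job_id, module_name):
    """Derive a deterministic GUID from job-specific attributes.

    uuid5 is a SHA-1 of namespace+name, so two different jobs can never
    collide, and re-running the same job reproduces the same name.
    """
    name = "%s/%s/%s" % (workflow, job_id, module_name)
    return str(uuid.uuid5(WM_NAMESPACE, name)).upper()

# Distinct jobs yield distinct, stable GUIDs
g1 = make_file_guid("Express_Run343082", 1169, "ALCARECOoutput")
g2 = make_file_guid("Express_Run343082", 1170, "ALCARECOoutput")
assert g1 != g2
assert g1 == make_file_guid("Express_Run343082", 1169, "ALCARECOoutput")
```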
@khurtado this one is tightly coupled to https://github.com/dmwm/WMCore/issues/9011 , and I think that the fix you will propose will actually close this and #9011 issues. I assigned this one to you and moved to Work in progress as well.
Thanks Alan!
Here are my findings:
At runtime, we make a PSet tweak to change the name of the output files based on the output modules:
lfn = "%s/%s/%s.root" % (lfnBase, lfnGroup(job), modName)
result.addParameter("process.%s.logicalFileName" % modName, lfn)
So, the output files have a pattern like:
/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/RAWSIMoutput.root
After cmsRun is done, we change the lfn of the file, and with that the filename at stageout. So the file then looks like:
/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/2AE85F14-94A1-EC11-BBF5-FA163EC7AA59.root
But here is the thing: we don't set the uuid for the filename. We basically grab the GUID from the generated Framework XML job report here:
The guid generated by FWCore: https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/Guid.cc#L18-L28
And we can't just change the file lfn to use our own uuid for the filename, since we also enforce and check the guid in the filename using this utility:
https://github.com/cms-sw/cmssw-wm-tools/blob/master/bin/cmssw_enforce_guid_in_filename.py
So it seems the best solution here is to report the issue to cmssw and have it fixed on that end.
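For illustration, a simplified sketch of the kind of consistency check the linked utility performs (the function name and regex here are mine, not the actual cmssw-wm-tools code):

```python
import os
import re

# A filename GUID has the canonical 8-4-4-4-12 hex layout.
GUID_RE = re.compile(r"^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$")

def guid_matches_filename(lfn, guid):
    """Check that the basename of the LFN is exactly the file's GUID."""
    stem, ext = os.path.splitext(os.path.basename(lfn))
    return (ext == ".root"
            and bool(GUID_RE.match(stem))
            and stem.upper() == guid.upper())
```

With a check like this in the stack, renaming the file on the WM side without also rewriting the GUID in the file metadata would make the job fail validation.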
@germanfgv Do you happen to have, or know anyone with, direct access to the T2_CH_CERN worker nodes? Specifically, the nodes from here:
[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "condor*out" -exec grep -H Hostname {} \; | sort
./g1/job_1038/condor.41731.37.out:Hostname: b7g18p9798.cern.ch
./g1/job_1048/condor.41731.47.out:Hostname: b7g18p9798.cern.ch
./g2/job_98/condor.41726.97.out:Hostname: b7g18p7310.cern.ch
./g2/job_99/condor.41726.98.out:Hostname: b7g18p7310.cern.ch
./g3/job_1137/condor.41731.136.out:Hostname: b7g17p4406.cern.ch
./g3/job_1138/condor.41731.137.out:Hostname: b7g17p4406.cern.ch
./g4/job_1169/condor.41731.168.out:Hostname: b7g18p3673.cern.ch
./g4/job_1170/condor.41731.169.out:Hostname: b7g18p3673.cern.ch
./g5/job_527/condor.41729.133.out:Hostname: b7g10p4995.cern.ch
./g5/job_530/condor.41729.136.out:Hostname: b7g10p4995.cern.ch
./g5/job_531/condor.41729.137.out:Hostname: b7g10p4995.cern.ch
./g6/job_4623/condor.41780.47.out:Hostname: b7g17p1733.cern.ch
./g6/job_4635/condor.41780.59.out:Hostname: b7g17p1733.cern.ch
CMSSW is asking us to check the availability of /dev/urandom and the contents of /proc/sys/kernel/random/entropy_avail. I still don't know whether those change inside containers, but it should be trivial to check after invoking singularity.
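A small stdlib-only sketch that reports both values, which could be run inside and outside the singularity container to compare:

```python
import os

def check_randomness_sources():
    """Report on the two kernel randomness sources cmssw asked about."""
    info = {"urandom_available": os.path.exists("/dev/urandom")}
    try:
        # Linux-only pseudo-file; absent on other platforms.
        with open("/proc/sys/kernel/random/entropy_avail") as fobj:
            info["entropy_avail"] = int(fobj.read().strip())
    except (OSError, ValueError):
        info["entropy_avail"] = None
    return info

print(check_randomness_sources())
```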
@germanfgv From your logfiles on:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs
Would you be able to get a pair of job logfiles with outputs other than DQMIO? For example, your g3 or g4 directories do have RAW and ALCARECO in the list of duplicated files in the json, but only the DQMIO job logfiles are present.
The cmssw folks are requesting this on: https://github.com/cms-sw/cmssw/issues/37240
To summarize: It looks like the DQMIO issue is understood, but they need more information for the ALCARECO, RAW, etc.
@khurtado please check the logs that you can find here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4/job_1106
I'll check if we have more logs. If not, we can try and generate more examples.
Hi @germanfgv. So, for g4, I can see the duplicated DQMIO file from jobs 1169 and 1170, but not 1106:
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep DQMIO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root
I couldn't find duplicated names for ALCARECO and RAW though; am I perhaps looking at this the wrong way?
# ALCARECO
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep ALCARECO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/98e014a9-8224-49b9-b5f9-5a77fca89a16.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/b0e5793e-1d8e-4a04-a5c4-f68ea475692e.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root
# RAW
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep RAW
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/40ec5397-771e-4478-91d0-45e7c63aec5d.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/b22d070b-8f19-4811-a9a3-81d202072653.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$
According to file g4/dupPickles.json, line 11157, these are the dup LFNs:
"dup_lfns": [
"/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
"/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
"/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
"/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
"/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
"/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
"/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar
.gz",
"/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar
.gz"
]
Back in the day, I also couldn't find them directly; that's why I only copied the logs for the DQM jobs. Maybe @amaltaro can clarify what this "dup_lfns" list means exactly.
@germanfgv That's a good point. Looking at all the guids in the json for g4, I could only find duplicated guids for 2 DQMIO files.
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep guid dupPickles.json | awk '{print $2}' | sort | uniq -c 2>&1| grep -v " 1"
2 "CFACDE90-2C31-11EC-A88A-C5C08E80BEEF",
2 "D372A94C-2C31-11EC-AA52-D0C08E80BEEF",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep CFACDE90-2C31-11EC-A88A-C5C08E80BEEF dupPickles.json | grep PFN
"OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
"OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep D372A94C-2C31-11EC-AA52-D0C08E80BEEF dupPickles.json | grep PFN
"OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
"OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$
Maybe what happens is that, whenever there is a SQL WMBS error, dupLFNs lists all the LFNs from the job parameters? E.g., in the SQL error below, all LFNs from the bound parameters would be listed. @amaltaro?
[SQL: INSERT INTO wmbs_file_details (id, lfn, filesize, events,
first_event, merged)
VALUES (wmbs_file_details_SEQ.nextval, :lfn, :filesize, :events,
:first_event, :merged)]
[parameters: [{'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/92b873a5-603b-446c-b50e-4aebb8441650.root', 'filesize': 1436612, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/D960EA12-2C2C-11EC-B296-53878E80BEEF.root', 'filesize': 135147, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/0059b267-4a6a-4ce3-9313-1a008b695744.root', 'filesize': 382715936, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fa0ff51e-0a73-4568-85e9-01ca1ccb896c-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/be5d2ccc-6f8e-476d-921a-4aa925743197.root', 'filesize': 1479816, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/E4637F92-2C2C-11EC-A75F-040011ACBEEF.root', 'filesize': 135151, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/c4b38c53-0fc0-440d-a387-abba8c6d1017.root', 'filesize': 394714912, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fcb3597a-8a81-4f02-85e9-ffba73866d56-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0} ... displaying 10 of 64 total bound parameter sets ... 
{'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/44820b14-9c79-405e-9188-47ba12f864c5.root', 'filesize': 395093657, 'events': 2308, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-4b84d92e-b659-4339-a913-09fdc47dc356-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}]]
By the way, if that is the case, it would be good news: they already know how to fix DQMIO, which uses an old uuid algorithm, but they were surprised the others were also wrong, since those use a new algorithm.
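Incidentally, the two algorithms can be told apart from the GUIDs themselves: the UUID version nibble (first hex digit of the third group) is 1 for the time-based scheme behind the DQMIO names and 4 for the random scheme behind the others. Checking GUIDs taken from the job logs above:

```python
import uuid

def uuid_version(guid):
    # The version lives in the first hex digit of the third group.
    return uuid.UUID(guid).version

# DQMIO name: version 1 (time-based, explains same-host collisions)
assert uuid_version("CFACDE90-2C31-11EC-A88A-C5C08E80BEEF") == 1
# ALCARECO name: version 4 (random)
assert uuid_version("98e014a9-8224-49b9-b5f9-5a77fca89a16") == 4
```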
If I remember correctly - and the docstring is correct - this script loads the last X pickle reports mentioned in the tail of the component log, lists the output files in those reports, and compares them against the files known to the WMBS tables. At some point I think I also added a check for multiple job reports - from memory - with the same output LFN.
Just in case, here is the source code: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py
If you spot any mistake, I would be glad to follow up and get it fixed :-D
@amaltaro Here: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py#L48-L49
I think it's assuming logFiles will yield unique pkl paths. However, if I look into a Tier0 ComponentLog example, I do see some pkls showing up more than once. E.g.:
[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ grep '/data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl' ComponentLog
2022-03-12 03:15:56,422:139997933635328:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
2022-03-12 05:08:43,681:140397941163776:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
Or:
[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ tail -n 500000 ComponentLog | grep 'install\/tier0\/JobCreator\/JobCache' | awk '{print $3}' | sort | uniq -c | grep -v '1 '
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_11/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_12/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_13/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_15/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_19/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_20/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_22/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_23/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_24/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_25/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_27/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_28/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_29/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_30/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_31/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_33/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_34/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_372/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_373/Report.0.pkl
2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_375/Report.0.pkl
So I think that will make lfn2PklDict sometimes have one lfn with more than one pkl path, where in reality it is the same pkl path twice. And that is why, in such cases, all output files from that pkl path are shown (ALCARECO, DQMIO, RAW, log tarball).
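A sketch of the fix idea, deduplicating the pickle paths while parsing the log, before any lfn map is built (simplified; the real parsing lives in the linked removeDupJobAccountant.py, and the log format below follows the ComponentLog lines quoted above):

```python
def collect_unique_pickles(log_lines):
    """Extract 'Handling <path>.pkl' entries, skipping re-handled reports."""
    seen = set()
    pkl_paths = []
    for line in log_lines:
        if "Handling " in line and line.endswith(".pkl"):
            path = line.rsplit("Handling ", 1)[1]
            if path not in seen:  # same report can appear twice in the log
                seen.add(path)
                pkl_paths.append(path)
    return pkl_paths
```

With the duplicates filtered out, an lfn seen in two genuinely different reports still shows up as a real duplicate, while a report merely logged twice no longer does.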
Thanks for this investigation, Kenyi.
I think you are right and I have just pushed in a commit to fix this issue: https://github.com/amaltaro/ProductionTools/commit/af153ac26bea0d6437e24227719338247826007c
@germanfgv next time you need to run it, please make sure to fetch the latest master/head version.
Thank you @amaltaro ! @germanfgv @jhonatanamado Could you guys please submit another REPLAY to reproduce this error and check with the new change in the removeDupJobAccountant.py that Alan just pushed on github?
This is to confirm whether we are only seeing issues with DQMIO or with more than that. If only DQMIO shows up, this should be an easy fix for cmssw, as they just need to make DQMIO point to the new guid algorithm that the other modules are already using.
@germanfgv @jhonatanamado Just wondering if you got the chance to try another replay. Let me know if you need any additional info on this matter.
Hello @khurtado. I'm deploying a new replay and will give you the new results asap.
Hello @khurtado ,
Kenyi, you will find two logs here: /afs/cern.ch/user/j/jamadova/public/WMCore/JobAccountant
The ComponentLog and the log of removeDupJobAccountant.py with the changes proposed by Alan.
Let me know if you need more info.
Hi @jhonatanamado, thanks! So, if I understand correctly:
Found 406 unique pickle files to parse with a total of 319 output files and 1 duplicated files to process among them.
Duplicate files are:
['/store/unmerged/data/Tier0_REPLAY_2022/StreamCalibration/DQMIO/Express-v5/000/345/755/00000/2AA061C6-AB32-11EC-ADE7-B9C08E80BEEF.root']
See dupPickles.json for further details ...
Can we automatically delete those pickle files? Y/N
Y
Deleting /data/tier0/srv/wmagent/3.0.3/install/tier0/JobCreator/JobCache/Express_Run345755_StreamCalibration_Tier0_REPLAY_2022_ID220324053350_v5_220324_0535/Express/JobCollection_1_0/job_906/Report.0.pkl ...
Done!
Now loading all LFNs from wmbs_file_details ...
Retrieved 60594 lfns from wmbs_file_details
Only 1 DQMIO was found, right? @amaltaro Do you think we need more tests? EDIT: For the record, after talking to Alan, we are considering DQMIO the only issue now. If we spot issues with other modules in the future, we can create another issue with cmssw.
Hi @khurtado, yes, JobAccountant starts with this issue on that file. I only deployed the replay and let it hit this first exception. The replay could find more duplicate files, as we usually see. I only posted the first exception because we are running a cronjob on all the machines (including the production agent) that restarts this component periodically. Do you want a full replay, to check which other files are affected after the duplicate file is deleted and the component restarted?
@jhonatanamado I have already asked cmssw to fix the DQMIO issue. As things stand with the current tests, it seems that is the only output module with a problem, so let's wait for that to be fixed, and if you spot more duplicated LFNs from other modules in the future, let us know.
@khurtado So far I have not been able to find examples of duplicate files other than DQMIO, neither in replays nor in production. I have checked around 20 JobAccountant duplicate file errors, and all of them were DQMIO files.
Hi guys, we need to fix this asap; it affects Tier0 operations and detector commissioning. @khurtado, could you please point me to an issue that can be tracked with the CMSSW release managers, where you requested the problem to be fixed? If it was a private communication, who did you contact, and what is the expectation for when it will be fixed?
@drkovalskyi Yes, here it is: https://github.com/cms-sw/cmssw/issues/37240
Thanks Kenyi.
@germanfgv @drkovalskyi : Was this fixed with https://github.com/cms-sw/cmssw/issues/37240 ? Can this issue be closed or is there anything needed from WMCore?
EDIT: It was reported during the WMChat meeting that no new occurrences have been seen since the fix, so closing this ticket.
Impact of the bug T0Agent
Describe the bug Two or more jobs create output files with the same name. JobAccountant tries to add them to the DB and fails due to an
ORA-00001: unique constraint (WMBS_FILDETAILS_UNIQUE) violated
We have seen the issue affecting Express jobs, but it may be affecting Repack and PromptReco too.
How to reproduce it Deploy a Tier0 replay with a significant number of jobs. As soon as the first batches of Express jobs finish, the JobAccountant will crash.
Expected behavior Each job should create files with unique names
Additional context and error message Here you can find the full error message of the component ComponentLog.txt
Full JobAccountant logs can be found here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/JobAccountantLogs
Logs of a set of jobs generating files with the same name:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs