dmwm / WMCore

Core workflow management components for CMS.

Duplicate lumisection in single workflow #7371

Closed: vlimant closed this issue 5 years ago

vlimant commented 8 years ago

For this workflow, which is the only one writing into its output datasets:

pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417

event completion real 1999031 expected 2000000
event completion real 1997739 expected 2000000

pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417 has duplicates

    {
      "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": false,
      "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": true
    }

    {
      "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": {},
      "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": {
        "0:7592": [
          "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
          "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
        ],
        "0:7591": [
          "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
          "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
        ]
      }
    }

This is the second such instance in not very long, so there must be something wrong. @prozober
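For reference, the duplicate report above maps each output dataset to a dict whose keys look like "run:lumi" pairs (e.g. "0:7591"), each pointing at the list of files that contain that lumisection; a lumi counts as duplicated when more than one file claims it. Below is a minimal sketch of that check; the function name and the placeholder LFNs are illustrative assumptions, not the actual WMCore code.

    # Sketch: flag lumisections that are claimed by more than one output file.
    # Input shape mirrors the report above: {"run:lumi": [lfn, lfn, ...]}.
    def find_duplicate_lumis(lumis_to_files):
        """Return the run:lumi keys that appear in more than one file."""
        return {run_lumi: files
                for run_lumi, files in lumis_to_files.items()
                if len(files) > 1}

    # Placeholder LFNs; the real report lists the two LHE files shown above.
    report = {
        "0:7591": ["/store/unmerged/fileA.root", "/store/unmerged/fileB.root"],
        "0:7592": ["/store/unmerged/fileA.root", "/store/unmerged/fileB.root"],
    }
    print(find_duplicate_lumis(report))  # both lumis appear in two files

A dataset is then flagged as having duplicates (the true/false entries in the first block above) when this result is non-empty.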

ticoann commented 8 years ago

> Ok, but one of these merge jobs had an input that was an output of one of the files in 1), and there is a match in file name but a mismatch in lumis?

Yes, that is correct. The FJR for the unmerged file contains lumis [7499, 7500], but in the merge job's FJR the same file is reported with lumis [7591, 7599].

> If so, sometime between the job in 1) writing its output and the merge job in 2) reading its input, the content of the file changed. Which is a bit scary... Apart from the name, does the other metadata match? Like the number of events written by the job in 1) and the number of events read by the merge job in 2)?

Yes, the event count matches. But this is MC, and by the definition of the splitting algorithm basically all the jobs have the same number of events. The reason I suspected a race condition was that the unmerged file from the other job, the one which contains [7499, 7500], is located in the same directory. I was confused and thought cmssw created the generically named file directly in that directory and WMAgent renamed it, which doesn't seem to be possible, as Dirk pointed out.

Either

  1. something overwrote the initial unmerged file before the merge job ran, or
  2. the FWJR has wrong information about the (unmerged) input file in the merge job.

I would think case 1 is more plausible, but I am not sure how that could happen.

@Dr15Jones, Chris told me that we can actually check which lumis the file contains by examining the actual file, right?

The following is the merged file for the case where the lumis mismatch between unmerged and merged. Can we check whether it actually contains [7591, 7599] or [7499, 7500]?

/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

Dr15Jones commented 8 years ago

On Nov 14, 2016, at 10:24 PM, ticoann notifications@github.com wrote:

> Either
>
>   1. something overwrote the initial unmerged file before the merge job ran, or
>   2. the FWJR has wrong information about the (unmerged) input file in the merge job. I would think case 1 is more plausible, but not sure how that could happen.

The log file for the merge job and the FWJR both say the file being read has the unexpected lumi blocks.

> @Dr15Jones, Chris told me that we can actually check which lumis the file contains by examining the actual file, right?
>
> The following is the merged file for the case where the lumis mismatch between unmerged and merged. Can we check whether it actually contains [7591, 7599] or [7499, 7500]?
>
> /store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

One can get the list of Run/Lumi/Events in the file by doing

edmFileUtil -e /store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

However, my attempt to do so gives this error:

    Error in TNetXNGFile::Open: [ERROR] Server responded with an error: [3011] Unable to open file /eos/cms/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root; No such file or directory
    ERR Could not open file root://eoscms.cern.ch//eos/cms/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

Dr15Jones commented 8 years ago

Hi everyone,

On Nov 15, 2016, at 10:08 AM, Chris Jones ChrisDJones15@gmail.com wrote:

> @Dr15Jones, Chris told me that we can actually check which lumis the file contains by examining the actual file, right?
>
> The following is the merged file for the case where the lumis mismatch between unmerged and merged. Can we check whether it actually contains [7591, 7599] or [7499, 7500]?
>
> /store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root
>
> One can get the list of Run/Lumi/Events in the file by doing
>
> edmFileUtil -e /store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

edmFileUtil -e root://cmsxrootd-site.fnal.gov//store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root | grep '(Lumi)' >& lumis.log

I was eventually able to get this to work from CERN. The relevant bit of the output is below:

          1           7496              0            689 (Lumi)
          1           7497              0            690 (Lumi)
          1           7498              0            691 (Lumi)
          1           7591              0            692 (Lumi)
          1           7592              0            693 (Lumi)
          1           7501              0            694 (Lumi)
          1           7502              0            695 (Lumi)

The second column is the LuminosityBlock number and the fourth column is the index in the file at which that lumi is found. The workflow management system passes the files to merge in what it thinks is increasing LuminosityBlock order. You can see that 7591 and 7592 are not in the proper order, because the workflow system thought the input file being read contained luminosity blocks 7499 and 7500, which it obviously does not.
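To make that ordering check easy to repeat, here is a small sketch that parses the '(Lumi)' lines captured in lumis.log (column layout as described above: run, lumi, event, entry index) and reports any LuminosityBlock that appears out of increasing order. The file name and the exact parsing are assumptions based on the output shown above, not an official tool.

    # Sketch: flag LuminosityBlocks that are out of increasing order in the
    # edmFileUtil output saved to lumis.log (see the grep command above).
    def out_of_order_lumis(path="lumis.log"):
        lumis = []
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 5 and parts[-1] == "(Lumi)":
                    # columns: run, lumi, event, index-in-file, "(Lumi)"
                    lumis.append(int(parts[1]))
        return [(idx, lumi) for idx, lumi in enumerate(lumis)
                if idx > 0 and lumi < lumis[idx - 1]]

    print(out_of_order_lumis())

On the snippet above this would flag 7501, since it follows 7592 even though the merge was supposed to see monotonically increasing lumis.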

hufnagel commented 8 years ago

Ok, this is a bit scary. Question: were both the "fake" lumi 7499/7500 file and the 7591/7592 file located at the same site? I.e., is it possible the local storage system gave us the wrong file? That's about the only thing I can think of. According to all the records we can access from our end, it really looks like we wrote file A from job 1 and then read file A from job 2, except it wasn't the same file but a different file B, even though we specified file A in the job config.

Everyone agree with this interpretation ?

If this was indeed at the same site, is there any chance to ask the site for logs about the two files (creation, access, deletion, general metadata)?

ticoann commented 8 years ago

> Everyone agree with this interpretation ?

Or the initial unmerged file is overwritten after it is created and before the merge job runs, somehow, somewhere. 😄

hufnagel commented 8 years ago

> Everyone agree with this interpretation ?
>
> Or the initial unmerged file is overwritten after it is created and before the merge job runs, somehow, somewhere. 😄

Not a much better option :-). Either the filesystem gave us the wrong file, or someone (not us, from what we can tell) overwrote the file with another file's content...

Either way, the only way I can see to even have a hope of understanding this is from the mass storage logs (if there are any).

Dr15Jones commented 8 years ago

Seems to me like we should find out all we can about the jobs which made the original files

ticoann commented 8 years ago

I have a couple of questions as well.

  1. Can a pilot run multiple jobs when the pilot lifetime is long enough, i.e. can two jobs be running there? And if 1. is the case:
  2. Are the jobs running concurrently in the pilot?
  3. Are the workspaces shared between the jobs?

ericvaandering commented 8 years ago

On Nov 16, 2016, at 09:47, ticoann notifications@github.com wrote:

> I have a couple of questions as well.
>
>   1. Can a pilot run multiple jobs when the pilot lifetime is long enough, i.e. can two jobs be running there?

Yes

> And if 1. is the case:
>
>   2. Are the jobs running concurrently in the pilot?
>   3. Are the workspaces shared between the jobs?
>   4. Is the workspace shared?

2 - yes, multiple jobs per pilot

3 and 4 - no, should not be


Dr15Jones commented 8 years ago

> 4. Is the workspace shared?
>
> 3 and 4 - no, should not be

I guess then what is the exact mechanism used to separate them?

hufnagel commented 8 years ago

The workspace is separate for each payload. The pilot creates a different working directory per payload.

Dr15Jones commented 8 years ago

But how does the pilot name those directories?

hufnagel commented 8 years ago

For example, /home/glidein_pilot/glide_iHWb9z/execute/dir_25339 (a Tier0 job). The last level is payload specific; there are three of them in this pilot, for the three currently running jobs.

hufnagel commented 8 years ago

And the CMSSW working directory is

/home/glidein_pilot/glide_iHWb9z/execute/dir_25339/job/WMTaskSpace/cmsRun1

Every step has its own working directory.

Dr15Jones commented 8 years ago

Assuming what comes after dir_ is the job ID, what if the job ID wrapped back around during processing?

hufnagel commented 8 years ago

It's not the job ID, it's some internal pilot counter. GlideinWMS has no idea what is in the payload. And it doesn't wrap around.

hufnagel commented 8 years ago

Anyway, this is checkable. The job's condor logs, or maybe even the logArchive, will contain the working directory for the job.
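As a rough illustration of that check, the sketch below scans a job's logArchive tarball for any mention of the pilot execute directory, so the working directory a payload actually ran in can be compared across the two suspect jobs. The tarball name and the execute/dir_ pattern are assumptions based on the paths quoted above, not a documented WMCore interface.

    # Sketch: search a logArchive tarball for pilot working-directory paths
    # of the form .../glide_XXXX/execute/dir_NNNNN seen in the job logs.
    import re
    import tarfile

    def find_working_dirs(archive="logArchive.tar.gz"):
        pattern = re.compile(r"\S*/execute/dir_\d+\S*")
        hits = set()
        with tarfile.open(archive, "r:*") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                text = tar.extractfile(member).read().decode("utf-8", "replace")
                hits.update(pattern.findall(text))
        return sorted(hits)

    print(find_working_dirs())

If the jobs that wrote and merged the file turn out to have run in the same pilot, comparing these paths (together with the site's storage logs) would show whether their workspaces could have collided.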