dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Model for secondary data placement #9745

Closed amaltaro closed 4 years ago

amaltaro commented 4 years ago

Impact of the new feature MSTransferor

Is your feature request related to a problem? Please describe. Yes, it's related to this JIRA ticket: https://its.cern.ch/jira/browse/CMSCOMPPR-12263

In short, depending on the workflow setup, we might end up with unprocessable GQEs due to the lack of intersection between the primary and secondary data locations. The primary dataset gets distributed on a block basis all over the SiteWhitelist (sites with quota available), while the PU gets distributed at the dataset level, and we only care about having at least 1 replica of it.

Describe the solution you'd like It isn't clear what the best approach is. We could subscribe the pileup data on a block basis too, but then we might compromise the physics content of the samples if too many jobs execute at a site holding only a small fraction of it, since the same PU events would be reused much more often than if the whole dataset were available.

Another possibility would be to make multiple replicas of the pileup dataset.

Yet another possibility would be to transfer the primary data only to places holding (part of) the pileup dataset.

Or perhaps we could overwrite the Trust flags post-assignment, to make sure that data can be pulled by the agents and read remotely.

Any other options that I haven't thought of yet?

UPDATE (post discussion): I think this will be the way to go. Make sure the PU dataset is available at at least one site within the SiteWhitelist, and put ALL the primary/parent blocks at the same location(s) that contain the PU. Workflows will take longer to run, but it's probably better than replicating 100TB or more of classic PU...
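
In pseudo-Python, a minimal sketch of that model (function and argument names are made up for illustration; this is not the MSTransferor code):

    def place_with_pileup(site_whitelist, pileup_sites, primary_blocks, parent_blocks=None):
        """Illustrative only: restrict primary/parent block placement to the
        whitelisted sites that already hold the pileup dataset."""
        common = set(site_whitelist) & set(pileup_sites)
        if not common:
            # PU not available within the SiteWhitelist: a replica of the pileup
            # dataset would have to be subscribed to one of those sites first
            raise RuntimeError("Pileup dataset not available within the SiteWhitelist")
        # every primary/parent block goes to the same location(s) holding the PU
        return {block: common for block in list(primary_blocks) + list(parent_blocks or [])}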

Describe alternatives you've considered A few are listed above.

Additional context https://its.cern.ch/jira/browse/CMSCOMPPR-12263

amaltaro commented 4 years ago

@nsmith- Nick, just in case you have any comments on this issue or another suggestion on how to deal with it.

todor-ivanov commented 4 years ago

Let me add some more info here (coming from another workflow [1], which may be considered a subclass of the same type of workflows as the one above). The one mentioned earlier by Alan needs a non-null intersection of the following three sets: the SiteWhitelist, the primary input data location and the pileup (secondary) data location.

The workflow from [1] starts from GEN and has NO input for the first step, as defined here [2], which reduces the required intersection to only the SiteWhitelist and the pileup data location.

This basically allowed the workflow to start running because, obviously, the conditions for Step1 had been met [3]. But once the workflow progressed to Step3, where the PU was actually needed [2], all the jobs started failing [4], except the few that were actually running at T1_US_FNAL_Disk.

Another question to be brought up here: the sites holding a full copy of the PU on disk are T1_UK_RAL_Disk and T1_US_FNAL_Disk. While for T1_US_FNAL a redirection to T1_US_FNAL_Disk happened and a few jobs managed to run there, no such redirection happened for T1_UK_RAL. @amaltaro Which is the correct behavior here?

[1] https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_LUM-RunIISummer19UL16GEN-00003

[2] https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_LUM-RunIISummer19UL16GEN-00003__v1_T_200528_192314_7255

...
"Step1": {
    "ParentDset": null,
    "ChildDsets": [
        "/SingleNeutrino/RunIISummer19UL16GEN-106X_mcRun2_asymptotic_v3_ext1-v2/GEN"
    ]
}
...
"Step3": {
    "KeepOutput": false,
    "MCPileup": "/MinBias_TuneCP5_13TeV-pythia8/RunIISummer19UL16SIM-106X_mcRun2_asymptotic_v3-v1/GEN-SIM",
...

[3] From DAS: query="site dataset=/MinBias_TuneCP5_13TeV-pythia8/RunIISummer19UL16SIM-106X_mcRun2_asymptotic_v3-v1/GEN-SIM"

Site name: T2_US_UCSD
Block completion: 100.00% Block presence: 8.79% Dataset presence: 6.70% File-replica presence: 6.70% Site type: DISK StorageElement: bsrm-3.t2.ucsd.edu

Site name: T2_US_MIT
Block completion: 100.00% Block presence: 12.09% Dataset presence: 16.11% File-replica presence: 16.11% Site type: DISK StorageElement: se01.cmsaf.mit.edu

Site name: T1_US_FNAL_MSS
Block completion: 100.00% Block presence: 100.00% Dataset presence: 100.00% File-replica presence: 100.00% Site type: TAPE no user access StorageElement: cmsdcatape01.fnal.gov

Site name: T1_US_FNAL_Disk
Block completion: 100.00% Block presence: 100.00% Dataset presence: 100.00% File-replica presence: 100.00% Site type: DISK StorageElement: cmsdcadisk01.fnal.gov

Site name: T1_US_FNAL_Buffer
Block completion: 100.00% Block presence: 100.00% Dataset presence: 100.00% File-replica presence: 100.00% Site type: TAPE no user access StorageElement: cmsdcatape01.fnal.gov

Site name: T1_UK_RAL_Disk
Block completion: 100.00% Block presence: 100.00% Dataset presence: 100.00% File-replica presence: 100.00% Site type: DISK StorageElement: gridftp.echo.stfc.ac.uk

From Workflow definition:

"SiteWhitelist": [
    "T1_IT_CNAF", 
    "T2_DE_DESY", 
    "T2_US_Purdue", 
    "T2_FR_GRIF_LLR", 
    "T2_DE_RWTH", 
    "T2_FR_IPHC", 
    "T1_ES_PIC", 
    "T1_UK_RAL", 
    "T1_US_FNAL", 
    "T2_IT_Legnaro", 
    "T2_US_Caltech", 
    "T2_UK_London_Brunel", 
    "T2_IT_Pisa", 
    "T1_DE_KIT", 
    "T1_FR_CCIN2P3", 
    "T2_US_Florida", 
    "T2_FR_GRIF_IRFU", 
    "T2_UK_London_IC", 
    "T2_IT_Bari", 
    "T2_US_Nebraska", 
    "T2_FR_CCIN2P3", 
    "T2_US_UCSD", 
    "T2_ES_CIEMAT", 
    "T1_RU_JINR", 
    "T2_US_Wisconsin", 
    "T2_US_MIT", 
    "T2_BE_IIHE", 
    "T2_CH_CERN"
  ]

[4]

    cmsRun3
        Fatal Exception (Exit Code: 8029)

            An exception of category 'NoSecondaryFiles' occurred while
               [0] Constructing the EventProcessor
               [1] Constructing module: class=MixingModule label='mix'
            Exception Message:
            RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro commented 4 years ago

Thanks for providing another use case where this issue gets exposed, Todor. From the code, I expected your example to be covered in the agent, with jobs created at the correct intersection of locations, as done here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueue.py#L433
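
In very reduced form, the intersection I would expect to kick in there is roughly the following (just the idea, not the actual WorkQueue code):

    def possible_sites(site_whitelist, block_location, pileup_location):
        """Sites where jobs for a given primary block can be created: whitelisted,
        holding the block, and holding (part of) the pileup."""
        return set(site_whitelist) & set(block_location) & set(pileup_location)

    # for a GEN workflow with no primary input (Todor's case), the block_location
    # constraint disappears and only SiteWhitelist & pileup location remains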

I'm going to run a few tests in preprod.

sharad1126 commented 4 years ago

@amaltaro Due to a lot of IO, we don't do xrootd for MinBias as it kills the xrootd network (classical mixing, if I am not wrong). So, in case you plan to change the TrustPUSitelist flag after assignment, make sure it doesn't cause us any such issues.

vlimant commented 4 years ago

I cannot find the document passed to @vkuznet and @amaltaro back in the days of the creation of MS transferor. Unified has 1) placed the secondary where necessary (according to available space and campaign configuration), and 2) located the primary input dataset at sites holding enough copies of the secondary (when local read was required).

amaltaro commented 4 years ago

@vlimant the secondary data placement was performed at dataset level, right? Primary input was placed on a block basis, and only at sites holding the pileup dataset (if classic PU).

Would you remember how it used to handle large PU datasets only available at a site or two? Would it assign the workflow only to those 2 sites (and place the input primary blocks there)? Thanks

amaltaro commented 4 years ago

From today's meeting, it looks like we will have to couple primary (and parent, if needed) blocks to the same locations that either hold the PU or are getting a subscription of the PU dataset.

Perhaps we could even try to keep 2 copies of the pileup dataset if its size is under a given threshold (50TB?), such that we enforce 2 copies of the pileup within the workflow SiteWhitelist. Of course, depending on how workflows get assigned, we could very well end up with many more than 2 copies on the grid, which makes me think we could have a limit on the number of dataset copies as well. But would it be a hardwired limit, or would it float according to the PU dataset size?
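
Just to illustrate the rule (the 50TB threshold and the cap of 2 copies are the numbers floated above; the function itself is made up):

    def desired_pileup_copies(dataset_size_tb, threshold_tb=50, max_copies=2):
        """Keep 2 copies within the workflow SiteWhitelist for a small enough PU
        dataset, otherwise a single copy. A grid-wide cap on the total number of
        replicas would still be needed on top of this."""
        return max_copies if dataset_size_tb <= threshold_tb else 1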

nsmith- commented 4 years ago

I expect (hope) that the number of workflows that read classical mixing secondaries is small enough that we can accept intersecting the site whitelist with the (full) secondary dataset location to decide where to run. For premix pileup, I think no such restriction should exist--we have run jobs reading premix remotely from FNAL/CERN/etc. regularly without issue for some time now, no? I am not sure why this is suddenly coming up, or is this a problem only restricted to classical mixing where remote reading is not enabled in the campaign?

nsmith- commented 4 years ago

If this is just about placing secondary datasets that are not already on disk somewhere, then the secondary placement should probably be somewhat manual as it is infrequent, essentially once per new classical mixing campaign. Then all jobs in that campaign would need to be restricted to this small set of sites hosting the pileup.

amaltaro commented 4 years ago

I expect (hope) that the number of workflows that read classical mixing secondaries is small enough that we can accept intersecting the site whitelist with the (full) secondary dataset location to decide where to run.

I'm trying to find this out. Perhaps Scarlet would know it. (She already replied, it seems to be a small % of the work, perhaps around 10% of the premix wfs).

For premix pileup, I think no such restriction should exist--we have run jobs reading premix remotely from FNAL/CERN/etc. regularly without issue for some time now, no?

Yes, AFAIK premix workflows have a pretty good success rate reading data remotely.

I am not sure why this is suddenly coming up, or is this a problem only restricted to classical mixing where remote reading is not enabled in the campaign?

I think it got exposed with StepChain (stuck) workflows, because then we try to match SiteWhitelist + InputDataLocation + PileupDataLocation, and given that MSTransferor distributes blocks equally among the SiteWhitelist, it could be that blocks are not available at the same site as the full pileup dataset.
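
A toy example of that mismatch (block names are made up; site names taken from the workflow above):

    site_whitelist = ["T1_US_FNAL", "T1_UK_RAL", "T2_DE_DESY"]
    pileup_location = ["T1_US_FNAL"]  # full classic PU copy only here
    # MSTransferor spread the primary blocks evenly over the SiteWhitelist
    primary_block_location = {"blockA": ["T1_US_FNAL"],
                              "blockB": ["T1_UK_RAL"],
                              "blockC": ["T2_DE_DESY"]}

    for block, sites in primary_block_location.items():
        match = set(site_whitelist) & set(sites) & set(pileup_location)
        print(block, sorted(match))  # blockB and blockC end up with no valid site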

If this is just about placing secondary datasets that are not already on disk somewhere ...

It's actually about making sure that at least 1 block of the pileup dataset is at the same site that holds one/many of the primary blocks.

From today's meeting, it looks like we will have to couple primary blocks (and parent, if needed) to the same locations either holding the PU or that are getting a subscription of the PU dataset.

From my previous comment: I think that will be the way to go. Make sure the PU dataset is available at at least one site within the SiteWhitelist, and put ALL the primary/parent blocks at the same location(s) that contain the PU. Workflows will take longer to run, but it's probably better than replicating 100TB or more of classic PU...

todor-ivanov commented 4 years ago

Hi @amaltaro. One question from me, because I am afraid I may get lost here:

It's actually making sure that - at least - 1 block of the pileup dataset is at the same site that holds one/many of the primary blocks.

Aren't we talking about a pileup that needs to be fully present at a site where the workflow is supposed to run?

While investigating with Scarlet, we found that for some reason (not yet clear what) some sites had only some random blocks of the pileup, and we could see jobs there from one such workflow, but the failure rate was significant (I believe due to failed attempts to read the pileup through xrootd).

There is of course a big chance that I am mixing different cases - this must never be excluded.

amaltaro commented 4 years ago

Yes Todor, in the end we want to have a full copy of the PU dataset at the same location, to avoid reusing the same events over and over when processing the primary signal. However, in terms of WMCore/WMAgent requirements, having one PU block in place is enough to let a workqueue element go through the system and get processed by WMAgent. For instance, for our integration tests, we only use a block or two of the PU samples (in case they are not available at CERN/FNAL).
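
Roughly, the minimal WMCore/WMAgent requirement versus the physics goal (an illustrative check, not the actual code):

    def element_is_acceptable(pileup_block_locations, site_whitelist):
        """One PU block co-located with a whitelisted site is enough for the
        workqueue element to be acquired and processed by WMAgent, even though
        a full PU copy at that site is what we want physics-wise."""
        return any(set(sites) & set(site_whitelist)
                   for sites in pileup_block_locations.values())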

About the standalone blocks, I think Benedikt said they were available at those sites because those are their origin sites (where they got produced).