dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

CMS@Home jobs not being cleaned up #10751

Open KatyEllis opened 3 years ago

KatyEllis commented 3 years ago

Impact of the bug Cleanup from the CMS@Home jobs is not being done properly since PhEDEx was turned off.

Describe the bug In an email from Ivan Reid on 12th August 2021, he describes the issue: cleanup jobs are trying to delete files from EOS instead of the DataBridge, and the merge log files have the same problem. The DataBridge storage does get garbage collected after 30 days, but the space could fill up before then.

How to reproduce it Steps to reproduce the behavior:

Expected behavior Change the (hardcoded?) target of the cleanup jobs?

Additional context and error message From Ivan's email:

Our current understanding is that in /cvmfs/cms.cern.ch/SITECONF/T3_CH_CMSAtHome/JobConfig/site-local-config.xml we have in stage-out:

  <se-name value="srm-eoscms.cern.ch"/>
  <phedex-node value="T2_CH_CERN"/>

while in event-data we have:

  <catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=http"/>

in which:

  <lfn-to-pfn protocol="http" path-match="/+store/(.*)" result="http://data-bridge.cern.ch/myfed/cms-output/store/$1"/>

ivanreid commented 3 years ago

Actually, the problem has always been there, even before PhEDEx was superseded. Failures can be verified at https://cmsweb-testbed.cern.ch/wmstats/index.html by filtering the workflows with "ireid_TC_SLC7".

ivanreid commented 1 year ago

Hello again. Does anyone have the time to look at this at the moment? We've moved on a fair bit in the interim and are quite close to being "production ready".

amaltaro commented 1 year ago

@ivanreid Hi Ivan, apologies for the delay on this, but other issues that actually involve production services are taking precedence over this one.

Can you help me understand and clarify this issue? Some of the questions I have are:

  1. I understand that Production/Processing jobs run on T3_CH_Volunteer resources, right? What is the phedex-node name defined in site-local-config.xml?
  2. All the remaining job types (tasks) run on T3_CH_CMSAtHome resources, right? And I see that the phedex-node name points to T2_CH_CERN. Is it correct?

Looking into: ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782

I see that:

ivanreid commented 1 year ago

@amaltaro That's OK Alan, I know there are more important issues.

I understand that Production/Processing jobs run on T3_CH_Volunteer resources, right? What is the phedex-node name defined in site-local-config.xml?

Correct. The phedex-node is <phedex-node value="T3_CH_CMSAtHome"/>.

all the remaining job types (tasks) run on T3_CH_CMSAtHome resources, right? And I see that the phedex-node name points to T2_CH_CERN. Is it correct?

Yes, precisely.

We've never understood why the LogCollect for Merge GenSimFull files tries to run on T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT. Presumably it's a cascading effect down the siteconfig chain.

Cleanup for unmerged GenSimFull files fails because it tries to delete from EOS rather than DataBridge. We suspect that this is because we are the only site where the unmerged files and merged files are in different storage locations, but are puzzled that GenSimFullMerge gets the source location right while cleanup doesn't.

amaltaro commented 1 year ago

Thanks Ivan!

We've never understood why the LogCollect for Merge GenSimFull files tries to run on T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT. Presumably it's a cascading effect down the siteconfig chain.

We need to check the logs, but I think the reason is the following: Merge jobs run at T3_CH_CMSAtHome, which uses and reports T2_CH_CERN as its phedex-node name (RSE). So the logArchive file created by the Merge job is theoretically available at T2_CH_CERN. By design, the agent does the right thing, which is sending a job to the sites associated with a given storage/RSE.

Cleanup for unmerged GenSimFull files fails because it tries to delete from EOS rather than DataBridge. We suspect that this is because we are the only site where the unmerged files and merged files are in different storage locations, but are puzzled that GenSimFullMerge gets the source location right while cleanup doesn't.

These Cleanup jobs are executed at T3_CH_CMSAtHome, because their input files are meant to be available at the "storage" T3_CH_CMSAtHome.

Now the question is: why does a Merge job on T3_CH_CMSAtHome work, but a Cleanup job on T3_CH_CMSAtHome does not? They have exactly the same use case, both reading data in - from the DataBridge(?). Maybe CMSSW uses the right path - read from storage.xml - while the Cleanup jobs don't. That is just a guess though.

ivanreid commented 1 year ago

I once tried to follow the logic of CMSSW reading/parsing the siteconf files -- it seemed to make a lot of hard-wired assumptions.

ivanreid commented 1 year ago

@amaltaro : It's been suggested that we look into this ourselves since the WMCore group is overloaded, and make suggestions as to possible solutions. I'm interested in why the Merge processes successfully read from DataBridge, but the CleanupUnmerged looks for the files on EOS instead. The code is fairly labyrinthine to the uninitiated, so could you please tell us which modules are responsible for the merging and the cleanup? Thanks.

amaltaro commented 1 year ago

@ivanreid apologies for the delay on getting back to this thread.

I compiled everything that we discussed so far to better understand these problems. Here is a fresh summary of the issue:

  1. Production/Processing jobs run on T3_CH_Volunteer resources, which has site-local-config configured to: phedex-node=T3_CH_CMSAtHome.
  2. All the remaining jobs/tasks run on T3_CH_CMSAtHome resources, which has site-local-config configured to: phedex-node=T2_CH_CERN.

Looking into: ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782, we can observe the following:

I copied a recent log for a cleanup job from vocms0267; you can find it here: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_10751/. I think the reason it fails is that it uses the Stage Out configuration (which points to T2_CH_CERN, since it is running on a CMSAtHome resource) to actually delete files in the storage. The command executed is:

  Command : xrdfs eoscms.cern.ch rm /eos/cms/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00007/9C44AA05-CE1F-564F-8043-4C784FE1B1D5.root
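For illustration only (this just applies the lfn-to-pfn rule quoted from storage.xml earlier by hand, outside WMCore), the event-data catalog would map that same file to a DataBridge URL rather than to EOS:

  # Not WMCore code: manually apply the event-data lfn-to-pfn rule quoted earlier to
  # the LFN behind the failing "xrdfs ... rm" command, to show where the unmerged
  # file lives on the DataBridge. Note the TFC syntax uses $1 where Python's re.sub
  # uses \1 for the same capture group.
  import re

  path_match = r"/+store/(.*)"
  result = r"http://data-bridge.cern.ch/myfed/cms-output/store/\1"

  lfn = ("/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/"
         "TC_SLC7_CMS_Home_IDRv6a-v11/00007/9C44AA05-CE1F-564F-8043-4C784FE1B1D5.root")

  print(re.sub(path_match, result, lfn))
  # http://data-bridge.cern.ch/myfed/cms-output/store/unmerged/DMWM_Test/...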

LogCollect jobs for MergeGenSimFull fail because their input data is available at the T2_CH_CERN storage, hence the jobs are submitted to the T2_CH_CERN computing sites. Given that this pool has no access to those resources, they sit there idle until they get killed by the agent for spending too long in Idle status (5 days).

The module responsible for deleting files from the storage (Cleanup jobs) is: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/DeleteMgr.py

Please let us know in case further assistance is needed. And thank you for looking into it!

ivanreid commented 1 year ago

@amaltaro Could you please also make available a log file from a successful merge job? Thanks.

amaltaro commented 1 year ago

@ivanreid sure, it's now available under the same web area mentioned above. Note that the merge tarball has multiple condor* logs; the reason is that the job failed on the first try (901124) and then succeeded in its first retry (901129). The logfile wmagentJob.log is unique though, and it gets overwritten between job retries. HTH.

ivanreid commented 1 year ago

My first impression on reading the log files is that DeleteMgr.py gets the PNN from the siteconf local-stage-out (lines 80-124), which we have pointed to T2_CH_CERN, whereas the merge jobs use the event-data instead (i.e. the DataBridge), lines 189-207 in SiteLocalConfig.py. So the cleanup jobs are ignoring event-data.
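To see that divergence directly, a quick check along these lines can be run wherever the site configuration is discoverable from the environment (a sketch, not part of WMCore, assuming the SiteLocalConfig attributes used elsewhere in this thread: siteName, localStageOut, eventData):

  # Rough diagnostic sketch (not part of WMCore): compare the catalog that DeleteMgr
  # uses (local-stage-out) with the one CMSSW jobs use (event-data).
  from WMCore.Storage.SiteLocalConfig import loadSiteLocalConfig

  siteCfg = loadSiteLocalConfig()

  print("Site name          :", siteCfg.siteName)
  print("local-stage-out TFC:", siteCfg.localStageOut.get("catalog", None))
  print("event-data TFC     :", siteCfg.eventData.get("catalog", None))
  # For T3_CH_CMSAtHome these two differ, which is why Cleanup (local-stage-out)
  # targets EOS while cmsRun (event-data) reads from the DataBridge.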

ivanreid commented 1 year ago

I haven't been able to work out exactly where/how the event-data TFC is passed to the merge jobs (the Python code is rather more advanced than my PyROOT coding), but it obviously is passed, since the right files are merged. The question probably now becomes: should DeleteMgr just blithely assume that the files to be deleted reside in LocalStageOut?

ivanreid commented 1 year ago

OK, so the merge jobs are actually CMSSW (cmsRun) jobs, so that might explain why they pick up the event-data TFC. By running the following awk code over /cvmfs/cms.cern.ch/SITECONF/T*/JobConfig/site-local-config.xml:

  BEGIN { ed=0; lso=0; }
  /site name/           { site=$0; }
  /<event-data>/        { ed=1; ecat=" "; }
  /<\/event-data>/      { ed=0; }
  /<local-stage-out/    { lso=1; lcat=" "; }
  /<\/local-stage-out>/ { lso=0;
                          if (ecat != lcat) {
                              print site " Catalogue files are different:";
                              print ecat; print lcat; print " ";
                          }
                        }
  (ed)  && /catalog/    { split($2,a,"?"); ecat=a[1]; }
  (lso) && /catalog/    { split($2,a,"?"); lcat=a[1]; }

I find differences between the event-data TFC and the local-stage-out TFC for T0_CH_CSCS_HPC, T2_BE_UCL, T2_ES_IFCA, T3_CH_CERN_HelixNebula_REHA, T3_CH_CMSAtHome, T3_RU_SINP, T3_US_ANL, T3_US_Anvil, T3_US_Baylor, T3_US_Colorado, T3_US_Lancium, T3_US_NERSC, T3_US_NEU, T3_US_OSG, T3_US_PSC, T3_US_PuertoRico, T3_US_Rice, T3_US_Rutgers, T3_US_SDSC, T3_US_TACC, T3_US_UIowa, T3_US_UMD and T3_US_UMiss -- although some of these are due to my simplistic parsing. [Edit: Further investigation with grafana shows that most of the T3 sites are either not monitored, or only used for analysis jobs.]

So it's probable that changing the cleanup jobs to use the event-data TFC would affect other sites as well. We could possibly put in a conditional section depending on the site name, but that seems inelegant to me.
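For anyone more comfortable in Python than awk, here is a rough stdlib-only equivalent of the scan above; it is only a sketch and keeps the same simplistic parsing, so it shares the same caveats:

  # Rough Python equivalent of the awk scan above; stdlib only, with the same
  # simplistic parsing (only the first catalog URL in each section is compared,
  # query string stripped).
  import glob
  import xml.etree.ElementTree as ET

  def first_catalog(section):
      # Return the first catalog url in a section, without its query string.
      if section is None:
          return None
      cat = section.find("catalog")
      if cat is None:
          return None
      return cat.get("url", "").split("?")[0]

  for cfg in sorted(glob.glob("/cvmfs/cms.cern.ch/SITECONF/T*/JobConfig/site-local-config.xml")):
      for site in ET.parse(cfg).getroot().findall("site"):
          ecat = first_catalog(site.find("event-data"))
          lcat = first_catalog(site.find("local-stage-out"))
          if ecat != lcat:
              print(site.get("name"), "catalogue files are different:")
              print("  event-data     :", ecat)
              print("  local-stage-out:", lcat)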

ivanreid commented 1 year ago

For discussion, here's a possible but inelegant solution to our immediate problem. Treat it as pseudo-code as I'm not sure if I've interpreted all the classes correctly, or if their members are available at the time. It also ignores possible problems downstream (e.g. when the unmerged mergeLog files in eoscms.cern.ch//eos/cms/store/unmerged/logs/prod/ are to be cleaned up).

[lxplus710:WMCore] > git diff
diff --git a/src/python/WMCore/Storage/DeleteMgr.py b/src/python/WMCore/Storage/DeleteMgr.py
index ff3df65..cc9451a 100644
--- a/src/python/WMCore/Storage/DeleteMgr.py
+++ b/src/python/WMCore/Storage/DeleteMgr.py
@@ -101,7 +101,11 @@ class DeleteMgr(object):
             msg += "Unable to perform StageOut operation"
             raise StageOutInitError(msg)
         msg += "Local Stage Out PNN to be used is %s\n" % pnn
-        catalog = self.siteCfg.localStageOut.get("catalog", None)
+
+        if self.siteName == "T3_CH_CMSAtHome":
+            catalog = self.siteCfg.eventData.get("catalog", None)
+        else:
+            catalog = self.siteCfg.localStageOut.get("catalog", None)
         if catalog == None:
             msg = "Unable to retrieve local stage out catalog\n"
             msg += "From site config file.\n"

ivanreid commented 1 year ago

On the basis that "it's easier to ask for forgiveness than permission", Federica and I applied the suggested patch on our agent on vocms0267, and started a new workflow. However, my fears that I might have got class membership/hierarchy wrong have been borne out. The cleanup jobs are now failing with the error "'DeleteMgr' object has no attribute 'siteName'". Back to the drawing-board...

amaltaro commented 1 year ago

@ivanreid thank you for this investigation so far. I am not saying this is the right way to resolve this issue, but if I follow the logic of the code, we can say the following.

If there is no override, this code (or a few lines below) gets triggered: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/DeleteMgr.py#L72

which ends up creating an instance of SiteLocalConfig: https://github.com/dmwm/WMCore/blob/fea1de4119a9091233f3e6414f33f069971b5a74/src/python/WMCore/Storage/SiteLocalConfig.py#L54

and that is the object returned to the DeleteMgr module. Inspecting the SiteLocalConfig object, we can see siteName here: https://github.com/dmwm/WMCore/blob/fea1de4119a9091233f3e6414f33f069971b5a74/src/python/WMCore/Storage/SiteLocalConfig.py#L75

So, your if statement above should likely be replaced by:

if self.siteCfg.siteName == "T3_CH_CMSAtHome":

This will not work though for scenarios where DeleteMgr is invoked with an override setup: https://github.com/dmwm/WMCore/blob/fea1de4119a9091233f3e6414f33f069971b5a74/src/python/WMCore/Storage/DeleteMgr.py#L126

I do not know by heart what those overrides are, but I am sure they are used either by specific tasks in central production or by the T0 team.

ivanreid commented 1 year ago

@amaltaro Yes, I saw there was an override provision, but haven't tried to track down where it's used. I had already changed the if statement as you suggested; now we're back to trying to delete from EOS again. I've asked Federica to put more diagnostics into our patched DeleteMgr.py to try to better understand what is happening. I've also explicitly put T3_CH_CMSAtHome into our event-data path in the site-config rather than 'local', but that doesn't appear to be having any effect.

ivanreid commented 1 year ago

Unfortunately, the suggested patch does not work, because the variable catalog is not used! Instead, a call is made to self.siteCfg.trivialFileCatalog(), which explicitly returns the localStageOut catalog.

fanzago commented 1 year ago

Hi Alan, do you know if there is already a way to force the usage of the fallback information instead of the info in localStageOut of the self.siteCfg object for the delete command? It seems that there is no override in our cleanup jobs. But if we are able to use the information provided by self.siteCfg.fallbackStageOut, we should be able to delete files from the DataBridge. So the solution (to try) is to explicitly force the usage of the fallback information. Otherwise, I think we have to modify the initialiseSiteConf function of DeleteMgr.py in order to skip the self.tfc = self.siteCfg.trivialFileCatalog() call and provide the needed info in another way.

fanzago commented 1 year ago

Dear Alan, the "patch" we have added in the DeleteMgr.py file of WMAgent, about the Cleanup step executed in T3_CH_CMSAtHome site, works well. Currently we don't have failures in our workflows. We "forced" the usage of the information provided by the site-local-config.xml instead of the usage of the info that are used as default in the cleanup step. In order to do it, in the init of the DeleteMgr.py we check if the site name is T3_CH_CMSAtHome and in this case, we call the new initialiseCmsAtHome function instead of the original one initialiseSiteConf. For sure this is not the best patch because it depends on the site name. But in order to avoid this patch and make the cleanup step working for all the sites, we think the logic of the DeleteMgr.py should be changed:

amaltaro commented 1 year ago

@fanzago Federica, thank you for this investigation.

Even if we don't merge your contribution into master, it's good to have it in the format of a PR because we can easily patch the CMS@Home agent.

For a more permanent fix, we can try to make it a priority for the upcoming quarter and have a closer look into it. We should get started with that planning in the coming days.