CMSCompOps / WmAgentScripts

CMS Workflow Team Scripts

Create&Assign a backfill with custom job splitting for T3_US_ANL #793

Open haozturk opened 3 years ago

haozturk commented 3 years ago

Impact of the new feature
HPC sites, T3_US_ANL in particular

Is your feature request related to a problem? Please describe.
There is a new HEPCloud site for which custom settings are required in workflow assignment. One of these custom settings is job splitting. This site requires jobs with a wall time of at most 6 hours. That's why we need to adjust the job splitting so that we create smaller jobs that fit within this limit.

Describe the solution you'd like
Create a tool/script that helps us customize job splitting in workflow assignment.

Describe alternatives you've considered
None

Additional context
The issue is currently being discussed on Slack; I will update this issue as we proceed. @z4027163 @amaltaro @todor-ivanov @drkovalskyi FYI
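As an illustration of the kind of helper requested above, here is a minimal sketch that picks a divisor for events_per_job so that jobs fit the 6-hour wall time limit. It is not the actual tool: the function name `split_divisor`, the `safety` margin, and the 40-hour job estimate are hypothetical, and it assumes that job wall time scales linearly with the number of events per job.

```python
import math


def split_divisor(events_per_job, est_job_hours, wall_limit_hours=6.0, safety=0.8):
    """Return (divisor, new_events_per_job) so the estimated wall time fits the limit.

    Assumption (not from the issue): wall time scales linearly with events_per_job.
    `safety` is a hypothetical margin that keeps jobs below the hard limit.
    """
    if events_per_job <= 0 or est_job_hours <= 0:
        raise ValueError("events_per_job and est_job_hours must be positive")
    if not 0 < safety <= 1:
        raise ValueError("safety must be in (0, 1]")

    target_hours = wall_limit_hours * safety
    # Smallest integer divisor that brings the estimated wall time under the target.
    divisor = max(1, math.ceil(est_job_hours / target_hours))
    # A job cannot contain fewer than one event.
    divisor = min(divisor, events_per_job)
    return divisor, max(1, events_per_job // divisor)


# Hypothetical input: 2000 events per job, estimated to run for 40 hours.
divisor, new_events = split_divisor(2000, 40.0)
print(divisor, new_events)
```

In practice the per-job time estimate would have to come from monitoring or from the request's own time-per-event setting; the sketch leaves that to the caller.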

haozturk commented 3 years ago

This is the backfill: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_TSG-Phase2HLTTDRWinter20GS-Backfill-00276__v1_T_210223_202244_6379

To adjust the job splitting, I used the ReqMgr GUI and divided events_per_job by 8. Another detail is that I used the hepcloud team instead of backfill, since the backfill agents are at CERN and cannot work with the T3_US_ANL site.

haozturk commented 3 years ago

This backfill had some issues due to wallclock time constraints. This time we will try again by dividing events_per_job by 20, which gives 2000/20 = 100 events per job, and monitor its status.

New backfill: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_TSG-Phase2HLTTDRWinter20GS-Backfill-00276__v1_T_210324_191726_5368
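The splitting arithmetic in this thread can be sketched as a small helper. The function name `scale_events_per_job` is hypothetical, and applying the earlier divisor of 8 to the same 2000-event baseline is an assumption; the thread only states the baseline explicitly for the divisor of 20.

```python
def scale_events_per_job(events_per_job, divisor):
    """Divide events_per_job by an integer divisor, keeping at least one event per job."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return max(1, events_per_job // divisor)


# First attempt: divisor of 8 (baseline of 2000 assumed to be the same).
print(scale_events_per_job(2000, 8))   # 250
# Second attempt: divisor of 20, as stated in the thread.
print(scale_events_per_job(2000, 20))  # 100
```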

haozturk commented 3 years ago

We assigned one production workflow to ANL: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_TSG-SnowmassWinter21wmLHEGEN-00008__v1_T_210330_084302_349
The job splitting is adjusted so that events_per_job is divided by 30: 9000 --> 300

haozturk commented 3 years ago

These workflows failed due to various issues. Dirk applied some fixes on the site end, so we resubmitted the workflows:

Production workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_TSG-SnowmassWinter21wmLHEGEN-00008__v1_T_210421_122144_910
Backfill: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_TSG-Phase2HLTTDRWinter20GS-Backfill-00276__v1_T_210421_121706_1714