CMSCompOps / WorkflowWebTools

https://workflowwebtools.readthedocs.io

"Apply to multiple" use cases #31

Closed paorozo closed 6 years ago

paorozo commented 7 years ago

"Apply to multiple" is not a trivial task; we need to define the use cases somehow.

When do we want to apply an action to multiple workflows?

When we can cluster the workflows according to the failures they had, e.g.:

  1. A set of workflows is getting similar exit codes no matter the site (config issue).
  2. One site is in trouble and a set of workflows is getting the same exit codes at that site (site issue).
  3. Something in our system broke, and similar exit codes are spread over the workflows and sites (system issue). ...
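A minimal sketch of how such clusters could be found, assuming failures are available as `(workflow, site, exit_code)` records. The function names, the record layout, and the sample values are hypothetical and not taken from WorkflowWebTools; the label is only a hint, since telling a config issue from a system issue still needs the operator.

```python
from collections import defaultdict


def cluster_by_exit_code(failures):
    """Group (workflow, site, exit_code) records by exit code."""
    clusters = defaultdict(lambda: {"workflows": set(), "sites": set()})
    for workflow, site, exit_code in failures:
        clusters[exit_code]["workflows"].add(workflow)
        clusters[exit_code]["sites"].add(site)
    return dict(clusters)


def guess_issue_type(cluster):
    """Rough label for one cluster; the operator makes the final call."""
    if len(cluster["workflows"]) < 2:
        return "single workflow"
    if len(cluster["sites"]) == 1:
        return "site issue"
    return "config or system issue"


# Illustrative records only (hypothetical workflows and exit codes).
failures = [
    ("workflow_A", "T2_IT_Legnaro", 8026),
    ("workflow_B", "T2_IT_Legnaro", 8026),
    ("workflow_A", "T2_CH_CERN", 50660),
    ("workflow_B", "T1_US_FNAL", 50660),
]
clusters = cluster_by_exit_code(failures)
labels = {code: guess_issue_type(c) for code, c in clusters.items()}
```

Workflows that share a label and an exit code would be the natural candidates to offer in "Apply to multiple".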

What actions do we want to apply?

  1. If the config issue can be solved or avoided, take action X. Action X should not touch the sites selected by default in the acdcserver.
  2. Either do as in (1), or avoid the site in trouble by unchecking it from the default list. Unchecking makes sense when we are dealing with one workflow, but not with "Apply to multiple", because every workflow has a different set of sites where it needs to run. We should somehow ban one or N sites in all the workflows.
  3. In this case, do as in (1).

What should the action look like?

First use case (simple case): we are seeing reading and unreported failures at some sites.

My decision as the operator is to try a plain ACDC on both workflows. I am not going to change any parameter, and I am going to select pdmvserv_task_HIG-RunIISummer15GS-02210__v1_T_170316_214140_2984 in "Apply to multiple". The action looks like this:

{"pdmvserv_task_HIG-RunIISummer15GS-02107__v1_T_170316_212103_296": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"AllSteps": {"memory": ""}, "HIG-RunIISummer15GS-02107_0/HIG-RunIISummer15GS-02107_0MergeRAWSIMoutput/HIG-RunIISummer16DR80Premix-02516_0": {"sites": "T2_IT_Legnaro", "memory": ""}}}, "pdmvserv_task_HIG-RunIISummer15GS-02210__v1_T_170316_214140_2984": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"AllSteps": {"memory": ""}, "HIG-RunIISummer15GS-02107_0/HIG-RunIISummer15GS-02107_0MergeRAWSIMoutput/HIG-RunIISummer16DR80Premix-02516_0": {"sites": "T2_IT_Legnaro", "memory": ""}}}}

This is wrong: pdmvserv_task_HIG-RunIISummer15GS-02210__v1_T_170316_214140_2984 must not run at T2_IT_Legnaro.
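One way to avoid copying the first workflow's sites onto the others is to express the action as a ban and apply it to each workflow's own default sites. The sketch below assumes a hypothetical `defaults_by_workflow` mapping (workflow, then task, then default site list); the helper names are illustrative, and only the `Action`/`Parameters`/`sites`/`memory` keys mirror the JSON above.

```python
def ban_sites(default_sites, banned):
    """Drop banned sites from every task of ONE workflow.

    default_sites: {task_name: [site, ...]}
    """
    banned = set(banned)
    return {
        task: [site for site in sites if site not in banned]
        for task, sites in default_sites.items()
    }


def build_actions(defaults_by_workflow, banned, action="recover"):
    """Build one action per workflow from that workflow's own defaults."""
    actions = {}
    for workflow, default_sites in defaults_by_workflow.items():
        parameters = {
            task: {"sites": sites, "memory": ""}
            for task, sites in ban_sites(default_sites, banned).items()
        }
        actions[workflow] = {"Action": action, "Parameters": parameters}
    return actions


# Hypothetical workflows and tasks, for illustration only.
defaults_by_workflow = {
    "workflow_A": {"task_A0": ["T2_IT_Legnaro", "T2_CH_CERN"]},
    "workflow_B": {"task_B0": ["T1_US_FNAL", "T2_IT_Legnaro"]},
}
actions = build_actions(defaults_by_workflow, banned=["T2_IT_Legnaro"])
```

Each workflow keeps its own task names and its own remaining sites, so nothing from `workflow_A` leaks into `workflow_B`.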

dabercro commented 7 years ago

Okay, I can add some options:

paorozo commented 7 years ago

Thanks Dan, this is what we need. For the "Ban" option, please do not force the operator to select other sites; if he/she selects a "red" site, there must be a good reason.

dabercro commented 7 years ago

@prozober I will update #35 on the server tonight. Try it out tomorrow, if you can, and let me know if anything goes wrong.

paorozo commented 7 years ago

I applied an action to multiple workflows: from pdmvserv_task_TRK-PhaseIFall16GS-00016__v1_T_170310_122236_4192 to pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223. The action looks a bit strange to me.

{"pdmvserv_task_TRK-PhaseIFall16GS-00016__v1_T_170310_122236_4192": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput": {"sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0": {"xrootd": "enabled", "sites": ["T1_ES_PIC", "T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_1/TRK-PhaseIFall16DR-00033_1MergeAODSIMoutput": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_1": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN", "T2_US_MIT", "T2_US_Nebraska"], "memory": ""}}}, "pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_1": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN", "T2_US_MIT", "T2_US_Nebraska"], "memory": ""}, "StepOneProc": {"sites": 
["T1_FR_CCIN2P3", "T1_IT_CNAF", "T1_US_FNAL", "T2_CH_CERN", "T2_US_Caltech", "T2_US_MIT", "T2_US_Nebraska", "T2_US_Purdue"], "memory": ""}, "StepOneProc/StepOneProcMergeALCARECOStreamEcalUncalZElectron": {"sites": ["T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_1/TRK-PhaseIFall16DR-00033_1MergeAODSIMoutput": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0": {"xrootd": "enabled", "sites": ["T1_ES_PIC", "T1_US_FNAL", "T2_CH_CERN"], "memory": ""}, "StepOneProc/StepOneProcMergeALCARECOoutput": {"sites": ["T2_US_Nebraska"], "memory": ""}, "TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput": {"sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}}}}

For example, here:

"pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"TRK-PhaseIFall16GS-00016_0/TRK-PhaseIFall16GS-00016_0MergeRAWSIMoutput/TRK-PhaseIFall16DR-00033_0/TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput": {"xrootd": "enabled", "sites": ["T1_US_FNAL", "T2_CH_CERN"], "memory": ""}

The task TRK-PhaseIFall16DR-00033_0MergeRAWSIMoutput does not belong to the workflow pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223. @dabercro, could you please take a look?
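A guard against this kind of leak could be sketched as below, assuming the server can list the tasks that really belong to each workflow (`own_tasks` here is a hypothetical input, and `filter_parameters` is not an existing WorkflowWebTools function). Keeping the `AllSteps` key is also an assumption, based on its appearance in the earlier action.

```python
def filter_parameters(parameters, own_tasks, always_keep=("AllSteps",)):
    """Keep only parameters for tasks that belong to the target workflow.

    always_keep lists keys treated as workflow-wide (an assumption).
    """
    allowed = set(own_tasks) | set(always_keep)
    return {task: opts for task, opts in parameters.items() if task in allowed}


# Parameters copied from another workflow, plus one genuine task.
parameters = {
    "StepOneProc": {"sites": ["T2_CH_CERN"], "memory": ""},
    "TRK-PhaseIFall16GS-00016_0": {"sites": ["T1_US_FNAL"], "memory": ""},
}
own_tasks = ["StepOneProc", "StepOneProc/StepOneProcMergeALCARECOoutput"]
filtered = filter_parameters(parameters, own_tasks)
```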

paorozo commented 7 years ago

On the other hand, I realized that when we ban a site, we usually enable xrootd for the related tasks. Sometimes the only default site is the one we want to ban. So, when we ban a site we need to be able to:

dabercro commented 7 years ago

Okay, I should be able to fix the task problem today. I already know what caused it. The other requests are definitely possible, but I'll have to think about how I want to do that.

dabercro commented 7 years ago

#38 should fix the extra steps.

@prozober for the other two fixes, I'm thinking that if zero sites are detected as available after the action is submitted, the affected workflows won't be submitted and the operator will be prompted to select sites or xrootd for the relevant steps. Does that sound okay for now? It'll be a little more involved to give the operator a warning before the submission, though this would be the ultimate goal.
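Roughly, the check described here could look like the sketch below. It is not the actual server code: the function names are hypothetical, it only inspects steps that carry a `sites` key, and it assumes a step with `"xrootd": "enabled"` is acceptable even with no sites.

```python
def steps_without_sites(parameters):
    """Return steps whose site list is empty and that do not use xrootd."""
    return sorted(
        step
        for step, opts in parameters.items()
        if "sites" in opts
        and not opts["sites"]
        and opts.get("xrootd") != "enabled"
    )


def split_submittable(actions):
    """Separate actions that can be submitted from those needing input."""
    ready, held = {}, {}
    for workflow, action in actions.items():
        missing = steps_without_sites(action.get("Parameters", {}))
        if missing:
            held[workflow] = missing  # operator must pick sites or xrootd
        else:
            ready[workflow] = action
    return ready, held


# Hypothetical actions after a ban has emptied some site lists.
actions = {
    "workflow_A": {"Action": "recover", "Parameters": {
        "task_A0": {"sites": [], "memory": ""},
        "task_A1": {"sites": [], "xrootd": "enabled", "memory": ""},
    }},
    "workflow_B": {"Action": "recover", "Parameters": {
        "task_B0": {"sites": ["T2_CH_CERN"], "memory": ""},
    }},
}
ready, held = split_submittable(actions)
```

Running the same check before submission instead of after would give the early warning mentioned as the ultimate goal.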

paorozo commented 7 years ago

Yes, we need to pop up a warning saying that there are no sites left to select. For example, here: https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_JME-PhaseISpring17GS-00002__v1_T_170422_041914_8197 As T2_FR_CCIN2P3 is going to be banned, 5 out of 6 tasks won't have any site assigned. Where can we select an alternative site to run at?

dabercro commented 7 years ago

Ah, I could have answered the question directly by saying that you can just resubmit actions the standard way, and the old parameters will be overwritten.

Even better though: #42 now creates a page when there are recovery tasks with no sites to run at. Here, you will have to select sites manually. I haven't thought of a more clever way to do it. After submitting sites from this page, it just returns what "/getactions?days=1" would show. From here, you can check that sites are listed.

paorozo commented 7 years ago

Today I had a problem with the "Apply to multiple" option.

Four requests are getting the same fatal exception (exit code 8026), so I decided to try a plain ACDC on all of them. I went to pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00001v1_T_170716_002841_3832, chose ACDC (no parameters were modified), and selected the other three workflows in "Apply to multiple". I sent the action, but it was only taken for pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00001v1_T_170716_002841_3832; the other three requests are sitting in getAction:

{"pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "prozober", "Parameters": {}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "prozober", "Parameters": {}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "prozober", "Parameters": {}}}

Checking the actor logs I got:

Looking at pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272 for recovery options
Going to create ACDCs for  pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272
Empt action submitted for workflow pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272
Moving on. Parameters is blank for pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272
----------------------------------------------------------------------------------------------------
Looking at pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340 for recovery options
Going to create ACDCs for  pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340
Empt action submitted for workflow pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340
Moving on. Parameters is blank for pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340
----------------------------------------------------------------------------------------------------
Looking at pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163 for recovery options
Going to create ACDCs for  pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163
Empt action submitted for workflow pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163
Moving on. Parameters is blank for pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163
finished
Fri Jul 21 14:33:03 CEST 2017

So "Parameters" is empty. @dabercro, could you please take a look?
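For reference, a quick way to spot such entries in a dump of pending actions is sketched below. It assumes the dump is a JSON object shaped like the one above; the helper name and the shortened workflow names are hypothetical.

```python
import json


def workflows_with_empty_parameters(actions_json):
    """List workflows whose action has a missing or empty "Parameters"."""
    actions = json.loads(actions_json)
    return sorted(
        workflow
        for workflow, action in actions.items()
        if not action.get("Parameters")
    )


# Shortened, made-up workflow names in the same layout as the dump above.
actions_json = """
{"workflow_A": {"Action": "acdc", "Reasons": [], "Parameters": {}},
 "workflow_B": {"Action": "acdc", "Reasons": [], "Parameters":
    {"task_B0": {"sites": ["T2_CH_CERN"], "memory": ""}}}}
"""
empty = workflows_with_empty_parameters(actions_json)
```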

paorozo commented 7 years ago

I forgot to mention, I selected Auto in "Site Selection Method".

dabercro commented 7 years ago

Okay, I figured out the problem. It should be fixed with #61 and hopefully the fix doesn't cause a different problem. Now when I test the submission with acdc and all defaults, I get the following parameters:

{"pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00003__v1_T_170716_003045_4272": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "dabercro", "Parameters": {"HIN-pPb816Spring16wmLHEGS-00003_0/HIN-pPb816Spring16wmLHEGS-00003_0MergeRAWSIMoutput/HIN-pPb816Summer16DR-00144_0": {"sites": ["T2_CH_CERN", "T2_US_MIT"], "memory": ""}, "HIN-pPb816Spring16wmLHEGS-00003_0": {"sites": ["T2_CH_CERN", "T2_CH_CERN_HLT", "T2_US_MIT"], "memory": ""}, "HIN-pPb816Spring16wmLHEGS-00003_0/HIN-pPb816Spring16wmLHEGS-00003_0MergeRAWSIMoutput": {"sites": ["T2_CH_CERN"], "memory": ""}}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00004__v1_T_170716_121423_5340": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "dabercro", "Parameters": {"HIN-pPb816Spring16wmLHEGS-00004_0/HIN-pPb816Spring16wmLHEGS-00004_0MergeRAWSIMoutput/HIN-pPb816Summer16DR-00145_0": {"sites": ["T2_CH_CERN", "T2_US_MIT"], "memory": ""}}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00006__v1_T_170716_002903_994": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "dabercro", "Parameters": {"HIN-pPb816Spring16wmLHEGS-00006_0/HIN-pPb816Spring16wmLHEGS-00006_0MergeRAWSIMoutput/HIN-pPb816Summer16DR-00147_0": {"sites": ["T2_CH_CERN", "T2_US_MIT"], "memory": ""}, "HIN-pPb816Spring16wmLHEGS-00006_0": {"sites": ["T2_CH_CERN", "T2_CH_CERN_HLT", "T2_US_MIT"], "memory": ""}}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00002__v1_T_170716_002939_6163": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "dabercro", "Parameters": {"HIN-pPb816Spring16wmLHEGS-00002_0/HIN-pPb816Spring16wmLHEGS-00002_0MergeRAWSIMoutput/HIN-pPb816Summer16DR-00143_0": {"sites": ["T2_CH_CERN", "T2_US_MIT"], "memory": ""}}}, "pdmvserv_task_HIN-pPb816Spring16wmLHEGS-00005__v1_T_170716_002920_52": {"Action": "acdc", "Reasons": [], "ACDCs": [], "user": "dabercro", "Parameters": {"HIN-pPb816Spring16wmLHEGS-00005_0/HIN-pPb816Spring16wmLHEGS-00005_0MergeRAWSIMoutput/HIN-pPb816Summer16DR-00146_0": {"sites": ["T2_CH_CERN", "T2_US_MIT"], "memory": ""}}}}

dabercro commented 7 years ago

I have a meeting for the next hour, but I'll merge this onto vocms0113 after it's over.

dabercro commented 7 years ago

I should maybe also mention that I tested using my desktop as a server. I didn't submit new information to vocms0113.