Closed dabercro closed 7 years ago
The new ACDCs for the workflow pdmvserv_task_EXO-RunIISummer15GS-03460 have been created and hopefully assigned. @prozober, please let me know if something isn't right:
Assigned workflow: areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170307_151102_1973 to site: [u'T2_DE_RWTH'] and team production
Assigned workflow: areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170307_151112_2569 to site: [u'T0_CH_CERN', u'T2_US_Florida'] and team production
We have a couple of problems:
If a site is draining, it shouldn't be "enabled" in the list we use during the assignment. That would keep ACDCs from getting stuck in acquired. What do you think @dabercro?
I found and fixed the problem with the task names. I will abort the two ACDCs mentioned above.
I think the normal recoveror checks which sites are ready and doesn't assign to sites in drain. I will add this to my code. However, if the operator only says to go to sites that are in drain, what should the script do? Probably refuse to create ACDCs and quit with an error message?
I cannot find a case where we need to assign the workflow to a site in drain. I think disabling the draining sites in our assignment interface would be enough.
Okay, that shouldn't take long. I'm thinking the sites in drain should be marked, so you know which they are (I'd make them red or something, it's more informative than them not being there at all) with a warning if no enabled sites are selected. How does that sound?
Doing it from the interface sounds good. As long as the site doesn't go into drain in the time between submitting the action and running the script to actually create the ACDCs, that should work.
Ah, that's a good point. Doing it from the interface should cut down on the occurrence, but I agree that you should probably handle it in your script too.
Making the sites red in the interface is a good idea. How can we handle the script "exceptions" for the assignment? I mean, how can the script give some feedback to the interface?
@prozober as the operator, what would you like the script to do in that case? Assign it anyway, and you can catch it in the Unified critical page and deal with it, or fail with an error message, or option 3 I can't think of?
For sites in drain, the ACDC might need the site the hard way (the data is only at that site), so the site should be in the whitelist and used in assignment if the action is set that way. Assignment should therefore go on. The use case is that the operator knows the site is going to come back soon, and it's OK to assign it already, leaving the agent to start submitting when the site comes back online.
BTW, http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=critical&module=GQ&limit=100 is picking up the ACDCs that cannot run in these situations. So let's not build more complication into this, and leave the operator/AI to judge whether or not to act this way.
@prozober, let me know when you've resubmitted the action the way it should be, and we can test the script again. I've changed it back so it doesn't worry about sites in drain and trusts the operator to handle it, per Jean-Roch's suggestion.
Allie, I've already sent the action. Please, go ahead.
Third time's the charm:
areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170308_163052_6405
areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170308_163108_3300
For task EXO-RunIISummer15GS-03460_0/EXO-RunIISummer15GS-03460_0MergeRAWSIMoutput/EXO-RunIISummer16DR80Premix-07371_0, I enabled the xrootd option, so, everything is OK. For EXO-RunIISummer15GS-03460_0 I didn't set any value (we need to take the value from its originalReques), but as you can see here: https://cmsweb.cern.ch/reqmgr2/fetch?rid=areinsvo_ACDC0_task_EXO-RunIISummer15GS-03460__v1_T_170308_163108_3300 we have "TrustSitelists": true
The ACDCs are running, so we had better let them finish.
@dabercro, is there a way we can "uncheck" the radio button for xrootd, secondary and splitting options?
@prozober https://github.com/dabercro/WorkflowWebTools/commit/35d40c9a484674ede8afcb8840326899e2ab4728 allows you to double click a button for xrootd, etc to make it false again. Since it's not a backend change, I was able to push it to the server already without a restart.
What does the json look like if it is false? Right now it is 'xrootd': 'enabled' when set to true. Safe to assume it is set to 'xrootd':'disabled' when false?
Yeah, that's exactly what it'll be. Keep in mind that it might also not be set. This thread is getting long, but I put a comment a couple days ago of how I would read the dictionary. Here it is again:
    use_xrootd = response[prepID]['Parameters'].get('xrootd')
    if use_xrootd is None:
        # If not set, get the default value of xrootd
        # (placeholder call: parameters to get the recovery docs?)
        use_xrootd = some_function_call(params)
    # More pythonic would be:
    # use_xrootd = response[prepID]['Parameters'].get('xrootd', some_function_call(params))
    if use_xrootd == 'enabled':
        pass  # Using enabled option
    elif use_xrootd == 'disabled':
        pass  # Using disabled option
    else:
        pass  # Error handling for weird value
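For what it's worth, a self-contained version of the same logic might look like the sketch below. The helper name resolve_xrootd, its default argument (standing in for whatever would be read from the recovery docs), and the sample response are made up for illustration. Only the 'enabled'/'disabled' values and the response[prepID]['Parameters'] layout come from this thread.

```python
def resolve_xrootd(parameters, default='disabled'):
    """Return True or False for the xrootd option of one action.

    `parameters` is the 'Parameters' dict of an action. `default` is a
    hypothetical stand-in for the value taken from the recovery docs.
    """
    value = parameters.get('xrootd')
    if value is None:
        # Option was never set in the interface: fall back to the default
        value = default
    if value == 'enabled':
        return True
    if value == 'disabled':
        return False
    # Anything else is a value the interface should never produce
    raise ValueError('Unexpected xrootd value: %r' % (value,))


# Made-up response, keyed by prepID as in the snippet above
response = {'EXO-RunIISummer15GS-03460': {'Parameters': {'xrootd': 'enabled'}}}
use_xrootd = resolve_xrootd(response['EXO-RunIISummer15GS-03460']['Parameters'])
```

Raising on an unknown value, rather than silently picking one, keeps a typo in the action from turning into a wrongly configured ACDC.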
Oops, sorry about that. I did see it (and use it), but I clearly didn't remember all of the details. Thanks Dan!
There are 61 wfs in manual-assistance https://vocms049.cern.ch/unified/assistance.html#assistance-manual but https://vocms049.cern.ch/unified/all_errors.json is empty. @vlimant, @areinsvo, could you please take a look?
Yes, I disabled it by mistake. It should come back in the next cycles. In the end, we need to find a way to decouple it from Unified for building its content. The list of workflows should be enough. Maybe we have to plan how to do this.
I already have something that gets errors in the same format from /wmstatsserver/data/jobdetail/ using the workflow name alone. I'll make a test branch that does this with the all_errors.json keys and compare the results with using the full file.
Taking a quick look, almost all the workflows in assistance have reading issues. I think this is the moment to test the decision making using the clustering algorithm. Once the global error is populated, I will send a couple of ACDCs. @areinsvo I will let you know when I send the actions.
@prozober To manually force the global errors to update, navigate to https://vocms0113.cern.ch:80/resetcache This is linked on the welcome page. I should make it accessible from the global errors page too.
@dabercro we might consider moving these to vocms049 and integrating them in Unified, so that it has direct access to the db. "Setting the action" and "enacting" should stay separate anyway IMO, so moving to 049 will not be an issue.
That would probably be a good idea. When working on vocms0113 though, I had trouble getting mod_wsgi compiled for Python 2.7. Would you want to set that up on vocms049, or should we just use the built-in CherryPy server and only open the used port to CERN addresses?
@areinsvo I've submitted an action for the workflow pdmvserv_task_HIG-PhaseIFall16wmLHEGS-00056__v1_T_170316_161618_7407, could you please take a look?
{u'HIG-PhaseIFall16wmLHEGS-00056_0/HIG-PhaseIFall16wmLHEGS-00056_0MergeLHEoutput': {'xrootd': u'enabled', 'sites': [u'T2_UK_London_Brunel', u'T2_UK_London_IC', u'T2_UK_SGrid_Bristol', u'T2_UK_SGrid_RALPP'], 'memory': u''}, u'HIG-PhaseIFall16wmLHEGS-00056_0/HIG-PhaseIFall16wmLHEGS-00056_0MergeRAWSIMoutput': {'xrootd': u'enabled', 'sites': [u'T2_UK_London_Brunel', u'T2_UK_London_IC', u'T2_UK_SGrid_Bristol', u'T2_UK_SGrid_RALPP'], 'memory': u''}, u'HIG-PhaseIFall16wmLHEGS-00056_0/HIG-PhaseIFall16wmLHEGS-00056_0MergeRAWSIMoutput/HIG-PhaseIFall16DR-00109_0': {'sites': u'T1_US_FNAL', 'memory': u''}, 'AllSteps': {'memory': u''}}
ACDCs created for three tasks:
areinsvo_ACDC0_task_HIG-PhaseIFall16wmLHEGS-00056__v1_T_170328_164213_3110
areinsvo_ACDC0_task_HIG-PhaseIFall16wmLHEGS-00056__v1_T_170328_164220_3110
areinsvo_ACDC0_task_HIG-PhaseIFall16wmLHEGS-00056__v1_T_170328_164228_1417
@prozober Let me know if anything needs to be changed.
@areinsvo, I created the ACDCs for this workflow today by mistake, I am sorry. I aborted your ACDCs. Could you please take a look at the action I just sent for pdmvserv_task_TRK-PhaseIFall16GS-00017__v1_T_170310_150656_7717? Thanks!
BTW, Ali's ACDCs were correctly created and assigned. Let's see how task_TRK-PhaseIFall16GS-00017 runs.
I tried to run the script on the new action that you created, but it fails my check against creating partial ACDCs. According to the ACDC documents, there should be 9 tasks to recover (see below), but the action that was submitted only includes 6 tasks (numbers 3 - 8). Am I supposed to be ignoring tasks that include "CleanupUnmerged"?
Yes, cleanup should indeed be ignored. Isn't recoveror doing this by default?
Not explicitly, although I'm using WMErr rather than getSummary to get the list of tasks. Maybe cleanup jobs were already excluded from getSummary so it didn't matter before. It's an easy thing to add to my script, however.
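A rough sketch of what that addition could look like, assuming the task names from WMErr and from the action json are already in the same format. The function names are made up for illustration, and this version compares the task sets rather than just the counts, so the error message can say which tasks differ.

```python
def tasks_to_recover(error_tasks):
    """Drop cleanup tasks from the list of tasks that reported errors."""
    return [task for task in error_tasks if 'CleanupUnmerged' not in task]


def check_not_partial(error_tasks, action_tasks):
    """Raise if the action does not cover exactly the tasks needing recovery.

    `error_tasks` would come from WMErr, `action_tasks` from the action json.
    """
    expected = set(tasks_to_recover(error_tasks))
    requested = set(action_tasks)
    if expected != requested:
        raise RuntimeError(
            'Partial ACDC: missing %s, unexpected %s'
            % (sorted(expected - requested), sorted(requested - expected)))


# Made-up task names: one real task plus its cleanup task
check_not_partial(
    ['Task_0/Task_0MergeOutput', 'Task_0/Task_0CleanupUnmergedOutput'],
    ['Task_0/Task_0MergeOutput'])
```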
ACDCs created for pdmvserv_task_TRK-PhaseIFall16GS-00017__v1_T_170310_150656_7717:
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174440_2926
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174448_4410
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174457_1952
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174504_9699
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174513_6263
areinsvo_ACDC0_task_TRK-PhaseIFall16GS-00017__v1_T_170328_174521_1144
@dabercro, we have a problem with the sites checked by default for the assignment. e.g. https://vocms0113.cern.ch:80/seeworkflow/?workflow=pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223
Could you please take a look?
The reason for that is not all the sites show up in the recovery docs. For example, if I compare with: https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_EGM-PhaseIFall16DR-00014_00022_v0__170324_201657_9223%22&include_docs=true&reduce=false I see no T1_DE_KIT, which is listed as a site that needs recovering in the table, but not checked by default. The same thing for T1_UK_RAL (an enabled site)...
What is the preferred behavior? I thought we wanted to automate using the recovery docs. Maybe my recovery doc query is wrong?
Please, forget my comment, I got completely confused. The sites by default are OK, this is the behavior we want. Sorry!
If you do not mind, could you please delete the rows of exit codes with zero occurrences? Thanks!
Okay, that should be easy.
Just a heads up, I'm working on the Auto/Manual/Ban site selection today. The backend is a little tricky, but I think I almost have it. I hope to update the server tomorrow evening.
Thanks Dan and Ali.
task_TRK-PhaseIFall16GS-00017 is running fine, just a couple of failures but they are not related to the ACDC creation and assignment.
I would like to do another test, in this case, we will change the splitting parameter.
Workflow:pdmvserv_task_EXO-PhaseIFall16GS-00011__v1_T_170309_130412_5503
Action:
{u'EXO-PhaseIFall16GS-00011_0/EXO-PhaseIFall16GS-00011_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_0/EXO-PhaseIFall16DR-00037_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_1': {'sites': u'T1_ES_PIC', 'memory': u''}, u'EXO-PhaseIFall16GS-00011_0/EXO-PhaseIFall16GS-00011_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_0': {'memory': u'', 'sites': u'T1_ES_PIC', 'splitting': u'2x'}, u'EXO-PhaseIFall16GS-00011_0/EXO-PhaseIFall16GS-00011_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_0/EXO-PhaseIFall16DR-00037_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_1/EXO-PhaseIFall16DR-00037_1MergeAODSIMoutput/EXO-PhaseIFall16MiniAOD-00036_0/EXO-PhaseIFall16MiniAOD-00036_0MergeMINIAODSIMoutput': {'sites': u'T1_ES_PIC', 'memory': u''}, 'AllSteps': {'memory': u''}}
@areinsvo , do we know how to modify the splitting to be 2x, 3x and max?
@prozober The script can handle 2x and 3x splitting, but can you clarify what is meant by max splitting?
I tried to run the test on the workflow you suggested, but it failed with the output "I should not be doing splitting for this type of request" because the RequestType is TaskChain and 'InputDataset' is not found in Task1. This bit of code was copied over from the recoveror.py Unified module. Is this check no longer appropriate? Or should it not have failed in this case?
Only one ACDC was created (out of the 3 tasks): areinsvo_ACDC0_task_EXO-PhaseIFall16GS-00011__v1_T_170329_210649_7578 I assume this needs to be aborted until we get the script working to produce all three ACDCs at once?
The lone ACDC was aborted. areinsvo_ACDC0_task_EXO-PhaseIFall16GS-00011__v1_T_170329_210649_7578
Well, bad news @areinsvo: the ACDCs for pdmvserv_task_TRK-PhaseIFall16GS-00017__v1_T_170310_150656_7717 are not okay, and I just realized it. The tasks you mentioned here https://github.com/CMSCompOps/WorkflowWebTools/issues/6#issuecomment-289809010 were the ones we needed to ACDC. But then I went through the six ACDCs we created, and they are related to only 3 tasks. So every one of those tasks was ACDC'd twice.
These are the tasks related to each ACDC:
As you can see, the following tasks are missing:
I need to invalidate the duplicated files. Then I need to create the missing ACDCs through the script. It's better that I work on this in my late afternoon, so we can synchronize our actions.
Yes, I see the issue in my code. The problem came up when I tried to go from the task name provided in the action json to the full task name needed by req mgr. I will work on fixing this today. Early afternoon tomorrow, @prozober, if you want to resubmit the action for that workflow, I can run the script and we can work together to make sure it was fixed properly.
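To make the intended mapping concrete, here is a minimal sketch. It assumes that the full task name is just '/<workflow>/<task path from the action json>', which is my guess at the format and not something confirmed in this thread. The function name is made up, and 'AllSteps' is skipped because it is a settings key in the action, not a task.

```python
def full_task_names(workflow, action_tasks):
    """Map each task key from the action json to an assumed full task name.

    Raises if two action keys end up with the same full name, since that
    would create two ACDCs for one task.
    """
    mapping = {}
    for task in action_tasks:
        if task == 'AllSteps':
            continue  # global settings entry, not a task
        mapping[task] = '/%s/%s' % (workflow, task.strip('/'))
    if len(set(mapping.values())) != len(mapping):
        raise ValueError('Several action tasks map to the same full task name')
    return mapping


# Made-up workflow and task names
names = full_task_names('some_workflow', ['Task_0', 'Task_0/Task_0Merge', 'AllSteps'])
```

The uniqueness check is there so that a mapping bug fails loudly before any ACDC is created.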
I will add the option for 'max' splitting to the script.
@prozober, regarding the issues with splitting, you are right that only EXO-PhaseIFall16GS-00011_0/EXO-PhaseIFall16GS-00011_0MergeRAWSIMoutput/EXO-PhaseIFall16DR-00037_0 needed the splitting changed, and the ACDC for /**/EXO-PhaseIFall16MiniAOD-00036_0MergeMINIAODSIMoutput shouldn't have any problem. The script would have done that, but as soon as one of the ACDCs has problems, it quits and doesn't try to create the rest of the ACDCs.
@vlimant Any comment on whether splitting should be allowed in the case RequestType is TaskChain and 'InputDataset' is not found in Task1? This check was copied over from recoveror.py, but I'm not sure it is valid here.
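For reference, the check as I understand it boils down to something like the sketch below. This is my paraphrase of the behavior described above, not the actual recoveror.py code. The function name is made up, and allowing splitting for every other RequestType is an assumption.

```python
def splitting_allowed(request):
    """Paraphrase of the check described above (not the recoveror.py code).

    `request` is the request dictionary of the original workflow.
    """
    if request.get('RequestType') == 'TaskChain':
        # A TaskChain whose first task has no input dataset is refused
        return 'InputDataset' in request.get('Task1', {})
    # Assumption: other request types are not restricted by this check
    return True


# Made-up request resembling the failing case
request = {'RequestType': 'TaskChain', 'Task1': {'TaskName': 'Task_0'}}
if not splitting_allowed(request):
    message = 'I should not be doing splitting for this type of request'
```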
Max splitting was added and the task names are now treated correctly. Ready for another test when you are @prozober
We ran out of small, low-priority workflows to test. I think we need to submit a couple of backfill workflows. @vlimant, in your opinion, what would be good candidates?
Hi @areinsvo, we have three low priority workflows to test. I just sent the action for all of them.
{"pdmvserv_task_SMP-RunIISummer16DR80Premix-00203__v1_T_170412_214135_9645": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"SMP-RunIISummer16DR80Premix-00203_0/SMP-RunIISummer16DR80Premix-00203_1/SMP-RunIISummer16DR80Premix-00203_1MergeAODSIMoutput": {"sites": ["T2_UK_London_Brunel"], "memory": ""}, "SMP-RunIISummer16DR80Premix-00203_0/SMP-RunIISummer16DR80Premix-00203_1/SMP-RunIISummer16DR80Premix-00203_1MergeAODSIMoutput/SMP-RunIISummer16MiniAODv2-00205_0/SMP-RunIISummer16MiniAODv2-00205_0MergeMINIAODSIMoutput": {"sites": ["T2_UK_London_Brunel"], "memory": ""}}}, "pdmvserv_task_SMP-RunIISummer15wmLHEGS-00115__v1_T_170407_163921_622": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"SMP-RunIISummer15wmLHEGS-00115_0/SMP-RunIISummer15wmLHEGS-00115_0MergeRAWSIMoutput/SMP-RunIISummer16DR80Premix-00201_0": {"sites": ["T1_US_FNAL", "T2_US_UCSD"], "memory": ""}, "SMP-RunIISummer15wmLHEGS-00115_0/SMP-RunIISummer15wmLHEGS-00115_0MergeRAWSIMoutput/SMP-RunIISummer16DR80Premix-00201_0/SMP-RunIISummer16DR80Premix-00201_1": {"sites": "T1_US_FNAL", "memory": ""}, "SMP-RunIISummer15wmLHEGS-00115_0/SMP-RunIISummer15wmLHEGS-00115_0MergeRAWSIMoutput/SMP-RunIISummer16DR80Premix-00201_0/SMP-RunIISummer16DR80Premix-00201_1/SMP-RunIISummer16DR80Premix-00201_1MergeAODSIMoutput/SMP-RunIISummer16MiniAODv2-00203_0": {"sites": ["T1_UK_RAL", "T1_US_FNAL"], "memory": ""}}}, "pdmvserv_task_EXO-RunIISummer15GS-09915__v1_T_170410_123400_282": {"Action": "recover", "Reasons": ["Just a test to see what it looks like."], "user": "prozober", "Parameters": {"EXO-RunIISummer15GS-09915_0/EXO-RunIISummer15GS-09915_0MergeRAWSIMoutput/EXO-RunIISummer16DR80Premix-08938_0/EXO-RunIISummer16DR80Premix-08938_1/EXO-RunIISummer16DR80Premix-08938_1MergeAODSIMoutput/EXO-RunIISummer16MiniAODv2-08873_0": {"xrootd": "enabled", "sites": ["T2_UK_London_Brunel", "T2_UK_London_IC", 
"T2_UK_SGrid_Bristol"], "memory": ""}}}}
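One thing to watch in the json above: "sites" is sometimes a list and sometimes a bare string (for example "sites": "T1_US_FNAL"). A minimal sketch of reading the action defensively is below. The function names and the tiny sample action are made up for illustration.

```python
import json


def normalize_sites(sites):
    """Always return a list, whether 'sites' was a string, a list, or missing."""
    if isinstance(sites, str):
        return [sites] if sites else []
    return list(sites or [])


def sites_per_task(action_json):
    """Return {(workflow, task): [sites]} for every task in the action json."""
    actions = json.loads(action_json)
    result = {}
    for workflow, action in actions.items():
        for task, params in action.get('Parameters', {}).items():
            result[(workflow, task)] = normalize_sites(params.get('sites'))
    return result


# Made-up action in the same layout as the one above
sample = json.dumps({
    'some_workflow': {
        'Action': 'recover',
        'Parameters': {
            'Task_0': {'sites': 'T1_US_FNAL', 'memory': ''},
            'Task_0/Task_1': {'sites': ['T1_UK_RAL', 'T1_US_FNAL'], 'memory': ''},
        },
    },
})
sites = sites_per_task(sample)
```

Without the normalization, iterating over a bare string would yield single characters instead of site names.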
Hi @prozober ,
I'm confused. For the last two workflows, everything looks fine, but for [1], there are no errors listed in the WMErr document I use to make sure we aren't doing partial ACDCs. The script fails because the number of tasks with errors in WMErr doesn't match the number of tasks in the action json. Any idea why that might be happening? [1] pdmvserv_task_SMP-RunIISummer16DR80Premix-00203__v1_T_170412_214135_9645
The ACDCs for pdmvserv_task_EXO-RunIISummer15GS-09915__v1_T_170410_123400_282 and pdmvserv_task_SMP-RunIISummer15wmLHEGS-00115__v1_T_170407_163921_622 have been submitted.
Maybe because the two involved tasks have unreported errors?
https://vocms049.cern.ch/unified/report/pdmvserv_task_SMP-RunIISummer16DR80Premix-00203__v1_T_170412_214135_9645 https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_SMP-RunIISummer16DR80Premix-00203__v1_T_170412_214135_9645%22&include_docs=true&reduce=false
I checked the two workflows left, and the ACDCs look nice. I will keep an eye on them.
@vlimant @prozober @areinsvo @mcremone
Since I don't really know everything the workflow team or Unified needs, feel free to make any comments or pull requests. We can also track the testing progress here.
From Jean-Roch:
although we have a unified wired to testbed
https://cmst2.web.cern.ch/cmst2/unified-testbed/
it might be simpler to have it wired to production and run this in "commissioning mode".
I have the feeling, looking at the example actions, that there are too many parameters passed down. Many of them should not be needed for recovery & clone (proc version, sites, lfn, ...) since all these can be taken from the original workflow and such. Let's tune this to what is actually needed.
We should modify the recoveror module to read the action json, and be able to operate it by hand. The way I see it for a fast integration is:
Viewing Actions
In order to view these actions from inside the CERN network, one can look at https://vocms0113.cern.ch:80/getaction. This shows actions submitted today. You can also pass a parameter "days" to look farther back. For example looking at https://vocms0113.cern.ch:80/getaction?days=20 will show some old testing actions.
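A script can build the same URLs with a small helper like the one below. This is only a sketch: the function name is made up, and actually fetching the page still has to happen from inside the CERN network.

```python
from urllib.parse import urlencode

BASE_URL = 'https://vocms0113.cern.ch:80/getaction'


def getaction_url(days=None):
    """Build the getaction URL, optionally looking `days` days back."""
    if days is None:
        # No parameter: the page shows actions submitted today
        return BASE_URL
    return BASE_URL + '?' + urlencode({'days': int(days)})


url = getaction_url(20)
```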
Changing Parameters
To make it easier for everyone to track and comment on parameters, they are generated with these variables here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L145-L184 The variable params results in a "decrease" "same" "increase" table, texts and bools are just text and "true/false" fields, and the opts variable results in more general radio buttons.
The site list is generated here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/templates/workflowtables.html#L19-L27 and the form field is then made here: https://github.com/CMSCompOps/WorkflowWebTools/blob/d167a94ff822d7a80d3350eeadc4efe014621f75/runserver/static/js/addreason.js#L231-L236