Open paorozo opened 6 years ago
Looking at the actor logs I don't see anything suspicious. @vlimant any idea where I should try to track down the issue? https://cms-unified.web.cern.ch/cms-unified//logs/actor/2018-03-12_14:00:48.log
I think, and @amaltaro will confirm, that for memory to have an effect on ACDC, it has to be set at assignment time.
It used to "work" because MaxRSS was updated at assignment time (using "Memory": "8000"), whereas now it is tied to the spec value "memoryRequirement = 4695.0".
It works during creation as well, but mind this small detail: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/Resubmission.py#L44
which means that if you're ACDCing a TaskChain workflow, the Memory argument has to have a dictionary value.
Wait. Do you mean that the values in the nested TaskX do not matter, but the top-level Memory parameter has to be a dict of TaskName:Memory?
For Resubmission, yes, that's correct! We don't re-evaluate all the parameters and call the setters, ACDC simply truncates the original workload (so there are no attributes changed). Honestly, I think ACDCs should not support any updates during creation, only during assignment (as it already does).
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/Resubmission.py#L44 and the following lines make -no sense- whatsoever, except as a copy-paste from the assignment code.
Because creation time and assignment time are decoupled, changing the memory at creation should be allowed and supported without having to do unnatural conversions.
Can you please explain why it should only be done at assignment time (other than for the practical reason of how this is coded in WMCore)?
So, at least we know why the ACDCs are failing now: we dropped the MaxRSS override done by Unified.
@areinsvo can you please go ahead and change actor so that it creates a dictionary TaskName:Memory and sets payload['Memory'] = that_dictionary.
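For reference, a minimal sketch of what that dictionary could look like. This is not the actual actor code: the helper name build_task_memory and the task names are hypothetical, and it assumes the original request uses the usual TaskChain layout (a "TaskChain" count plus "Task1".."TaskN" entries, each with a "TaskName").

```python
def build_task_memory(request, memory_mb):
    """Return {TaskName: memory_mb} for every task of a TaskChain request.

    Hypothetical helper: assumes `request` carries a "TaskChain" count and
    "Task1".."TaskN" dictionaries, each with a "TaskName" key.
    """
    n_tasks = int(request.get("TaskChain", 0))
    memory = {}
    for i in range(1, n_tasks + 1):
        task = request["Task%d" % i]
        # Cast to int so the value is never stored as a string.
        memory[task["TaskName"]] = int(memory_mb)
    return memory


# Illustrative request; the task names are made up.
original = {
    "RequestType": "TaskChain",
    "TaskChain": 2,
    "Task1": {"TaskName": "GenSim"},
    "Task2": {"TaskName": "Digi"},
}

payload = {"RequestType": "Resubmission"}
payload["Memory"] = build_task_memory(original, "8000")
```

The same memory value is applied to every task here for simplicity; a per-task value would just mean passing a mapping instead of a single number.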
I'm pretty sure there was a reason to make it that complicated; unfortunately I don't remember it and can't find what it was. I'll look at it again and see if we can remove this over-complication.
The reason it shouldn't be supported during creation time is:
I think these are pretty good reasons ;)
https://github.com/CMSCompOps/WmAgentScripts/commit/0becde6f43139efc8bb79e09ee894ba31af272b9#diff-699b8f6dbca6e1b3cf8365e884aaaf0e Memory is now passed as a dict for task chains. @prozober please submit a test workflow and let me know if it doesn't work.
@areinsvo, I created this ACDC using our tool https://cmsweb.cern.ch/reqmgr2/config?name=vlimant_ACDC1_task_HIN-HINPbPbSpring18GS-00001__v1_T_180316_112116_2010
I checked it and I saw: vlimant_ACDC1_task_HIN-HINPbPbSpring18GS-00001__v1_T_180316_112116_2010.tasks.HIN-HINPbPbSpring18GS-00001_0.input.splitting.performance.memoryRequirement = '8000'. The memory requirement is a string; this is not OK, it should be an integer. I am not quite sure what kind of failure it will cause, @vlimant.
You're right. I was missing some int() values. Can you resubmit the action?
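A small sketch of the kind of conversion meant here, so that neither a plain value nor a per-task dictionary ends up with string entries. The function name coerce_memory is hypothetical and is not taken from the actor code or from WMCore.

```python
def coerce_memory(value):
    """Convert a memory setting to int, for a scalar or a {task: memory} dict.

    Hypothetical helper: accepts strings such as '8000' and floats such as
    4695.0, and truncates any fractional part.
    """
    if isinstance(value, dict):
        return {task: coerce_memory(mem) for task, mem in value.items()}
    # Go through float first so that strings like '4695.0' are accepted too.
    return int(float(value))


scalar = coerce_memory("8000")
per_task = coerce_memory({"TaskA": "8000", "TaskB": 4695.0})
```

Here "TaskA" and "TaskB" are placeholder task names used only for illustration.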
We have a problem with this ACDC https://cmsweb.cern.ch/reqmgr2/fetch?rid=vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651 I changed the memory using the recovery tool. When I check the request's JSON in reqmgr, this is the task configuration:
But in config: https://cmsweb.cern.ch/reqmgr2/config?name=vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651
vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651.tasks.HIN-HINPbPbSpring18GS-00001_0.input.splitting.performance.memoryRequirement = 4695.0
There might be something broken on the actor side. @vlimant, @areinsvo, could you please help me take a look?