CMSCompOps / WorkflowWebTools

https://workflowwebtools.readthedocs.io
1 stars 7 forks source link

Issue changing the memory value #97

Open paorozo opened 6 years ago

paorozo commented 6 years ago

We have a problem with this ACDC https://cmsweb.cern.ch/reqmgr2/fetch?rid=vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651 I changed the memory using the recovery tool. When I check the request's JSON in reqmgr, this is the task configuration:

"Task2": {
    "KeepOutput": true, 
    "PrepID": "HIN-HINPbPbSpring18DR-00001", 
    "InputFromOutputModule": "RAWSIMoutput", 
    "GlobalTag": "100X_upgrade2018_realistic_v10", 
    "TimePerEvent": 16.7631, 
    "FilterEfficiency": 1, 
    "SplittingAlgo": "EventAwareLumiBased", 
    "ProcessingString": "NoPU_100X_upgrade2018_realistic_v10", 
    "ScramArch": [
      "slc6_amd64_gcc630"
    ], 
    "SizePerEvent": 16234.1829, 
    "ConfigCacheID": "76ef2a5da99194730d69d73d8b52df82", 
    "Memory": "8000", 
    "Multicore": 4, 
    "TaskName": "HIN-HINPbPbSpring18DR-00001_0", 
    "AcquisitionEra": "HINPbPbSpring18DR", 
    "PrimaryDataset": "Hydjet_Quenched_Cymbal5Ev8_PbPbMinBias_5020GeV_2018", 
    "Campaign": "HINPbPbSpring18DR", 
    "InputTask": "HIN-HINPbPbSpring18GS-00001_0", 
    "CMSSWVersion": "CMSSW_10_0_3"
  }, 

But in config: https://cmsweb.cern.ch/reqmgr2/config?name=vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651

vlimant_ACDC0_task_HIN-HINPbPbSpring18GS-00001__v1_T_180312_140104_4651.tasks.HIN-HINPbPbSpring18GS-00001_0.input.splitting.performance.memoryRequirement = 4695.0

There might be something broken at the actor side, @vlimant, @areinsvo could you please help me to take a look?

areinsvo commented 6 years ago

Looking at the actor logs I don't see anything suspicious. @vlimant any idea where I should try to track down the issue? https://cms-unified.web.cern.ch/cms-unified//logs/actor/2018-03-12_14:00:48.log

vlimant commented 6 years ago

I think, and @amaltaro will confirm, that for memory to have an effect on ACDC, it has to be set at assignment time.

vlimant commented 6 years ago

and it used to "work" because the MaxRSS was updated at assignment time (using "Memory": "8000") while now it's slaved to the spec "memoryRequirement = 4695.0"

amaltaro commented 6 years ago

It works during creation as well, but mind this small detail: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/Resubmission.py#L44

which means, if you're ACDCing a TaskChain workflow, then Memory argument has to have a dictionary value.

vlimant commented 6 years ago

it's done this way https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/actor.py#L70

vlimant commented 6 years ago

wait. do you mean that the value in the nested TaskX do not matter, but the base Memory Parameter has to be a dict with Task:Memory ?

amaltaro commented 6 years ago

For Resubmission, yes, that's correct! We don't re-evaluate all the parameters and call the setters, ACDC simply truncates the original workload (so there are no attributes changed). Honestly, I think ACDCs should not support any updates during creation, only during assignment (as it already does).

vlimant commented 6 years ago

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/Resubmission.py#L44 and the following lines makes -no sense- what so ever, except that it is a copy paste from the assignment code.

because of a time de-correlation between creation time and assignement time, the change at creation should be allowed and supported without having to do unnatural conversions.

Can you please motivate why it should only be done at assignement time (if not only for practical reason of coding this in wmcore) ?

Bo, at least we know why the ACDC are failing now. we dropped the maxrss overriding by unified.

@areinsvo can you please go ahead and change actor so that it creates a dictionary TaskName:Memory and set payload['Memory'] = that_dictionnary.

amaltaro commented 6 years ago

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/Resubmission.py#L44 and the following lines makes -no sense- what so ever, except that it is a copy paste from the assignment code.

I'm pretty sure there was a reason to make it that complicated, unfortunately I don't remember and can't find what was that. I'll look at it again and see if we can remove this over-complication.

The reason it shouldn't be supported during creation time is:

I think these are pretty good reasons ;)

areinsvo commented 6 years ago

https://github.com/CMSCompOps/WmAgentScripts/commit/0becde6f43139efc8bb79e09ee894ba31af272b9#diff-699b8f6dbca6e1b3cf8365e884aaaf0e Memory is now passed as a dict for task chains. @prozober please submit a test workflow and let me know if it doesn't work.

paorozo commented 6 years ago

@areinsvo, I created this ACDC using our tool https://cmsweb.cern.ch/reqmgr2/config?name=vlimant_ACDC1_task_HIN-HINPbPbSpring18GS-00001__v1_T_180316_112116_2010

I checked it and I saw: vlimant_ACDC1_task_HIN-HINPbPbSpring18GS-00001__v1_T_180316_112116_2010.tasks.HIN-HINPbPbSpring18GS-00001_0.input.splitting.performance.memoryRequirement = '8000' The memory requirement is a string, this is not OK it should be an integer. I am not quite sure what kind of failure it will bring @vlimant

areinsvo commented 6 years ago

You're right. I was missing some int() values. Can you resubmit the action?