dmwm / WMCore

Core workflow management components for CMS.

Memory is inactive in recovery procedure #7132

Closed: vlimant closed this issue 7 years ago

vlimant commented 7 years ago

Here is yet another issue I found with the recovery procedure. The request https://cmsweb.cern.ch/reqmgr/view/details/vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_121213_7651

was created with the following dictionary:

{"createRequest": {"InitialTaskPath": "/jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872/DataProcessing", "OriginalRequestName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872", "CollectionName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872_74d6cf4e-6f62-11e6-92d6-02163e00f196", "PrepID": "ReReco-HIRun2015-02May2016-0004", "Campaign": "HIRun2015", "Requestor": "vlimant", "RequestPriority": 900000.0, "ACDCDatabase": "acdcserver", "Memory" : 2300, "TimePerEvent": 6.0, "RequestType": "Resubmission", "ACDCServer": "https://cmsweb.cern.ch/couchdb", "SizePerEvent": 300, "Group": "DATAOPS", "IgnoredOutputModules": [], "RequestString": "recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016758p4"}, "changeSplitting": {"DataProcessing": {"SplittingAlgo": "LumiBased", "halt_job_on_file_boundaries": "True", "lumis_per_job": 1}}, "assignRequest": {"MaxRSS": 2411724, "Team": "production", "UnmergedLFNBase": "/store/unmerged", "Dashboard": "reprocessing", "MaxVSize": 20411724, "SiteWhitelist": ["T2_US_Vanderbilt"], "MergedLFNBase": "/store/hidata", "AcquisitionEra": "HIRun2015", "ProcessingString": "02May2016", "ProcessingVersion": 2}}

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_121213_7651.request.schema.Memory = 2300

but

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_121213_7651.tasks.DataProcessing.input.splitting.performance.memoryRequirement = 9000.0

is picked up from "somewhere" other than the submitted Memory value.

Creating a second request, https://cmsweb.cern.ch/reqmgr/view/details/vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_123305_8955, with the same dictionary but with Memory removed:

{"createRequest": {"InitialTaskPath": "/jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872/DataProcessing", "OriginalRequestName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872", "CollectionName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872_74d6cf4e-6f62-11e6-92d6-02163e00f196", "PrepID": "ReReco-HIRun2015-02May2016-0004", "Campaign": "HIRun2015", "Requestor": "vlimant", "RequestPriority": 900000.0, "ACDCDatabase": "acdcserver", "TimePerEvent": 6.0, "RequestType": "Resubmission", "ACDCServer": "https://cmsweb.cern.ch/couchdb", "SizePerEvent": 300, "Group": "DATAOPS", "IgnoredOutputModules": [], "RequestString": "recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016758p4"}, "changeSplitting": {"DataProcessing": {"SplittingAlgo": "LumiBased", "halt_job_on_file_boundaries": "True", "lumis_per_job": 1}}, "assignRequest": {"MaxRSS": 2411724, "Team": "production", "UnmergedLFNBase": "/store/unmerged", "Dashboard": "reprocessing", "MaxVSize": 20411724, "SiteWhitelist": ["T2_US_Vanderbilt"], "MergedLFNBase": "/store/hidata", "AcquisitionEra": "HIRun2015", "ProcessingString": "02May2016", "ProcessingVersion": 2}}

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_123305_8955.request.schema.Memory = 9000

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_123305_8955.tasks.DataProcessing.input.splitting.performance.memoryRequirement = 9000.0

Now creating a third request, https://cmsweb.cern.ch/reqmgr/view/details/vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_124752_3839, with Memory = 12000 (12G):

{"createRequest": {"InitialTaskPath": "/jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872/DataProcessing", "OriginalRequestName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872", "CollectionName": "jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4_160816_142741_8872_74d6cf4e-6f62-11e6-92d6-02163e00f196", "PrepID": "ReReco-HIRun2015-02May2016-0004", "Campaign": "HIRun2015", "Requestor": "vlimant", "RequestPriority": 900000.0, "ACDCDatabase": "acdcserver", "Memory" : 12000, "TimePerEvent": 6.0, "RequestType": "Resubmission", "ACDCServer": "https://cmsweb.cern.ch/couchdb", "SizePerEvent": 300, "Group": "DATAOPS", "IgnoredOutputModules": [], "RequestString": "recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016758p4"}, "changeSplitting": {"DataProcessing": {"SplittingAlgo": "LumiBased", "halt_job_on_file_boundaries": "True", "lumis_per_job": 1}}, "assignRequest": {"MaxRSS": 2411724, "Team": "production", "UnmergedLFNBase": "/store/unmerged", "Dashboard": "reprocessing", "MaxVSize": 20411724, "SiteWhitelist": ["T2_US_Vanderbilt"], "MergedLFNBase": "/store/hidata", "AcquisitionEra": "HIRun2015", "ProcessingString": "02May2016", "ProcessingVersion": 2}}

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_124752_3839.request.schema.Memory = 12000

vlimant_recovery-0-jen_a_HIRun2015-HIHardProbesPeripheral-02May2016_758p4__160831_124752_3839.tasks.DataProcessing.input.splitting.performance.memoryRequirement = 9000.0

This means that we cannot properly set the memory requirement in this recovery procedure: the task-level memoryRequirement stays at 9000.0 whatever Memory is submitted. MaxRSS is one handle, but for job matching it is Memory that matters.
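The three trials above can be tabulated with a short Python sketch. The numbers are copied from the requests in this comment; the field names (`requested`, `schema_memory`, `task_memory`) and the helper `memory_mismatches` are made up for illustration and are not part of WMCore.

```python
# Values copied from the three recovery requests described above.
# The dictionary keys are illustrative names, not WMCore attributes.
trials = [
    {"label": "Memory=2300", "requested": 2300,
     "schema_memory": 2300, "task_memory": 9000.0},
    {"label": "Memory omitted", "requested": None,
     "schema_memory": 9000, "task_memory": 9000.0},
    {"label": "Memory=12000", "requested": 12000,
     "schema_memory": 12000, "task_memory": 9000.0},
]


def memory_mismatches(trials):
    """Return labels of trials whose task-level memoryRequirement
    differs from the Memory value that was explicitly submitted."""
    return [
        t["label"]
        for t in trials
        if t["requested"] is not None and t["task_memory"] != t["requested"]
    ]


mismatched = memory_mismatches(trials)
distinct_task_values = {t["task_memory"] for t in trials}

print("Trials where the submitted Memory was not applied:", mismatched)
print("Distinct task-level values observed:", distinct_task_values)
```

In every trial where Memory was submitted, request.schema.Memory follows it but the task-level value does not, which is the mismatch reported here.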

I want to double-check the behavior of a regular ACDC with an adjusted Memory parameter, but I expect it to be the same.

amaltaro commented 7 years ago

How did you assign it? Would you have the dictionary used for assignment?

vlimant commented 7 years ago

They are not assigned at all. As far as I know, the assignRequest block is not used in that case (assignment is done either on the web or by script).

jenimal commented 7 years ago

Right now, we assign ReReco via the web interface so we can adjust the memory. This needs to be done for multicore workflows, so it is a "knob" we need in both the scripts and the web interface if we want to remain flexible.

jenimal commented 7 years ago

When you make ACDCs, the memory isn't copied from the parent workflow either; you always need to adjust that parameter manually when assigning.

vlimant commented 7 years ago

@jenimal Memory is different from MaxRSS, and we have to get this right all the way through.

vlimant commented 7 years ago

Any lead on solving this?

amaltaro commented 7 years ago

The testbed deadline is this evening, but I will try to get it done tomorrow or Wednesday; I just need to finish some other bug fixes first.

amaltaro commented 7 years ago

Are there any other creation parameters that you'd need to override?

vlimant commented 7 years ago

TimePerEvent maybe, so that we get a better RequestTime classad. I cannot think of anything else right now. Maybe you have suggestions?

amaltaro commented 7 years ago

On second thought, since it's a resubmission, it makes sense to make it a clone of the original request (or as close to it as possible).

Hence, my suggestion in this case would be to override Memory during assignment. We had a strong push in the past to make it available during assignment, so can you please try it?
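As a rough sketch of what overriding Memory during assignment could look like on the client side, the assignRequest block from the dictionaries above can be extended with a Memory entry before it is submitted. The helper `with_assignment_overrides` is hypothetical, and whether the assignment step actually honors Memory for a Resubmission request is exactly what is being asked to be tested here, so this only shows how the arguments would be built.

```python
import copy

# assignRequest block copied from the dictionaries posted in this issue.
ASSIGN_REQUEST = {
    "MaxRSS": 2411724,
    "Team": "production",
    "UnmergedLFNBase": "/store/unmerged",
    "Dashboard": "reprocessing",
    "MaxVSize": 20411724,
    "SiteWhitelist": ["T2_US_Vanderbilt"],
    "MergedLFNBase": "/store/hidata",
    "AcquisitionEra": "HIRun2015",
    "ProcessingString": "02May2016",
    "ProcessingVersion": 2,
}


def with_assignment_overrides(assign_args, **overrides):
    """Hypothetical helper: return a copy of the assignment arguments
    with the given parameters added or replaced. The input is not modified."""
    merged = copy.deepcopy(assign_args)
    merged.update(overrides)
    return merged


# Assumption: Memory is accepted as an assignment parameter.
assign_args = with_assignment_overrides(ASSIGN_REQUEST, Memory=2300)
print(assign_args["Memory"], assign_args["MaxRSS"])
```

Copying the dictionary rather than editing it in place keeps the original assignment arguments available for comparison if the override turns out not to be applied.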