dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Agent Job Re-submission Project #11881

Open LinaresToine opened 10 months ago

LinaresToine commented 10 months ago

Impact of the new feature
Impact on the WMAgent.

Is your feature request related to a problem? Please describe.
There are exit codes for which jobs are simply retried without really modifying anything, when in reality something should or could be modified before resubmission.

Describe the solution you'd like
For example, for exit code 50660, raised when a job requires more memory, the resubmission should update the pkl file and the sandbox with a higher memory limit before resubmitting the failed job. Other exit codes of interest should have a similar dedicated resubmission procedure.

Describe alternatives you've considered
For now we have only given thought to the retry process of high-memory jobs, although more additions should come with this project, since the main motivation is to make the retry process more automatic for the exit codes that allow it. For this, we propose a set of functions that modify the job parameters and that are used by the RetryManager when dealing with a specific error code.

For the retry of high-memory jobs, changing the maxPSS parameter requires modifying the job sandbox as well as the job.pkl file. A function that takes care of such modifications should take a job id as a parameter; it should also either define a new maxPSS or receive it as a parameter, too (see the sketch below).
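A minimal sketch of what such a function could look like, assuming the job pickle deserializes to a dictionary carrying an 'estimatedMemoryUsage' key; the function name, the default memory step, and the idea of passing the pkl path directly are illustrative and not existing WMCore code:

```python
import pickle


def raiseJobMaxPSS(jobPklPath, newMaxPSS=None, memoryStep=2000):
    """
    Illustrative only: load a job.pkl file, raise its memory requirement,
    and write it back. A real hook would receive the job id and resolve
    the cache directory (and the sandbox) from WMBS.
    """
    with open(jobPklPath, "rb") as handle:
        job = pickle.load(handle)

    currentMemory = job.get("estimatedMemoryUsage") or 0
    if newMaxPSS is None:
        newMaxPSS = currentMemory + memoryStep
    job["estimatedMemoryUsage"] = newMaxPSS

    with open(jobPklPath, "wb") as handle:
        pickle.dump(job, handle)

    return newMaxPSS
```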


LinaresToine commented 10 months ago

To modify the job.pkl file, the first thing is to get the path of that file. I see that this line of code captures all the job information from the database and stores it in the variable loadAction: https://github.com/dmwm/WMCore/blob/9c6e83d1d23983c0296eee318c9e6255ff80d01b/src/python/WMComponent/RetryManager/RetryManagerPoller.py#L219

Then, in https://github.com/dmwm/WMCore/blob/9c6e83d1d23983c0296eee318c9e6255ff80d01b/src/python/WMComponent/RetryManager/RetryManagerPoller.py#L226, a new variable 'result' is created. This variable is the output of the 'execute' function in: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52, which is a dictionary.

The cache dir is the information of interest, and I am not 100% sure whether it will simply be a key of that dictionary, since there is some formatting going on.

LinaresToine commented 9 months ago

An update on my previous comment:

The 'execute' function returns a list in which each element is a dictionary with the result of the SQL query for one job id in the input list: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52
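As a rough illustration of how the cache directory could then be pulled out of that list, assuming each per-job dictionary exposes 'cache_dir' and 'id' keys (the key names and the exact execute() signature are assumptions to be checked against LoadFromID):

```python
import os


def getJobPklPaths(loadAction, jobIDs):
    """
    Illustrative only: run the LoadFromID DAO for a list of job ids and
    build the expected job.pkl path from each job's cache directory.
    """
    results = loadAction.execute(jobID=jobIDs)  # list of dicts, one per job id
    pklPaths = {}
    for jobInfo in results:
        cacheDir = jobInfo.get("cache_dir")
        if cacheDir:
            pklPaths[jobInfo["id"]] = os.path.join(cacheDir, "job.pkl")
    return pklPaths
```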

germanfgv commented 9 months ago

@amaltaro @todor-ivanov could you take a look at the proposed solution here: https://github.com/LinaresToine/WMCore/pull/3

In summary:

amaltaro commented 8 months ago

@germanfgv @LinaresToine apologies for the delay on getting back to this.

The idea looks good in general, but I do have a few concerns and further comments to be considered:

a) workload.pkl is shared among all the jobs, from the WMSandbox area. Which means, if one job changes it, those changes will be visible to any other job. So this is something that needs to be further investigated.

b) changing the job.pkl file means that files need to be changed in the filesystem. That initially does not look like a great idea (compared to in-memory or database changes), but given that only jobs with a given error code would go through this, I think we should proceed with it.

c) monitoring!!! At the moment, the only way I see to know whether a job was customized or not would be through the agent logs (ComponentLog of the component). If everyone agrees, we can probably move forward with this, but that means we cannot commit to debugging such cases.

LinaresToine commented 8 months ago

Thank you very much @amaltaro for your comments. We shall take care of the sandbox change so that it only happens when a job's new memory is greater than the one in the sandbox. @germanfgv, any ideas on this?

A PR to the WMCore master branch was created for adequate tracking of the progress. https://github.com/dmwm/WMCore/pull/11928

LinaresToine commented 7 months ago

Hello @amaltaro. The PR was updated so that jobs get modified by task rather than by sandbox. On a separate note, in the tests we have performed so far the JobCreator complains about the pkl files being truncated. Would you have an idea on how to work around this?

LinaresToine commented 6 months ago

For clarity, the error I have stumbled upon is:

```
Failed to execute JobCreator. Error: pickle data was truncated
Traceback (most recent call last):
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 376, in algorithm
    self.pollSubscriptions()
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 440, in pollSubscriptions
    wmWorkload = retrieveWMSpec(workflow=workflow)
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 47, in retrieveWMSpec
    wmWorkload.load(wmWorkloadURL)
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMCore/WMSpec/Persistency.py", line 65, in load
    self.data = pickle.load(handle)
_pickle.UnpicklingError: pickle data was truncated
2024-04-29 17:59:20,954:140150441658112:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace: <WMComponent.JobCreator.JobCreatorPoller.JobCreatorPoller object at 0x7f775f72dfa0>
<@========== WMException Start ==========@>
Exception Class: JobCreatorException
Message: Failed to execute JobCreator. Error: pickle data was truncated
ClassName : None
ModuleName : WMComponent.JobCreator.JobCreatorPoller
MethodName : algorithm
ClassInstance : None
FileName : /data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py
LineNumber : 397
ErrorNr : 0
```

What I observe after several tests is that this error happens rather randomly; it is not always reproduced. I believe it comes from how pickle loads and manages the data of the pickle file, and for some reason it does not always tolerate changes in the pickle data. This is a problem that gets in the way of these automatic retries of high-memory jobs, because modifying the memory requires modifying the job pickle file.

I would appreciate any guidance or ideas on how to tackle this problem.
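In case it helps the discussion, here is a minimal sketch of an atomic rewrite of the pickle file, assuming (unconfirmed) that the truncation happens because the JobCreator reads the file while another component is in the middle of rewriting it:

```python
import os
import pickle
import tempfile


def atomicPickleDump(obj, targetPath):
    """
    Write the pickle to a temporary file in the same directory and then
    atomically replace the target, so a concurrent reader never sees a
    half-written file.
    """
    dirName = os.path.dirname(targetPath)
    fd, tmpPath = tempfile.mkstemp(dir=dirName, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmpPath, targetPath)  # atomic on POSIX filesystems
    except Exception:
        if os.path.exists(tmpPath):
            os.remove(tmpPath)
        raise
```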

LinaresToine commented 4 months ago

Hello @amaltaro ,

I believe the patch is ready for review. The only addition that remains to be tested is the change in ErrorHandlerPoller.py, the one described in https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/ErrorHandler/ErrorHandlerPoller.py#L11

In the replays I ran, I took advantage of the PauseAlgo parameter, which allows you to retry jobs an arbitrary number of times according to their job type and exit code. Specifically:

```python
config.RetryManager.PauseAlgo.section_('Processing')
config.RetryManager.PauseAlgo.Processing.retryErrorCodes = {70: 0, 50660: 0, 50661: 1, 50664: 0, 71304: 1}
```

Since Central Production does not use PauseAlgo, I thought adding the changes in the ErrorHandler was the easiest way. Please let me know what you think.

Also, the maxPSS parameter of a sandbox is not easily accessible, so I decided to keep track of that value in a dictionary called dataDict, which is persisted in a JSON file in the RetryManager component directory: https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/RetryManager/Modifier/BaseModifier.py#L34
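Roughly, the bookkeeping works like the simplified sketch below; the class name, file name, and key layout are illustrative rather than the exact BaseModifier code:

```python
import json
import os


class MemoryRecord:
    """Keep the last maxPSS assigned per task in a JSON file."""

    def __init__(self, componentDir, fileName="modifiedJobs.json"):
        self.recordFile = os.path.join(componentDir, fileName)

    def load(self):
        # Return the stored dictionary, or an empty one on first use
        if os.path.exists(self.recordFile):
            with open(self.recordFile) as handle:
                return json.load(handle)
        return {}

    def save(self, dataDict):
        with open(self.recordFile, "w") as handle:
            json.dump(dataDict, handle, indent=2)

    def updateTask(self, taskName, newMaxPSS):
        # Record the new memory assigned to this task
        dataDict = self.load()
        dataDict[taskName] = newMaxPSS
        self.save(dataDict)
        return dataDict
```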

Finally, after several replays, the jobs are being resumed automatically and successfully, with no mismatch between job['estimatedMemoryUsage'] and maxPSS. Also, to minimize the number of jobs affected by a sandbox modification, the maxPSS is changed per task rather than for the entire sandbox.

All the changes proposed can be seen in https://github.com/LinaresToine/WMCore/pull/3

Thanks again for your time and attention.

LinaresToine commented 3 months ago

Hello.

A quick update on what is going on with this issue. The patch in https://github.com/dmwm/WMCore/pull/11928 was tested in a T0 agent and all jobs get modified successfully. Additional changes were required for a central production agent, given that it does not use the PauseAlgo, which allows for multiple retries of a failed job with a given exit code. That modification is in the ErrorHandler. I talked with @hassan11196 to get the patch tested in a central production agent.

I would also like to note that the patch currently keeps the data of the retried jobs and the new memories used in a JSON file in the RetryManager component directory. I believe a more elegant way to do it would be to adapt the Oracle database to hold this data, something like having the maxPSS data available in a WMBS table, as well as the job estimatedMemoryUsage somewhere in there, for better bookkeeping. @amaltaro what do you think?