
Run3 ReReco Processing Jobs hit memory limit and get killed at KIT-HoreKa Subsite (and possibly FNAL) #45028

ArturAkh commented 3 months ago

Following the discussion started via https://mattermost.web.cern.ch/cms-o-and-c/pl/bynj9zdcc7raxdrufm7fjjquqr

we agreed that I would collect here the information on jobs failing due to the memory limit, so that it is available offline.

Over the past few days, we at KIT observed that many Run 3 ReReco Processing jobs scheduled to our subsite KIT-HoreKa were failing with an out-of-memory error (exit code 50660) and were consequently killed by the CMS HTCondor system / job wrapper.

We concluded that this is not primarily a site issue, since a large fraction of the same jobs running at FNAL were also failing. To check this, we looked at the following OpenSearch visualization:

https://monit-opensearch.cern.ch/dashboards/goto/3287b96f35862f3567a05e3d5ce47efd

To check which workflows were affected, we adapted the visualization to group by request name:

https://monit-opensearch.cern.ch/dashboards/goto/dadc482043513e0c57c3f943b0fd366a

Another view of the same data is available in Grafana:

https://monit-grafana.cern.ch/goto/Jp57gLEIg?orgId=11

A striking feature of these jobs, seen when comparing with equivalent jobs on Tier 1 KIT resources, is that the memory usage distribution is shifted between Tier 1 KIT and KIT-HoreKa (see the attached picture).

[image: memory usage distribution of equivalent jobs, Tier 1 KIT vs. KIT-HoreKa]

While jobs at Tier 1 KIT used on average around 7000 MB of memory for 4 cores, equivalent jobs at HoreKa were hitting the 8000 MB threshold set for 4 cores. The main difference between the machines is the host OS: the Tier 1 nodes run el7, while the KIT-HoreKa machines run rhel8.

A similar increase in memory usage is also visible for equivalent FNAL jobs, where the host OS is el9 in most cases.

To study this further, we tried to look at the job logs. Unfortunately, these were not uploaded to EOS because the jobs were killed, so a workaround was needed. Based on the following OpenSearch query (please extend the time range back to May 22, 2024):

https://monit-opensearch.cern.ch/dashboards/goto/7414e6b0643a334bad779141933fe246

We picked a few of the failed jobs at KIT-HoreKa for which no data.EOSLogURL was provided. Example:

https://monit-opensearch.cern.ch/dashboards/app/discover/#/doc/60770470-8326-11ea-88fc-cfaa9841e350/monit_prod_wmarchive_raw_metric-2024-05-23?id=46d64111f2f64477a48d69d71ef9f931

From the data.LFNArray entry, the path to the log file can be extracted, and the file can then be downloaded from the KIT storage endpoint root://cmsdcache-kit-disk.gridka.de:

root://cmsdcache-kit-disk.gridka.de//store/unmerged/logs/prod/2024/5/23/pdmvserv_Run2024C_JetMET1_ECALRATIO_240521_151559_1672/DataProcessing/0003/1/d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz
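
For reference, a minimal shell sketch of how such a log archive can be fetched and inspected (assuming a valid grid/VOMS proxy and an XRootD client; the layout inside the archive may vary):

```sh
# copy the log archive from KIT storage to the local directory
xrdcp root://cmsdcache-kit-disk.gridka.de//store/unmerged/logs/prod/2024/5/23/pdmvserv_Run2024C_JetMET1_ECALRATIO_240521_151559_1672/DataProcessing/0003/1/d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz .

# list the contents, then unpack; the cmsRun stdout with the memory printouts is inside
tar -tzf d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz
tar -xzf d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz
```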

In addition, the data.steps entry of the example OpenSearch record provides the error message:

 "details": "Error in CMSSW step cmsRun1\nNumber of Cores: 4\nJob has exceeded maxPSS: 8000 MB\nJob has PSS: 8085 MB\n",

For further studies, we downloaded log files from FNAL and KIT from the following folders:

/store/unmerged/logs/prod/2024/5/{21,22}/pdmvserv_Run2024C_{EGamma,JetMET}*

(other days can be included if necessary)

So far, our observation at KIT is that memory usage seems to differ when running an el8 container on an el7 host (lower usage) compared to an el8/el9 host (higher usage). We do not yet have a concrete explanation for this observation; the investigation at KIT is ongoing.

For the time being, Processing jobs are no longer accepted at KIT-HoreKa, to avoid a high failure rate and wasted CPU until this problem is better understood.

Furthermore, these workflows might benefit from an increased memory request when (re)submitted, to prevent such failures.

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

A new Issue was created by @ArturAkh.

@smuzaffar, @antoniovilela, @sextonkennedy, @rappoccio, @makortel, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 months ago

assign core,pdmv

cmsbuild commented 3 months ago

New categories assigned: core,pdmv

@Dr15Jones,@AdrianoDee,@sunilUIET,@miquork,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 months ago

> So far, our observation at KIT is that memory usage seems to differ when running an el8 container on an el7 host (lower usage) compared to an el8/el9 host (higher usage). We do not yet have a concrete explanation for this observation; the investigation at KIT is ongoing.

We have another report where the same job on an el9 host uses significantly more (of order 20%) RSS than on an el8 or el7 host: https://github.com/cms-sw/cmssw/issues/42929

makortel commented 3 months ago

> root://cmsdcache-kit-disk.gridka.de//store/unmerged/logs/prod/2024/5/23/pdmvserv_Run2024C_JetMET1_ECALRATIO_240521_151559_1672/DataProcessing/0003/1/d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz

Looking at the log, the job ran on CMSSW_14_0_7. The first MemoryCheck printout was

%MSG-w MemoryCheck:  source 23-May-2024 15:45:59 CEST PreSource
MemoryCheck: module source:source VSIZE 9637.79 0 RSS 6785.51 3.91797
%MSG
Begin processing the 5th record. Run 380005, Event 903899790, LumiSection 535 on stream 1 at 23-May-2024 15:45:59.648 CEST

6.6 GB sounds like a lot for a 4-thread job, compared to the expectation of 2 GB/thread. From the printouts I see DQM modules being included in the job.
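
For anyone inspecting other logs: these MemoryCheck messages come from the SimpleMemoryCheck service enabled in production jobs, so the RSS evolution can be followed directly from the cmsRun stdout, for example (the log file name inside the archive is an assumption):

```sh
# follow the RSS growth through the job: event markers plus MemoryCheck printouts
grep -E "Begin processing the|MemoryCheck: module" cmsRun1-stdout.log | less

# peak values and the "Key events increasing rss" summary printed at the end of the job
grep -A8 "MemoryReport> Peak rss size" cmsRun1-stdout.log
```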

ArturAkh commented 3 months ago

> Looking at the log, the job ran on CMSSW_14_0_7. The first MemoryCheck printout was
>
> %MSG-w MemoryCheck:  source 23-May-2024 15:45:59 CEST PreSource
> MemoryCheck: module source:source VSIZE 9637.79 0 RSS 6785.51 3.91797
> %MSG
> Begin processing the 5th record. Run 380005, Event 903899790, LumiSection 535 on stream 1 at 23-May-2024 15:45:59.648 CEST
>
> 6.6 GB sounds like a lot for a 4-thread job, compared to the expectation of 2 GB/thread. From the printouts I see DQM modules being included in the job.

Yes, indeed. At the end, the memory report states that the peak RSS is well above 8 GB:

MemoryReport> Peak rss size 8317.52 Mbytes
 Key events increasing rss:
[900] run: 380005 lumi: 535 event: 902919827  vsize = 14651.1 deltaVsize = 0 rss = 8294.34 delta = 100.824
[904] run: 380005 lumi: 535 event: 903727220  vsize = 14638.9 deltaVsize = -12.3555 rss = 8316.61 delta = -0.90625
[903] run: 380005 lumi: 535 event: 903241610  vsize = 14651.1 deltaVsize = -0.105469 rss = 8317.52 delta = 23.1836
[9] run: 380005 lumi: 535 event: 904518024  vsize = 11789.4 deltaVsize = 674.512 rss = 8001.11 delta = 119.016
[12] run: 380005 lumi: 535 event: 903713728  vsize = 11795.4 deltaVsize = 1.89453 rss = 8171.79 delta = 160.375
[5] run: 380005 lumi: 535 event: 903383179  vsize = 11046.9 deltaVsize = 1402.75 rss = 7808.14 delta = 1024.64
MessageLogger: dropped waiting message count 408

makortel commented 3 months ago

If I dug around correctly, the cmsDriver configuration is here: https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/476e400f0676a4c134fdd8b6b87845df/configFile

It contains --eventcontent ALCARECO,AOD,MINIAOD,DQM --step RAW2DIGI,L1Reco,RECO,PAT,,SKIM:@JetMET1,DQM:@rerecoCommon+@hcal2 --procModifiers gpuValidationEcal

So DQM is clearly involved. I'm wondering about the purpose of --procModifiers gpuValidationEcal, because otherwise there doesn't seem to be anything GPU-related in the job.

The job contains a total of about 2500 modules, of which 11 are OutputModules. OutputModules need memory for buffering, so each OutputModule increases the memory footprint.
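
Such counts can be cross-checked quickly from the expanded configuration, e.g. (a sketch; the configuration file name is a placeholder):

```sh
# dump the fully expanded configuration and count the OutputModule instances
edmConfigDump PSet.py > dumped_config.py
grep -c "cms.OutputModule(" dumped_config.py
```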

makortel commented 3 months ago

I ran some events of the example through the IgProf memory profiler (on EL7) with one thread/stream. Here is the heap state after 5 events: https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue45028/test_07.5_live
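
For reference, a profile like this is typically collected with something along the following lines (a sketch; the options follow the usual IgProf recipe, and config.py is a placeholder):

```sh
# run cmsRun under the IgProf memory profiler (-mp), writing a compressed profile
igprof -d -mp -z -o igprof.mp.gz cmsRun config.py

# turn the profile into a report of live heap allocations (MEM_LIVE counter)
igprof-analyse -d -v -g -r MEM_LIVE igprof.mp.gz > igreport_live.txt
```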

Some observations

makortel commented 3 months ago

One conclusion is that this kind of job would have a higher chance of fitting within the memory limit if run with 8 threads.
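
For a quick local test of that, the thread count can be raised when re-running the configuration (assuming the config itself does not pin the number of threads; in production this is set by the workflow):

```sh
# re-run with 8 threads; the number of streams follows the thread count unless set explicitly
cmsRun --numThreads 8 config.py
```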

ArturAkh commented 3 months ago

After extracting info from logs:

https://cernbox.cern.ch/s/p9awVmSt7nySHPV

The following plot for Peak RSS could be created:

[image: rss_histogram (peak RSS distribution)]

The entries are grouped by host OS.
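
A rough sketch of the kind of extraction behind such a plot (the log file name and the way the host OS is attached to each job are assumptions; the actual script used at KIT may differ):

```sh
# collect the peak RSS value (in MB) from every unpacked cmsRun log
find . -name 'cmsRun1-stdout.log' | while read -r log; do
    rss=$(grep -m1 "MemoryReport> Peak rss size" "$log" | awk '{print $5}')
    # the host OS per job is assumed to come from a separate lookup,
    # e.g. from the job report inside the same log archive
    echo "$log $rss"
done > peak_rss.txt
```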

RHofsaess commented 3 months ago

The corresponding logs are archived here:

xrdfs root://cmsdcache-kit-disk.gridka.de/ ls /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz

This includes all logs, failed and successful, for the campaigns pdmvserv_Run2024C_EGamma0..., pdmvserv_Run2024C_JetMET0..., and pdmvserv_Run2024C_JetMET1... from KIT (T1 and the KIT-HoreKa subsite) and FNAL for May 21-22, 2024.

Note: those are many files, so untarring may take a while :smile:

$ xrdfs root://cmsdcache-kit-disk.gridka.de/ stat /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz

Path:   /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz
Id:     0
Size:   50219471028
MTime:  2024-05-24 14:02:26
Flags:  48 (IsReadable|IsWritable)
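
A sketch for fetching the archive and extracting only part of it (GNU tar wildcard matching; the member paths inside the archive are an assumption, and roughly 50 GB of scratch space is needed):

```sh
# copy the ~50 GB archive locally (valid grid proxy required)
xrdcp root://cmsdcache-kit-disk.gridka.de//store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz .

# extract only the JetMET1 log archives instead of the full content
tar -xzf KIT_FNAL_Run2024C_logs.tar.gz --wildcards '*JetMET1*'
```
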
makortel commented 3 months ago

I ran the first 80 events of the job on 4 threads on lxplus[89], directly and via el[89] containers, and with jemalloc, TCMalloc, and GlibC malloc. I extracted the peak RSS and VSIZE. The numbers are from a single execution, so any potential statistical fluctuation is ignored.

RSS (MB)

|                      | el8 host | el9 host |
|----------------------|----------|----------|
| Jemalloc, el8 binary | 6973.75  | 6664.13  |
| Jemalloc, el9 binary | 6783.95  | 7070.38  |
| TCMalloc, el8 binary | 7280.86  | 7260.79  |
| TCMalloc, el9 binary | 7198.72  | 7366     |
| GlibC, el8 binary    | 7585.84  | 7654.41  |
| GlibC, el9 binary    | 7604.34  | 7765.85  |

VSIZE (MB)

|                      | el8 host | el9 host |
|----------------------|----------|----------|
| Jemalloc, el8 binary | 12790.8  | 12504    |
| Jemalloc, el9 binary | 12719.1  | 12475.7  |
| TCMalloc, el8 binary | 10084.9  | 9898.71  |
| TCMalloc, el9 binary | 9846.76  | 9991.8   |
| GlibC, el8 binary    | 10745.3  | 10695.6  |
| GlibC, el9 binary    | 10735.1  | 10806.2  |

This test did not show a sizable difference between el8 and el9 hosts. Maybe the mechanism that leads to higher memory usage on el9 hosts is more dynamic in nature than what my simple test exercised?
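
For completeness, a sketch of how one host/container combination of that matrix can be reproduced interactively (the cmssw-el8/cmssw-el9 wrappers come with the standard CMS environment from /cvmfs; how the different allocators were selected is not shown here and may differ per release):

```sh
# on an el8 or el9 host (e.g. lxplus8/lxplus9), enter the desired OS container
cmssw-el8        # or: cmssw-el9

# inside the container, set up the release area and run the job configuration
cd CMSSW_14_0_7/src
cmsenv
cmsRun config.py
```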

ArturAkh commented 3 months ago

Hi @makortel,

I think it would be worth testing the full processing, not only the first few events. It might be that some particular events cause the rise in peak RSS.

Furthermore, as far as I understood, you have tested an el8/el9 container on el8/el9 hosts, right?

At least in the context of the issue discussed here, the point is rather that the el7 host + el8/el9 container combination consumes less memory than what you have tested. So I'd say your results are as expected.

Cheers,

Artur

makortel commented 2 months ago

> I think it would be worth testing the full processing, not only the first few events. It might be that some particular events cause the rise in peak RSS.

Perhaps, but that would take days to test. I had time only for something relatively quick, and wanted to see if there was a reproducible systematic difference between el8 and el9.

> Furthermore, as far as I understood, you have tested an el8/el9 container on el8/el9 hosts, right?

Correct.

> At least in the context of the issue discussed here, the point is rather that the el7 host + el8/el9 container combination consumes less memory than what you have tested.

Ah, I guess I missed this point that the difference was el7 vs el8/el9 host (possibly because in the earlier case https://github.com/cms-sw/cmssw/issues/42929 it was el9 specifically that made the difference).