cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Sherpa related workflows get stuck due to a problem with opening an openmpi session #45165

ArturAkh opened this issue 2 months ago (status: Open)

ArturAkh commented 2 months ago

Dear all,

At KIT, we have been seeing problems with Sherpa-related workflows on our opportunistic resources (KIT-HOREKA), for example:

data.RequestName = cmsunified_task_SMP-RunIISummer20UL18GEN-00048__v1_T_240312_112234_8747

The jobs appear to hang at 0% CPU usage, which leads to very low efficiency (below 20%) on the HoreKa resources:

https://grafana-sdm.scc.kit.edu/d/qn-VJhR4k/lrms-monitoring?orgId=1&refresh=15m&var-pool=GridKa+Opportunistic&var-schedd=total&var-location=horeka&viewPanel=98&from=1717406527904&to=1717579327904

After investigating, we found the following error:

A call to mkdir was unable to create the desired directory:

  Directory: /tmp/openmpi-sessions-12009@bms1_0/52106
  Error:     No space left on device

As a result, the process cannot open an Open MPI session. Even more problematic, the job does not fail cleanly but hangs (i.e. it keeps running at 0% efficiency). When running locally, we often see this message in the logs:

Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
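To illustrate how a hang like this could be turned into a proper failure, here is a minimal sketch of a wall-clock watchdog around the job command. It is not part of CMSSW or of any workflow tooling mentioned here; `run_with_timeout` is a hypothetical helper, and the exit code 124 is only chosen to mirror the convention of GNU `timeout`. It limits wall-clock time and does not detect 0% CPU usage as such.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or 124 if it exceeds timeout_s seconds.

    Hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising TimeoutExpired,
        # so a stuck job ends up as a failure instead of idling indefinitely.
        return 124
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a stuck job: a child process that sleeps far longer than the limit.
    stuck_job = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(stuck_job, timeout_s=1))
```

Note that `subprocess.run` only kills the direct child on timeout. Processes spawned by that child (such as additional MPI ranks) would need separate handling, for example a dedicated process group, which this sketch does not attempt.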

According to our local physics group, which has some experience running Sherpa, this is a known problem.

Pointing the $TMPDIR variable to a different location made the process work properly when we ran it manually. We are not sure, though, whether this is the right action to take for all worker nodes of an entire (sub)site.
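For reference, the manual workaround could be sketched roughly as follows: choose the first candidate directory that is writable and has free space, then hand it to the job through `TMPDIR`. This is only a sketch under assumptions. The helper names (`has_room`, `pick_tmpdir`, `job_env`), the thresholds, and any candidate paths are hypothetical and not taken from our site configuration. The inode check is included because `No space left on device` can also be reported when a filesystem runs out of inodes rather than blocks.

```python
import os
import shutil


def has_room(path, min_free_bytes, min_free_inodes=100):
    """True if path is a writable directory with enough free bytes and inodes."""
    if not os.path.isdir(path) or not os.access(path, os.W_OK | os.X_OK):
        return False
    if shutil.disk_usage(path).free < min_free_bytes:
        return False
    st = os.statvfs(path)
    # Some filesystems report 0 total inodes (no fixed inode table);
    # in that case the inode check is skipped.
    return st.f_files == 0 or st.f_favail >= min_free_inodes


def pick_tmpdir(candidates, min_free_bytes):
    """Return the first candidate directory with enough room, or None."""
    for path in candidates:
        if has_room(path, min_free_bytes):
            return path
    return None


def job_env(candidates, min_free_bytes, base_env=None):
    """Copy of the environment with TMPDIR pointing to a usable directory."""
    env = dict(os.environ if base_env is None else base_env)
    chosen = pick_tmpdir(candidates, min_free_bytes)
    if chosen is None:
        raise RuntimeError("no candidate temporary directory has enough free space")
    env["TMPDIR"] = chosen
    return env
```

The resulting dictionary could then be passed as `env=` to `subprocess.run` when launching the job by hand. Whether such a redirect is appropriate site-wide is exactly the open question above.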

We would like to know how to resolve this issue, and whether anything needs to change in the Open MPI libraries in the CMSSW software stack.

Best regards,

Artur Gottmann

cmsbuild commented 2 months ago

A new Issue was created by @ArturAkh.

@Dr15Jones, @antoniovilela, @makortel, @sextonkennedy, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

Dr15Jones commented 2 months ago

assign generator

Dr15Jones commented 2 months ago

assign generators

cmsbuild commented 2 months ago

New categories assigned: generators

@alberto-sanchez,@bbilin,@GurpreetSinghChahal,@mkirsano,@menglu21,@SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 month ago

Ping @cms-sw/generators-l2