DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Open a ticket with glideinwms developers re. high-memory jobs causing glideinwms to think there are free cores when there aren't. #160

Open StevenCTimm opened 2 months ago

StevenCTimm commented 2 months ago

In the dress rehearsal run we were never able to get above 2400 running jobs, because glideinwms didn't correctly handle the situation where all of our glideins ran out of memory before they ran out of cores. This is a long-standing bug in glideinwms and needs to be pushed from DUNE's highest levels. (Ken)
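
To illustrate the mismatch, here is a minimal sketch of how a glidein can still show free cores after its memory is used up. The `Glidein` class, the function names, and all the numbers (8 cores, 16000 units of memory, 4000 per job) are hypothetical and chosen only for illustration; this is not glideinwms code.

```python
from dataclasses import dataclass


@dataclass
class Glidein:
    cores: int    # total cores in the pilot (hypothetical value)
    memory: int   # total memory in the pilot (hypothetical value and unit)


def jobs_that_fit(g: Glidein, job_cores: int, job_memory: int) -> int:
    """Number of jobs the glidein can run, limited by the scarcer resource."""
    return min(g.cores // job_cores, g.memory // job_memory)


def stranded_cores(g: Glidein, job_cores: int, job_memory: int) -> int:
    """Cores left idle once memory is exhausted; they cannot run another job."""
    return g.cores - jobs_that_fit(g, job_cores, job_memory) * job_cores


g = Glidein(cores=8, memory=16000)
print(jobs_that_fit(g, job_cores=1, job_memory=4000))   # 4
print(stranded_cores(g, job_cores=1, job_memory=4000))  # 4
```

In this example four cores look idle but cannot accept another high-memory job, so counting them as free capacity overstates what is actually available.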

StevenCTimm commented 2 months ago

Ticket has been opened. We can work around it by (1) increasing the idle VMs setting in the frontend and (2) patching one line of code so that glideins with less than X free memory are not counted (X defaults to 2500, but we can pick what we want).
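
The idea behind workaround (2) can be sketched as follows. This is not the actual glideinwms patch: the function name `usable_idle_cores`, the `(free_cores, free_memory)` tuple layout, and the sample values are assumptions made for illustration. Only the default threshold of 2500 comes from the comment above.

```python
def usable_idle_cores(glideins, min_free_memory=2500):
    """Sum idle cores, skipping glideins whose free memory is below the threshold.

    glideins: iterable of (free_cores, free_memory) tuples (hypothetical layout).
    """
    return sum(cores for cores, mem in glideins if mem >= min_free_memory)


# Hypothetical pool: two glideins are out of memory but still show idle cores.
pool = [(4, 0), (2, 3000), (3, 1000)]

naive = sum(cores for cores, _ in pool)   # counts every idle core
filtered = usable_idle_cores(pool)        # counts only cores with enough memory behind them
print(naive, filtered)                    # 9 2
```

With the filter in place, the idle-core count better reflects the cores that can actually start a job, so fewer requests for new glideins would be suppressed by capacity that is idle in name only.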

StevenCTimm commented 1 month ago

We've applied both of the above fixes, and they helped by about 25%. I believe there's a second tweak that can be made, and I have followed up with the glideinWMS developers.