lars-t-hansen opened 21 hours ago
Probably not naicmonitor, since sonalyze jobs also reports very high readings:
```
$ sonalyze jobs -cluster mlx -host ml3 -u adamjak -fmt job,user,res
job      user     res-avg  res-peak
513585   adamjak  791      797
2137644  adamjak  785      791
3793797  adamjak  765      765
3901112  adamjak  672      672
3947792  adamjak  697      701
260787   adamjak  672      672
402009   adamjak  694      699
```
From htop, these appear to be separate but concurrent processes that share memory. This is not something we're currently trying to deal with, and I'm not sure we can... But we may have to, because this is what it can look like when somebody uses fork() within Python to get some kind of threading.
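To make the double counting concrete (the breakdown below is hypothetical; only the 128GB machine size and the ~790% readings come from the data above): a res-avg of 791% on a 128GB machine means the summed RSS of the job's processes is roughly 1TB. If a job consists of N forked workers that all map the same large block of memory, each worker's RSS includes that whole block, so summing RSS counts it N times; for example, 8 workers sharing a ~100GB block would report ~800GB (about 625% of 128GB) from the shared pages alone, even though the physical footprint is only ~100GB. Summing PSS instead would charge each worker 1/N of the shared pages, and the total would stay near the real footprint.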
We may need to record private memory in sonar to work around this, or at least to detect it. What we really want is PSS, but we can't have it...
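For reference, PSS and private RSS are both exposed by /proc/&lt;pid&gt;/smaps_rollup on Linux 4.14 and later, which is much cheaper to read than the full smaps. Whether sonar can actually use it depends on privileges: reading another user's smaps_rollup requires the same access as ptrace-reading that process, which may be why we can't have it in practice. A minimal sketch of what such a probe could look like (illustrative only, not sonar code; the pssprobe name and the whole program are made up for this issue):

```rust
// Sketch: read RSS, PSS and private RSS for one pid from
// /proc/<pid>/smaps_rollup (Linux 4.14+).  Summing PSS across a job's
// processes does not double-count shared pages; comparing Rss against
// Private_Clean + Private_Dirty would at least let us detect heavy sharing.
// Reading smaps_rollup for other users' processes may require elevated
// privileges, depending on how sonar is deployed.
use std::fs;

// Extract a field such as "Pss:             1234 kB" and return the kB value.
fn field_kib(text: &str, field: &str) -> Option<u64> {
    text.lines()
        .find(|l| l.starts_with(field))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|n| n.parse().ok())
}

fn main() -> std::io::Result<()> {
    let pid = std::env::args().nth(1).expect("usage: pssprobe <pid>");
    let text = fs::read_to_string(format!("/proc/{pid}/smaps_rollup"))?;
    let rss = field_kib(&text, "Rss:").unwrap_or(0);
    let pss = field_kib(&text, "Pss:").unwrap_or(0);
    let private = field_kib(&text, "Private_Clean:").unwrap_or(0)
        + field_kib(&text, "Private_Dirty:").unwrap_or(0);
    println!(
        "pid {pid}: rss={rss} kB  pss={pss} kB  private={private} kB  shared={} kB",
        rss.saturating_sub(private)
    );
    Ok(())
}
```

If sonar recorded Private_Clean + Private_Dirty per process, then summing private memory over a job would at least give a lower bound on the real footprint, and jobs where summed RSS is much larger than summed private memory could be flagged as heavy sharers.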
A couple of features would be good here: sonalyze jobs should be able to list all the pids that go into a job. We can do a few things with "profile", but it's not great.
We also see a problem similar to the one in #675 - there's a job tree here that it would be good to visualize.
Related is #77, of course.
On 21 November the memory readings (RAM) went far above 100%, reaching almost 1300%. This is absurd, yet it has happened several times during the past week.
The config and node readings for ML3 agree: the machine has 128GB RAM.
Both dashboards display the problem, so it could be a problem with the base data (the readings), with postprocessing, or with naicreport, but probably not with the dashboards themselves (they are completely separate code).