lars-t-hansen opened 21 hours ago
Probably not naicmonitor, since sonalyze jobs also reports very high readings:
```
$ sonalyze jobs -cluster mlx -host ml3 -u adamjak -fmt job,user,res
job      user     res-avg  res-peak
513585   adamjak  791      797
2137644  adamjak  785      791
3793797  adamjak  765      765
3901112  adamjak  672      672
3947792  adamjak  697      701
260787   adamjak  672      672
402009   adamjak  694      699
```
From htop, these appear to be separate but concurrent processes that share memory. This is not something we're currently trying to deal with, and I'm not sure we can... But we may have to, because this is what it can look like when somebody uses fork() within Python to get some kind of threading.
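To make the double counting concrete (the breakdown below is hypothetical; only the 128GB machine size and the ~790% readings come from the data above): a res-avg of 791% on a 128GB machine means the summed RSS of the job's processes is roughly 1TB. If a job consists of N forked workers that all map the same large block of memory, each worker's RSS includes that whole block, so summing RSS counts it N times; for example, 8 workers sharing a ~100GB block would report ~800GB (about 625% of 128GB) from the shared pages alone, even though the physical footprint is only ~100GB. Summing PSS instead would charge each worker 1/N of the shared pages, and the total would stay near the real footprint.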
We may need to record private memory in sonar to work around this, or at least to detect it. What we really want is PSS, but we can't have it...
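For reference, PSS and private RSS are both exposed by /proc/&lt;pid&gt;/smaps_rollup on Linux 4.14 and later, which is much cheaper to read than the full smaps. Whether sonar can actually use it depends on privileges: reading another user's smaps_rollup requires the same access as ptrace-reading that process, which may be why we can't have it in practice. A minimal sketch of what such a probe could look like (illustrative only, not sonar code; the pssprobe name and the whole program are made up for this issue):

```rust
// Sketch: read RSS, PSS and private RSS for one pid from
// /proc/<pid>/smaps_rollup (Linux 4.14+).  Summing PSS across a job's
// processes does not double-count shared pages; comparing Rss against
// Private_Clean + Private_Dirty would at least let us detect heavy sharing.
// Reading smaps_rollup for other users' processes may require elevated
// privileges, depending on how sonar is deployed.
use std::fs;

// Extract a field such as "Pss:             1234 kB" and return the kB value.
fn field_kib(text: &str, field: &str) -> Option<u64> {
    text.lines()
        .find(|l| l.starts_with(field))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|n| n.parse().ok())
}

fn main() -> std::io::Result<()> {
    let pid = std::env::args().nth(1).expect("usage: pssprobe <pid>");
    let text = fs::read_to_string(format!("/proc/{pid}/smaps_rollup"))?;
    let rss = field_kib(&text, "Rss:").unwrap_or(0);
    let pss = field_kib(&text, "Pss:").unwrap_or(0);
    let private = field_kib(&text, "Private_Clean:").unwrap_or(0)
        + field_kib(&text, "Private_Dirty:").unwrap_or(0);
    println!(
        "pid {pid}: rss={rss} kB  pss={pss} kB  private={private} kB  shared={} kB",
        rss.saturating_sub(private)
    );
    Ok(())
}
```

If sonar recorded Private_Clean + Private_Dirty per process, then summing private memory over a job would at least give a lower bound on the real footprint, and jobs where summed RSS is much larger than summed private memory could be flagged as heavy sharers.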
A couple of features would be good here: sonalyze jobs should be able to list all the pids that go into a job. We can do a few things with "profile", but it's not great.
We also see a problem similar to the one in #675 - there's a job tree here that it would be good to visualize.
Related is #77, of course.
On 21 November the memory readings (RAM) went far above 100%, reaching almost 1300%. This is absurd, yet it has happened several times during the past week.
The config and node readings for ML3 agree: the machine has 128GB RAM.
Both dashboards display the problem, so it could be a problem with the base data (the readings), with postprocessing, or with naicreport, but probably not with the dashboards themselves (they are completely separate code).