frazer-lab / cluster

Repo for cluster issues.

fl-hn1 reporting low memory #276

Closed tatarsky closed 5 years ago

tatarsky commented 5 years ago

Please, all users: check whether you have stray jobs on fl-hn1, as it's running very low on memory.

Possible items include:

  PID   TID MINFLT  MAJFLT VSTEXT  VSLIBS  VDATA  VSTACK  VSIZE  RSIZE   PSIZE  VGROW   RGROW SWAPSZ  RUID      MEM  CMD
 7700     -      0       0     8K  232.2M   9.1G    156K   9.7G   7.0G      0K     0K      0K   372K  djakubos   3%  ZMQbg/1
16423     -   1666       0     8K  431.3M   4.1G    180K   5.1G   2.0G      0K     0K      0K 31100K  djakubos   1%  ZMQbg/1

I saw some R jobs a moment ago but no longer do, so perhaps those are already done.

s041629 commented 5 years ago

I don't think it's my R jobs, as I have been running them for a few days. Maybe Paola?

tatarsky commented 5 years ago

No, it's not you.

It looks like several ZMQbg processes from djakubos are having a cumulative effect.

20461      -      11       0       8K  11640K  473.1M     140K  636.4M  27244K      0K       0K      0K      0K   djakubos  djakubos    0%  ZMQbg/1
14606      -       0       0       8K  11640K  473.1M     140K  636.4M  27228K      0K       0K      0K      0K   djakubos  djakubos    0%  ZMQbg/1
13703      -       0       0       8K  11640K  473.1M     140K  636.4M  27212K      0K       0K      0K      0K   djakubos  djakubos    0%  ZMQbg/1
18788      -       0       0       8K  11640K  473.1M     140K  636.4M  26912K      0K       0K      0K    348K   djakubos  djakubos    0%  ZMQbg/1
 2029      -       0       0       8K  11640K  473.1M     140K  636.4M  26884K      0K       0K      0K      0K   djakubos  djakubos    0%  ZMQbg/1
 9656      -       0       0       8K  11640K  473.1M     140K  636.4M  26280K      0K       0K      0K    708K   djakubos  djakubos    0%  ZMQbg/1
12114      -       0       0       8K  11640K  473.8M     140K  637.1M  26092K      0K       0K      0K   2256K   djakubos  djakubos    0%  ZMQbg/1

Perhaps @djakubosky can confirm or deny whether these are known.

tatarsky commented 5 years ago

All those items that atop lists as ZMQbg are actually part of Python notebooks running as the user, and that fact may not be widely known.

If you run: ps aux | grep (your username) you can see any stray processes you have. Cleaning them up is helpful if they are not active.

billgreenwald commented 5 years ago

I'm talking to everyone who doesn't watch this thread. Paola found some rogue tasks and killed them. David is busy, but I'll bring it to his attention when he is free.


tatarsky commented 5 years ago

Greatly appreciated. I'm just trying to avoid processes being killed when we hit the wall.

tatarsky commented 5 years ago

I'm happy, BTW, to kill processes based on age, but I don't want to kill something just because it's long-running; for all I know it's doing actual work.

So I feel the best method is to "check your process totals" and periodically clean up anything you don't think you need. Thanks again.
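One conservative way to do the "by age" review is to list candidates rather than kill them outright. A hedged sketch using GNU ps's etimes column (elapsed seconds); the 7-day threshold is hypothetical and should be tuned to the cluster's policy:

```shell
# Print PID and command of this user's processes older than 7 days.
# Review the list by hand before killing anything.
AGE_LIMIT=604800   # hypothetical cutoff: 7 days in seconds
ps -u "$USER" -o pid=,etimes=,comm= |
    awk -v lim="$AGE_LIMIT" '$2 > lim { print $1, $3 }'
```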

billgreenwald commented 5 years ago

Paola finished killing processes; we're down to 55G. David will kill his when he finishes, which should drop the rest down to a normal standing level (I'd guess ~20G at most).


tatarsky commented 5 years ago

Yep. Alert just cleared. So thank you and closing!

tatarsky commented 5 years ago

This continues to be a problem. Nearly half the RAM is occupied by this process:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
paola    24919  6.0 45.3 120320072 119726308 pts/17 Sl Mar25 150:14 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /run/user/1035/jupyter/kernel-5c2d27c0-8670-4631-8394-5d30d46bc298.json

If that's actively doing work, fine, but be aware we're swapping hard and will probably have some OOM issues soon.
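The kernel-&lt;uuid&gt;.json path in the command line ties a process like this back to a specific Jupyter session, so it can be traced to a notebook rather than killed blind. A minimal sketch for listing a user's IRkernel processes (shown with "$USER" for illustration; substitute the username in question; column options are GNU/procps ps):

```shell
# Show IRkernel processes with memory share and start time.
# The [I] bracket trick excludes the grep process itself from the
# output, and "|| true" makes an empty result a non-error.
ps -u "$USER" -o pid,%mem,lstart,args | grep "[I]Rkernel" || true
```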

tatarsky commented 5 years ago

Spot checks off and on have not shown this happening again. It's clearly visible in the Ganglia graphs, but I'm unclear on the root cause. One suspicion is the drive-usage graphs, which I turned off in order to rule them out. I will move them to another collection method.

http://flh1.ucsd.edu/ganglia/graph_all_periods.php?h=fl-hn1&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1559318174&g=mem_report&z=large&c=FrazerNodes

tatarsky commented 5 years ago

Suspecting it was the drive graphs. Closing for now and looking for an alternative solution.