I don't think it's my R jobs, as I have been running them for a few days. Maybe Paola?
No, it's not you.
Looks like several ZMQbg processes from djakubos having a cumulative effect.
20461 - 11 0 8K 11640K 473.1M 140K 636.4M 27244K 0K 0K 0K 0K djakubos djakubos 0% ZMQbg/1
14606 - 0 0 8K 11640K 473.1M 140K 636.4M 27228K 0K 0K 0K 0K djakubos djakubos 0% ZMQbg/1
13703 - 0 0 8K 11640K 473.1M 140K 636.4M 27212K 0K 0K 0K 0K djakubos djakubos 0% ZMQbg/1
18788 - 0 0 8K 11640K 473.1M 140K 636.4M 26912K 0K 0K 0K 348K djakubos djakubos 0% ZMQbg/1
2029 - 0 0 8K 11640K 473.1M 140K 636.4M 26884K 0K 0K 0K 0K djakubos djakubos 0% ZMQbg/1
9656 - 0 0 8K 11640K 473.1M 140K 636.4M 26280K 0K 0K 0K 708K djakubos djakubos 0% ZMQbg/1
12114 - 0 0 8K 11640K 473.8M 140K 637.1M 26092K 0K 0K 0K 2256K djakubos djakubos 0% ZMQbg/1
Perhaps @djakubosky can confirm/deny these are known.
All those items that atop lists as ZMQbg are really part of Python notebooks running as the user, and that fact may not be widely known.
If you run: ps aux | grep <your username>
you can see any stray processes you have. A cleanup is helpful if they are not active.
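For example, something like this (a minimal sketch; the ZMQbg pattern is just what shows up in atop, so double-check the PID list before killing anything):

# list any of your processes whose command line contains ZMQbg
pgrep -u "$USER" -l -f ZMQbg
# if none of them belong to a notebook kernel you still need, kill them
pkill -u "$USER" -f ZMQbg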
I'm talking to everyone who doesn't watch this thread. Paola found some rogue tasks and killed them. David is busy, but I'll bring it to his attention when he is free.
Greatly appreciated. I'm just trying to avoid processes being killed when we hit the wall.
I'm happy, BTW, to kill processes based on age, but I don't want to kill something just because it's a long-running process... for all I know it's doing actual work.
So I feel the best method is "check your process totals" and periodically clean up anything you don't think you need. Thanks again.
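One low-effort way to check your own totals (plain ps; ETIME is elapsed runtime, RSS is resident memory in KB):

# list your processes with PID, elapsed runtime, resident memory, and command
ps -u "$USER" -o pid,etime,rss,comm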
Paola finished killing processes; we're down to 55G. David will kill his when he finishes, which should drop the rest down to a normal standing level (I'd guess ~20G at most).
Yep. Alert just cleared. So thank you and closing!
This continues to be a problem. Nearly half the RAM is occupied by this process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
paola 24919 6.0 45.3 120320072 119726308 pts/17 Sl Mar25 150:14 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /run/user/1035/jupyter/kernel-5c2d27c0-8670-4631-8394-5d30d46bc298.json
If that's actively doing work, fine, but be aware we're swapping hard and we'll probably have some OOM issues soon.
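For anyone wanting to check their own footprint, a rough per-user tally (a sketch; RSS double-counts shared pages, so treat the totals as upper bounds):

# sum resident memory per user, in GB, largest first
ps -eo user=,rss= | awk '{a[$1]+=$2} END {for (u in a) printf "%s %.1fG\n", u, a[u]/1048576}' | sort -k2 -rn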
Spot checks off and on have not shown this happening again. It's clearly visible in the Ganglia graphs, but I'm unclear on the root cause. One suspicion is the drive usage graphs, which I turned off to rule them out. I will move them to another method.
Suspecting it was the drive graphs. Closing for now and looking for an alternative solution.
Please, all users, check if you have stray jobs on fl-hn1, as it's running very low on memory (a couple of quick checks are below).
Possible items include:
Some R jobs I saw a moment ago but now do not, so perhaps those are already done.
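To see where the node stands overall (standard Linux tools, nothing cluster-specific):

# overall memory and swap on the node
free -h
# top memory consumers right now
ps -eo pid,user,rss,comm --sort=-rss | head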