frazer-lab / cluster

Repo for cluster issues.

flh1 super slow #172

Closed hurleyLi closed 7 years ago

hurleyLi commented 7 years ago

Hi @tatarsky,

For whatever reason flh1 has very little free memory, and the head node is super slow. Not sure whether it's because of the rsync process that @hirokomatsui runs, or because of open/dead notebooks, or something else. Could you please help us look into it so that we can avoid eating all the memory on the head node in the future?

Thanks, Hurley

tatarsky commented 7 years ago

Please schedule a group concall to discuss the repeated problems with jupyter-notebooks and the users that do not seem to be monitoring their memory use or active nature. This is a repeated issue and needs some discussion.

I do not set the policies here. @hirokomatsui and @nariai do. This basically boils down to a head node policy being needed on long running processes. And perhaps a purchase of some additional RAM if all these notebooks I see are really needed and/or shell memory limits.

If these are NOT needed notebooks, we need to determine if there is some setting these users have that keeps leaving them around.

Here is a ps aux | grep jupy summary showing the relative counts and which users have them:

 14 djakubosky
 13 hel070
 22 joreyna
 36 matteo
 83 mdonovan
  2 nnariai
  6 paola

We do not have the resources to run this many. Particularly with some using 5-6GB memory use each.

So I'm starting with the following:

I am killing any notebook still running from 2016 now.

With the assumption they are dead/detached.

Until that concall is scheduled, I will start doing the following at noon your time today:

Any jupyter notebook over 3 days old shall be assumed to be a dead notebook and killed.

tatarsky commented 7 years ago

For example, you have the following still running from last year. Can you confirm/deny that these are dead/unused? I'll leave them for now as it's 2:20 AM there.

Anyone can check for this sort of thing with ps. Perhaps a simple wrapper like "oldnotebooks" or something for people to type as a command with a "kill" option?
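
A minimal sketch of such a wrapper, assuming GNU procps ps (the name "oldnotebooks", the 3-day cutoff, and the "kill" option are all hypothetical, per the suggestion above):

```shell
#!/bin/sh
# oldnotebooks: hypothetical sketch of the wrapper suggested above.
# Lists jupyter/IRkernel processes older than MAX_DAYS; "oldnotebooks kill"
# would terminate them. etimes = elapsed seconds since process start (procps).
MAX_DAYS=${MAX_DAYS:-3}

ps -eo pid=,user=,etimes=,args= |
awk -v max=$((MAX_DAYS * 86400)) \
    '/jupyter|IRkernel/ && $3 > max { print $1, $2, int($3 / 86400) "d" }' |
while read -r pid user age; do
    echo "old notebook: pid=$pid user=$user age=$age"
    [ "${1:-}" = "kill" ] && kill "$pid"
done
```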

hel070     756  0.0  0.0 480360 54316 ?        Ssl   2016   0:48 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-589820ed-b5ef-40f1-aed8-4608c22d42b5.json

hel070    1565  0.0  0.0 489400 40076 ?        Ssl   2016   0:35 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-89cfd06c-c5bd-4411-9316-3ce83aa98449.json

hel070    4888  0.0  0.0 621672 158328 ?       Ssl   2016   4:59 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-76089a30-6577-4d3f-8981-d223148a4e95.json

hel070    6297  0.0  0.0 586576 94888 ?        Ssl   2016   0:12 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-aff19fbd-61bf-4e85-8ac0-48140db1a965.json

hel070   10741  0.0  0.0 471428 76696 ?        Ssl   2016   0:03 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-421df07f-ff57-4ec0-ba9a-333ba6e712de.json

hel070   23598  0.0  0.0 444296  3360 ?        Ssl   2016   0:01 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-d26379c6-079a-4ac7-8e16-e9edf9db9a0a.json

hel070   24311  0.0  0.0 596272 124056 ?       Ssl   2016   0:27 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-dc88c973-16e1-4a53-b99f-34b43e01aaab.json

hel070   13309  0.0  0.0 456656 61364 ?        Ssl   2016   0:02 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-70f145c8-2e9f-4673-8cf7-ed1413016e05.json

And then several that I would be killing under the "3 days old" policy.

hel070   15650  0.0  0.1 758104 353008 ?       Ssl  Jan09   2:26 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-c573fd66-77fc-4514-8665-ba7de1ee6ed0.json

hel070   18743  0.0  0.0 349500 31164 ?        Ssl  Jan07   0:00 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-489c5f1c-7c44-4328-a29b-c5c9f6318c15.json

hel070   20117  0.0  0.0 349500 32184 ?        Ssl  Jan09   0:00 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-896fdef7-07e8-4762-87fe-a7c8253428c2.json

hel070   26910  0.0  0.0 441012 63752 ?        Ssl  Jan13   0:08 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-0e463f53-fa0f-4e48-b4ce-1b9e6dcbe200.json

hel070   11237  0.0  0.0 420332 43124 ?        Ssl  Jan12   0:01 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-d5bac441-f7e7-45c6-ba59-f38275bedef1.json

And an old samtools process:

hel070   20809  0.0  0.0  18280   280 ?        S     2016   0:00 samtools view /frazer01/projects/CARDIPS/pipeline/Hi-C/sample/0be6b9ce-a16b-41b9-8371-f5a5d6a719ca/0be6b9ce-a16b-41b9-8371-f5a5d6a719ca.merged_nodups.filtered.intra.longRange.bam

Note clearly: I do not lightly start killing people's processes. I assume they are monitoring them and have a reason for them. So I still want the call, to confirm that such a policy is needed because that monitoring is either not obvious to do or requires training/oversight/nightly email nagging.

tatarsky commented 7 years ago

Then, in terms of raw memory impact and shell limits: we need to understand notebooks like these from matteo, for example, which are 5-6GB each. If we set shell memory limits, this class of process would be impacted and would die at whatever number we set. I.e., if we say "no more than 2GB per process," these can no longer be run. (Remember, I cannot limit memory PER USER. It's done per process.)

matteo    1848  0.2  1.2 3626600 3331772 ?     Ssl  Jan13  13:12 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-be89783e-f375-471a-bcea-46519a32f8bb.json
matteo   15870  0.1  1.4 4146388 3785696 ?     Ssl  Jan16   4:17 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-9287228d-9ec4-42a7-8ef5-a4b900838ca6.json
matteo   11680  0.8  2.6 7332132 6941816 ?     Ssl  Jan17   7:32 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-0523c81b-84ef-4cb8-ae76-406e58ae792f.json
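
A per-process cap of the kind described above could be sketched as follows (the 2GB value is purely illustrative, not a proposed number):

```shell
# Per-process address-space cap via shell limits (2 GB here is hypothetical).
# System-wide it would live under /etc/security/limits.d/ (pam_limits), e.g.:
#   *    hard    as    2097152    # KB of virtual address space, per process
#
# Per shell session, the soft limit can be lowered directly:
ulimit -v 2097152          # 2 GB; allocations beyond this fail
ulimit -v                  # read it back
```

Note this is exactly the per-process (not per-user) behavior mentioned above: each R kernel gets its own 2GB ceiling, so a user with ten kernels can still consume 20GB in aggregate.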
tatarsky commented 7 years ago

There are a fair number of Jupyter Notebook tickets that discuss a need for the code to clean these up itself. Noting one or two for reference. No solution seen in them yet.

https://github.com/jupyterhub/jupyterhub/issues/680

Possibly interesting item in the last comment of this one, but to be clear it's USER driven:

https://github.com/ipython/ipython/issues/5539

tatarsky commented 7 years ago

Also as I watch memory, this appears to be something that should be running as a compute job:

hel070   20810  0.0  0.0 133772  3664 ?        S     2016   0:00 python /frazer01/projects/CARDIPS/pipeline/Hi-C/script/samToBe2d
hel070   29645 16.0  0.0 183800 32252 ?        R    02:57   0:00 /frazer01/home/hel070/anaconda2/lib64/R/bin/exec/R --slave --no-restore --file=/frazer01/projects/CARDIPS/pipeline/Hi-C/script/convert3ColToMatrix.forAssign.R --args CM_4850_A.3col perContactWithHapAndCount/CM_assignHap_matrix/CM_4850_A.mat CM

Is there a reason this is running on the head node? It keeps bubbling to the top of memory use (brief and small usage, but it spikes) and then spawns another R. So it feels like a job-based method would be more useful.

tatarsky commented 7 years ago

Also for the discussion: if all these notebooks are needed, we currently have the RAM in this unit (and fl-hn2) half populated at 256GB using 32GB sticks. (it takes 64GB sticks but they are often quite expensive)

We could double the memory with 8 x 32GB sticks filling it up to 512GB.

That would require purchase approval. So perhaps a combination of that and some cleanup methods would reduce the swapping I see.

hurleyLi commented 7 years ago

Thanks Paul for digging into this. It looks like most of them are dead/unused notebooks. For my processes, I think I didn't kill them properly; I didn't even know they existed until this morning. I usually open a notebook from a "screen": to kill the notebook, I attach that screen, press Ctrl-C twice, and then kill the screen. But occasionally I might kill the screen directly without killing the notebook first. I suspect that might be why there are old notebooks hanging around. And I highly suspect the old notebooks from @mkrdonovan and @s041629 are there because they were not killed properly either.

I agree we should have some kind of policy to limit each process memory usage.

hurleyLi commented 7 years ago

It's weird that even though those old notebooks are hanging around, I have no way to access them: once I kill them (properly or not), there is no longer any "port" to reach them on. But somehow they seem to still be running in the background.

tatarsky commented 7 years ago

Yep. If you read some of those Git issues for the software, this is due to various issues/complexities in the network daemon code. The main problem for me (or a script) is that I don't think I can determine what "state" they are in. Let me see if I can learn something from their socket state.
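
One state check that is cheap to script: when a notebook server exits without killing its kernels, the kernels are typically reparented to init, so PPID 1 is a reasonable (though not airtight, especially under subreapers) "orphaned" signal. A sketch:

```shell
# List kernels whose parent (the notebook server) is gone: they have been
# reparented to init (PPID 1) and nothing can reconnect to them.
ps -eo pid=,ppid=,user=,args= |
awk '$2 == 1 && /IRkernel|ipykernel/ { print $1, $3 }'
```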

billgreenwald commented 7 years ago

I'm going to comment to clarify how jupyter notebooks should be handled, since I just discussed this with Hurley and we found a disconnect.

When you type "jupyter notebook" into the terminal, it launches a jupyter notebook server on your port.

This server A) lets you connect through a browser and B) manages all of your notebook kernels. You only need to have one of these running at a time. So having a single screen open that runs your server is fine. These processes can get quite old, and should be the only processes that run for a long period of time. If we set a timeout on jobs (say one month), then this job will die at the end of each month, and need to be restarted. When your notebook server dies, it kills all of your open kernels running from it.

Inside of the notebook server on the web interface, there is a "Running" tab. From here, you can click "Shutdown" on any notebook. This is the equivalent of being in the notebook and selecting "File"-->"Close and halt". This will kill the kernel for that particular notebook. The server will still run, and you can manage other notebooks.
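
As a complement to the "Running" tab: the kernel-<uuid>.json connection files visible in the ps listings above live under each user's jupyter runtime directory, and files untouched for days are a hint (not proof) that no one is connected to those kernels. A sketch, with the 3-day cutoff mirroring the proposed policy:

```shell
# Stale kernel connection files: not modified in more than 3 days.
find "$HOME/.local/share/jupyter/runtime" -name 'kernel-*.json' -mtime +3 2>/dev/null
```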

If there are any other questions, let me know.

tatarsky commented 7 years ago

The above makes sense to me. I assumed these were perhaps notebooks spawned from that parent web interface as well.

Note there is no "job" involved here on the head node, as in a scheduler job. The processes on the head nodes are run by users directly, I have only a few tools to control them, and they are somewhat impacting at times. I can only use cgroups or shell limits on the head nodes; they are NOT in the scheduler (per a discussion a long time ago).

I can set user CPU time limits, but that's not quite the same as a wallclock time limit via a scheduler, and it can be rather confusing. It would be nice if the software had some kind of "terminate after 24 hours of no activity" option.

I think basically training all Notebook users in exactly what you said above is really the most freedom-preserving way to deal with this.

billgreenwald commented 7 years ago

The kernels each appear as their own process in top/htop/ps (*note on this at the bottom). The main thing is to kill the notebooks you aren't using, and to have only a single notebook server open at a time. I think jupyter behaves oddly when you run multiple servers, each with its own kernels.

*note on processes: for some reason, the jobs are listed multiple times, all with the exact same usage, but the actual usage is only for one. For example, I may see in htop:

100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py

However, there is actually only one 100% CPU process running, and I am only using 10GB of ram, not 50GB of ram. The long string of numbers is a unique hash for each notebook kernel, so by checking the numbers against your other processes, you can see which processes are unique and which are listed multiple times.

I think that you sometimes just count how many processes we have running and total their RAM, Paul, and this is what creates a weird disparity between your report and the actual usage on the cluster. For example, in December when you said mdonovan had ~80 processes running, she really only had 7, but many of them were listed at least 10 times. I don't know how to fix this, and I don't know how to make it easier to see which kernel is taking up a lot of RAM. Shutting down notebooks as necessary, or porting them out to the compute nodes, however, should help keep the head nodes free.

tatarsky commented 7 years ago

I'm looking purely at ps output on the single host (no cluster involved, as in the SGE cluster). But it's very possible that if this is threaded code there is some shared RAM going on, even though they have unique PIDs.

For example am looking at this:

ps aux|grep IRkernel|grep mdonovan

They all have low memory percentages (fourth column in ps). But I interpret that as twelve processes (all have unique PIDs), and I see the .json id number is unique to each. Does that translate to "12 notebooks"?

//

Can notebooks be run via Qlogin? I thought the main reason people were running on the head nodes was that they are the only exposed systems, and folks didn't always understand SSH port forwarding.

billgreenwald commented 7 years ago

I do have a script for tunneling ports through SGE to the head nodes, so we could run them in an interactive setting without exposing the compute nodes. However, I am unsure how this works with the job scheduler; my understanding is that they would die if they went above the requested RAM or CPU usage.

As far as mdonovan goes, I am in htop and currently see 4 unique notebook kernel hashes, each one displayed 3 times. So this would be 12 processes as you see from ps aux | grep, but in actuality only 4 notebooks.

Here is a screenshot

[screenshot]

billgreenwald commented 7 years ago

For another example, here is a single notebook listed about 10 times on flh2 (one of mine).

[screenshot]

tatarsky commented 7 years ago

The above is probably showing threads.
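
If those extra rows are threads, ps can make the distinction explicit: nlwp is the per-PID thread count, and counting unique PIDs rather than rows gives the real kernel count. A sketch (the `[I]Rkernel` grep trick just keeps grep from matching itself):

```shell
# One row per process, with its thread count (nlwp = number of threads):
ps -eo pid,nlwp,rss,args | grep '[I]Rkernel'

# Count unique PIDs rather than rows -- the actual number of kernels:
ps aux | grep '[I]Rkernel' | awk '{ print $2 }' | sort -u | wc -l
```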

tatarsky commented 7 years ago

Also, while htop is nice, I believe it's showing the same "mem%" concept as ps when sorted with "M".

atop in memory mode may show some details that I don't think htop delves into:

atop

Press "m" and wait a few seconds.

SWAPSZ may be the area we want to alert on for "slowness".
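
A crude alert along those lines could just watch vmstat's swap-in/swap-out columns (field positions per procps vmstat; treating any nonzero value as "swapping" is a hypothetical threshold):

```shell
# Sample swap activity over one second; si/so are pages swapped in/out per sec.
vmstat 1 2 | tail -1 |
awk '{ si = $7; so = $8; if (si + so > 0) print "swapping:", si, so }'
```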

tatarsky commented 7 years ago

Oh, and you can show threads with "y" in that same view. So I'm watching that for a bit.

tatarsky commented 7 years ago

I am showing 54% of the system RAM being used by this process, @hirokomatsui.

What is it?

hiroko   11985 99.9 53.6 141982084 141757780 ? R    Jan18 711:12 perl /frazer01/home/hiroko/install/circos-0.69-4/bin/circos -conf circos.conf

While this one is new, I'd like to better understand what the head nodes are being used for, compared to perhaps adding some high-memory nodes with exposed IP addresses (basically some "notebook servers"). I can explain more if folks would like.

hirokomatsui commented 7 years ago

Sorry, I should have run it on the compute nodes. I've killed it. It's a drawing program; I don't know much about it yet.

tatarsky commented 7 years ago

OK, cool. Mostly I'm just trying to see the forest for the trees on the head nodes, and make real suggestions for dealing with them.

tatarsky commented 7 years ago

Setting policy per #173