Please schedule a group concall to discuss the recurring problems with Jupyter notebooks and the users who do not seem to be monitoring their memory use or whether their notebooks are still active. This keeps coming up and needs some discussion.
I do not set the policies here; @hirokomatsui and @nariai do. This basically boils down to needing a head node policy on long-running processes, and perhaps shell memory limits and/or a purchase of some additional RAM if all these notebooks I see are really needed.
If these notebooks are NOT needed, we need to determine whether there is some setting these users have that keeps leaving them around.
Here is a ps aux|grep jupy summary to see relative counts and which users have them (a one-liner for reproducing counts like this is sketched after the list):
14 djakubosky
13 hel070
22 joreyna
36 matteo
83 mdonovan
2 nnariai
6 paola
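(For reference, a rough one-liner along these lines is how counts like the above can be pulled; the bracketed [j] just keeps grep from matching itself:

ps aux | grep '[j]upy' | awk '{print $1}' | sort | uniq -c | sort -rn

Nothing exotic, just process rows grouped by the user column.)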
We do not have the resources to run this many, particularly with some of them using 5-6GB of memory each.
So I'm starting with the following:
I am killing any notebook still running from 2016 now, on the assumption that they are dead/detached.
Until that concall is scheduled, I will start doing the following at noon your time today:
Any jupyter notebook over 3 days old shall be assumed to be a dead notebook and killed.
For example, you have the following still running from last year. Can you confirm/deny that these are dead/unused? I'll leave them for now since it's 2:20 AM there.
Anyone can check for this sort of thing with ps. Perhaps a simple wrapper like "oldnotebooks" or something for people to type as a command, with a "kill" option? (A rough sketch follows the listings below.)
hel070 756 0.0 0.0 480360 54316 ? Ssl 2016 0:48 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-589820ed-b5ef-40f1-aed8-4608c22d42b5.json
hel070 1565 0.0 0.0 489400 40076 ? Ssl 2016 0:35 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-89cfd06c-c5bd-4411-9316-3ce83aa98449.json
hel070 4888 0.0 0.0 621672 158328 ? Ssl 2016 4:59 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-76089a30-6577-4d3f-8981-d223148a4e95.json
hel070 6297 0.0 0.0 586576 94888 ? Ssl 2016 0:12 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-aff19fbd-61bf-4e85-8ac0-48140db1a965.json
hel070 10741 0.0 0.0 471428 76696 ? Ssl 2016 0:03 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-421df07f-ff57-4ec0-ba9a-333ba6e712de.json
hel070 23598 0.0 0.0 444296 3360 ? Ssl 2016 0:01 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-d26379c6-079a-4ac7-8e16-e9edf9db9a0a.json
hel070 24311 0.0 0.0 596272 124056 ? Ssl 2016 0:27 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-dc88c973-16e1-4a53-b99f-34b43e01aaab.json
hel070 13309 0.0 0.0 456656 61364 ? Ssl 2016 0:02 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-70f145c8-2e9f-4673-8cf7-ed1413016e05.json
And then several that I would be killing under the "3 days old" policy.
hel070 15650 0.0 0.1 758104 353008 ? Ssl Jan09 2:26 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-c573fd66-77fc-4514-8665-ba7de1ee6ed0.json
hel070 18743 0.0 0.0 349500 31164 ? Ssl Jan07 0:00 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-489c5f1c-7c44-4328-a29b-c5c9f6318c15.json
hel070 20117 0.0 0.0 349500 32184 ? Ssl Jan09 0:00 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-896fdef7-07e8-4762-87fe-a7c8253428c2.json
hel070 26910 0.0 0.0 441012 63752 ? Ssl Jan13 0:08 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-0e463f53-fa0f-4e48-b4ce-1b9e6dcbe200.json
hel070 11237 0.0 0.0 420332 43124 ? Ssl Jan12 0:01 /frazer01/home/hel070/R-3.3.0/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/hel070/.local/share/jupyter/runtime/kernel-d5bac441-f7e7-45c6-ba59-f38275bedef1.json
And an old samtools process:
hel070 20809 0.0 0.0 18280 280 ? S 2016 0:00 samtools view /frazer01/projects/CARDIPS/pipeline/Hi-C/sample/0be6b9ce-a16b-41b9-8371-f5a5d6a719ca/0be6b9ce-a16b-41b9-8371-f5a5d6a719ca.merged_nodups.filtered.intra.longRange.bam
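(A very rough sketch of that "oldnotebooks" wrapper idea, to make it concrete. The name, the 3-day default, and the "kill" option are just the proposal from this thread, and it assumes a ps that supports the etimes field:

#!/bin/bash
# oldnotebooks [days] [kill] -- list jupyter kernel processes older than N days (default 3)
# and signal them if "kill" is given. You can only kill your own processes unless root.
DAYS=${1:-3}
DO_KILL=${2:-}
CUTOFF=$((DAYS * 86400))
ps -eo pid,user,etimes,args | grep '[j]upyter/runtime/kernel-' | while read pid user etimes args; do
    if [ "$etimes" -gt "$CUTOFF" ]; then
        echo "$user $pid $((etimes / 86400)) days old: $args"
        [ "$DO_KILL" = "kill" ] && kill "$pid"
    fi
done

Just a sketch; the real thing would want some sanity checks.)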
Note clearly: I do not lightly start killing people's processes. I assume they are monitoring their processes and have a reason for them. So I still want the call, to confirm such a policy is needed because that monitoring is either not obvious to do or needs training/oversight/nightly email nagging.
Then, in terms of raw memory impact and shell limits: we need to understand notebooks like these from matteo, which are 5-6GB each. If we set shell memory limits, this class of process would be affected and would die at whatever number we set. I.e., if we say "no more than 2GB per process," these can no longer be run. (Remember, I cannot limit memory PER USER; it's done per process. A sketch of what such a limit looks like follows the listing.)
matteo 1848 0.2 1.2 3626600 3331772 ? Ssl Jan13 13:12 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-be89783e-f375-471a-bcea-46519a32f8bb.json
matteo 15870 0.1 1.4 4146388 3785696 ? Ssl Jan16 4:17 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-9287228d-9ec4-42a7-8ef5-a4b900838ca6.json
matteo 11680 0.8 2.6 7332132 6941816 ? Ssl Jan17 7:32 /home/matteo/software/R-3.2.2/lib64/R/bin/exec/R --slave -e IRkernel::main() --args /frazer01/home/matteo/.local/share/jupyter/runtime/kernel-0523c81b-84ef-4cb8-ae76-406e58ae792f.json
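(To make the "2GB per process" point concrete, this is roughly what such a limit looks like. The number is just the example figure from above, and whether it goes in limits.conf or a login script is exactly the sort of thing the call should settle:

# /etc/security/limits.conf style entry -- per-process address space cap, in KB
# (note: applies to each process separately, NOT to a user's total)
*       hard    as      2097152

# or the equivalent shell form, e.g. in a profile script:
ulimit -v 2097152

Either way, allocations fail once a process crosses 2GB, so notebooks the size of matteo's above could no longer run.)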
There are a fair number of Jupyter Notebook tickets that discuss a need for the code to clean these up itself. Noting one or two for reference; no solution seen in them yet.
https://github.com/jupyterhub/jupyterhub/issues/680
Possibly interesting item in the last comment of this, but to be clear it's USER driven:
Also as I watch memory, this appears to be something that should be running as a compute job:
hel070 20810 0.0 0.0 133772 3664 ? S 2016 0:00 python /frazer01/projects/CARDIPS/pipeline/Hi-C/script/samToBe2d
hel070 29645 16.0 0.0 183800 32252 ? R 02:57 0:00 /frazer01/home/hel070/anaconda2/lib64/R/bin/exec/R --slave --no-restore --file=/frazer01/projects/CARDIPS/pipeline/Hi-C/script/convert3ColToMatrix.forAssign.R --args CM_4850_A.3col perContactWithHapAndCount/CM_assignHap_matrix/CM_4850_A.mat CM
Is there a reason this is running on the head node? It keeps bubbling to the top of memory use (brief and small usage, but it spikes) and then spawns another R. So it feels like a job-based method would be more useful.
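(If it does move off the head node, something like this is all it would take, assuming SGE and that the script runs fine non-interactively; the memory request is a placeholder:

qsub -N samToBe2d -cwd -b y -l h_vmem=4G \
    python /frazer01/projects/CARDIPS/pipeline/Hi-C/script/samToBe2d

Then the scheduler worries about where it runs instead of the head node.)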
Also for the discussion: if all these notebooks are needed, we currently have the RAM in this unit (and fl-hn2) half populated at 256GB using 32GB sticks. (It takes 64GB sticks, but they are often quite expensive.)
We could double the memory with 8 x 32GB sticks filling it up to 512GB.
That would require purchase approval. So perhaps a combination of that and some cleanup methods would reduce the swapping I see.
Thanks Paul for digging into this. It looks like most of them are dead/unused notebooks. For my processes, I think I didn't kill them properly; I didn't even know they existed until this morning. I usually open a notebook from a screen session. When I want to kill the notebook, I attach that screen, press Control-C twice, and then kill the screen. But occasionally I might kill the screen directly without killing the notebook first; I suspect that might be why there are old notebooks hanging around. And I highly suspect the old notebooks from @mkrdonovan and @s041629 are there because they were not killed properly either.
I agree we should have some kind of policy to limit each process's memory usage.
It's weird that even though those old notebooks are hanging around, I have no way to access them: once I kill them (whether properly or not), they no longer have any port to connect to. But somehow they still seem to be running in the background.
Yep. If you read some of those GitHub issues for the software, this is due to various issues/complexities in the network daemon code. The main issue for me (or a script) is that I don't think I can determine what "state" they are in. Let me see if I can learn something from their socket state.
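(For the record, this is the sort of thing I mean by socket state -- a couple of ways to see what network descriptors one of those kernel PIDs still holds; the PID is just one from the listing above, and you need to be that user or root to see its descriptors:

lsof -a -p 15650 -i        # only the network descriptors held by that PID
ss -tanp | grep 15650      # or: all TCP sockets with owning process, filtered to the PID

Whether that reliably distinguishes a live kernel from an orphaned one is what I want to find out.)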
I'm going to comment to clarify how Jupyter notebooks should be handled, since I just discussed this with Hurley and we found a disconnect.
When you type "jupyter notebook" into the terminal, it launches a Jupyter notebook server on a port.
This server A) lets you connect through a browser and B) manages all of your notebook kernels. You only need one of these running at a time, so having a single screen session that runs your server is fine. These processes can get quite old and should be the only processes that run for a long period of time. If we set a timeout on jobs (say, one month), then this server will die at the end of each month and need to be restarted. When your notebook server dies, it kills all of the open kernels running from it.
Inside of the notebook server on the web interface, there is a "Running" tab. From here, you can click "Shutdown" on any notebook. This is the equivalent of being in the notebook and selecting "File"-->"Close and halt". This will kill the kernel for that particular notebook. The server will still run, and you can manage other notebooks.
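(On the command-line side, reasonably recent notebook versions also have a way to see which servers you have running without digging through ps -- hedged on the installed version supporting it:

jupyter notebook list

That prints the URL of each server you started, which makes it easier to spot ones you forgot about.)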
If there are any other questions, let me know.
The above makes sense to me. I assumed these were perhaps notebooks spawned from that parent web interface as well.
Note there is no "job" involved here on the head node, as in a scheduler job. The processes on the head nodes are run by users directly, and I have only a few ways to control them, all of which are somewhat disruptive. I can only use cgroups or shell limits to control the head nodes; they are NOT in the scheduler (per a discussion a long time ago).
I can set user CPU time limits, but that's not quite the same as a wallclock time limit (which a scheduler could provide) and can be rather confusing. It would be nice if the software had some kind of "terminate after 24 hours of no activity" option.
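(For what it's worth, newer notebook releases do grow an option along those lines -- kernel culling. I have no idea whether the versions installed here support it, so treat this purely as a sketch of what it looks like:

jupyter notebook --MappingKernelManager.cull_idle_timeout=86400 --MappingKernelManager.cull_interval=300
# kill kernels idle for 24 hours, checking every 5 minutes; the same settings can live in jupyter_notebook_config.py

If that works, it would be a far gentler fix than me killing things by hand.)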
I think basically training all Notebook users in exactly what you said above is really the most freedom-preserving method to deal with this.
The kernels each appear as their own process in top/htop/ps (*note on this at the bottom). The main thing is to kill the notebooks you aren't using, and to only have a single notebook server open at a time. I think Jupyter has a weird interaction when running multiple servers, each for different kernels.
*note on processes: for some reason, the jobs are listed multiple times, all with the exact same usage, but the actual usage is only for one. For example, I may see in htop:
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
100% CPU, 10GB RAM billsIpythonKernel/some-long-string-of-numbers.py
However, there is actually only one 100% CPU process running, and I am only using 10GB of RAM, not 50GB. The long string of numbers is a unique hash for each notebook kernel, so by checking the numbers against your other processes, you can see which processes are unique and which are listed multiple times.
I think that you sometimes just count how many processes we have running and total their RAM, Paul, and this is what creates a weird disparity between your report and the actual usage on the cluster. For example, in December when you said mdonovan had ~80 processes running, she really only had 7, but many of them were listed at least 10 times. I don't know how to fix this, and I don't know how to make it easier to see which kernel is taking up a lot of RAM. Shutting down notebooks as necessary, or porting them out to the compute nodes, should help keep the head nodes free, however.
I'm looking purely at ps output on the single host (no cluster involved, as in the SGE cluster). But it's very possible, if this is threaded code, that there is some shared RAM going on even though they have unique PIDs.
For example, I am looking at this:
ps aux|grep IRkernel|grep mdonovan
They all have low memory percentages (the fourth column in ps). But I interpret that as twelve processes (all have unique PIDs), and I see each .json ID is unique. Does that translate to "12 notebooks"?
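(One way to check: NLWP is the thread count inside each PID, and RSS is reported once per process, so if NLWP is high the memory is not actually being multiplied:

ps -u mdonovan -o pid,nlwp,pmem,rss,args | grep IRkernel

Plain ps lists processes, not threads, so twelve unique PIDs there really are twelve processes; htop is the one that shows every thread as its own row by default.)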
//
Can notebooks be run via Qlogin? I thought the main reason people were running on the head nodes was that they are the only exposed systems and folks didn't always understand SSH port forwarding.
I do have a script for tunneling ports through sdge to the head nodes, so we could run them without exposing the compute nodes in an interactive setting. However, I am unsure how this works with the job scheduler; my understanding is that they would die if they went above the requested RAM or CPU usage.
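(For the record, the port-forwarding piece that trips people up is roughly this; hostnames and the port are placeholders, and it assumes the notebook is started on a compute node inside a qlogin/qsub session:

# on the compute node:
jupyter notebook --no-browser --ip=$(hostname) --port=8890
# on your laptop, tunnel through the head node to that compute node:
ssh -L 8890:<compute-node>:8890 you@<head-node>
# then browse to http://localhost:8890

Whether the scheduler's memory/CPU limits make this workable for big notebooks is the open question above.)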
As far as mdonovan goes, I am in htop and currently see 4 unique notebook kernel hashes, each displayed 3 times. So this would look like 12 processes in ps aux | grep, but is in actuality only 4 notebooks.
Here is a screenshot
For another example, here is a single notebook listed about 10 times on flh2 (one of mine).
The above is probably showing threads.
Also, while htop is nice, I believe it's showing the same "mem%" concept as ps when sorted with "M".
atop in memory mode is possibly helpful for seeing some details that I'm not sure htop delves into.
atop
Press "m" and wait a few seconds.
SWAPSZ may be the area we want to alert on for "slowness".
Oh, and you can show threads with "y" in that same view. So I'm watching that for a bit.
I am showing 54% of the system RAM being used by this process @hirokomatsui
What is it?
hiroko 11985 99.9 53.6 141982084 141757780 ? R Jan18 711:12 perl /frazer01/home/hiroko/install/circos-0.69-4/bin/circos -conf circos.conf
While I'm new here, I'd like to better understand what the head nodes are being used for, compared to perhaps adding some high-memory nodes with exposed IP addresses (basically some "notebook servers"). I can explain more if folks would like.
Sorry, I should have run it on the compute nodes. I've killed it. It's drawing software; I don't know much about it yet.
OK, cool. Mostly I'm just trying to see the forest for the trees on the head nodes, and make real suggestions for dealing with them.
Setting policy per #173
Hi @tatarsky,
For whatever reason flh1 has very little free memory, and the head node is super slow. I'm not sure whether it's because of the rsync process that @hirokomatsui runs, or because of open/dead notebooks, or something else. Could you please help us look into it so that we can avoid eating all the memory on the head node in the future?
Thanks, Hurley