det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

Useful plugins for JupyterHub / JupyterLab #13

Closed (pibion closed this issue 3 years ago)

pibion commented 4 years ago

This is in no way a priority, but I wanted to record the idea somewhere before I forget it:

It might be nice to have the Theia IDE available in the JupyterHub environment. It looks like there is some support for this: https://jupyter-server-proxy.readthedocs.io/en/latest/convenience/packages/theia.html.
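
For reference, a minimal sketch of how jupyter-server-proxy could expose it from the user image. The Theia install location and command-line flags below are assumptions (they depend on the Theia version), so treat this as a starting point rather than a tested config; it would go in something like jupyter_notebook_config.py:

    # Sketch of a jupyter-server-proxy entry for Theia.
    # Assumes Theia is already installed in the image and available
    # on PATH as `theia`; flags may differ by Theia version.
    c.ServerProxy.servers = {
        "theia": {
            "command": [
                "theia", "start", "/home/jovyan",
                "--hostname=127.0.0.1", "--port={port}",
            ],
            "timeout": 60,
            "launcher_entry": {"title": "Theia IDE"},
        }
    }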

zonca commented 4 years ago

I also think the JupyterLab system monitor would be great for users to check how much memory they are using.

https://github.com/jtpio/jupyterlab-system-monitor

(screencast of the extension omitted)
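
A rough sketch of the server-side piece: the extension itself is installed per the README above (pip and/or jupyter labextension, depending on the JupyterLab version we ship), and the indicator reads its limits from nbresuse config in jupyter_notebook_config.py. The trait names are taken from the extension's docs at the time and may differ for newer versions, and the 8 GB value is just an assumed example:

    # Tell the indicator what the per-user limits are.
    c.ResourceUseDisplay.mem_limit = 8 * 1024 ** 3  # bytes; match the hub's per-user limit
    c.ResourceUseDisplay.track_cpu_percent = True   # enables the CPU indicator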

pibion commented 4 years ago

@zonca a system monitor would be extremely helpful. In general we'd love more monitoring, both for users and for the cluster, so we know what to request on allocations.

zonca commented 4 years ago

Server-side we can install Prometheus and Grafana: https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html
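
Once those are running, per-user memory can also be pulled straight from the Prometheus HTTP API. A hypothetical example below; the Prometheus address, the "jhub" namespace, and the "jupyter-*" pod naming are all assumptions (they match a typical zero-to-jupyterhub install, but check ours):

    # Query Prometheus for the memory working set of each user pod.
    import requests

    PROMETHEUS = "http://prometheus.example.org"  # placeholder address
    query = 'container_memory_working_set_bytes{namespace="jhub", pod=~"jupyter-.*"}'

    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        pod = result["metric"].get("pod", "?")
        mem_gib = float(result["value"][1]) / 1024 ** 3
        print(f"{pod}: {mem_gib:.2f} GiB")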

pibion commented 4 years ago

@zonca Prometheus and Grafana look like they'd be perfect. The integration with Slack is especially nice.

pibion commented 4 years ago

@zonca would it be possible to integrate JupyterHub on Jetstream with XSEDE batch resources?

Right now one of our analyzers is going back and forth between interactive analysis and submitting jobs that use on the order of 600 CPU hours. I think this would be a simple allocation to get, but making it easy to submit jobs from the Jupyter environment would be awesome and would significantly increase the likelihood that people would use those resources.

zonca commented 4 years ago

which supercomputers are they using? and how much data (order of magnitude), input and output, would need to move between the interactive and batch environments?

pibion commented 4 years ago

@ziqinghong could you comment?

The supercomputer currently used for this is the SLAC cluster. I think the input data is order 100 GB and the output data is an order of magnitude less.

ziqinghong commented 4 years ago

A month's worth of small detector data is 5-10 TB. (Amy, these are continuous non-triggered data, thus they're bigger than the usual numbers we quote for SNOLAB.) A simple model is that each batch job processes one dataset (an hour of data). The input is ~30 GB, and it gets turned into <1 GB of output.

zonca commented 4 years ago

thanks @ziqinghong, is the processing using MPI? how many nodes does it use, and how long does a typical job take? is the software multithreaded?

pibion commented 4 years ago

@zonca I don't believe the software is multithreaded or using MPI.

I'll let @ziqinghong comment on how many nodes and how long a typical job takes.

ziqinghong commented 4 years ago

We don't usually use MPI. Our jobs are parallelized by splitting up datasets and running identical processing on each of them.

How long a typical job takes depends on how many nodes/jobs we spread the task across. It takes O(500) CPU-hours to process ~10 TB of data. If we spread that across 200 cores (which is consistent with our typical usage), it'll get done in about 3 hours.

zonca commented 4 years ago

do you run single or multi-threaded code? is it Python?

pibion commented 4 years ago

@ziqinghong my understanding is that the code is Python and that it is single-threaded.

zonca commented 4 years ago

so 200 cores single-threaded, you mean 200 nodes, right?

ziqinghong commented 4 years ago

Single-threaded. Not sure if I know the difference between 200 nodes and 200 cores... 200 x single thread.

zonca commented 4 years ago

thanks, for a workload like this one that doesn't use MPI and is not too large, I think we could execute it directly on Jetstream, inside Kubernetes, with dask.

can you prepare a dataset plus the code for this data processing stage, with some documentation on how to execute it, ideally inside a dedicated github repository (the code, with pointers to the data)? Then I can check whether we can execute it with dask.

ziqinghong commented 4 years ago

I copied a little bit of raw data over, and a bunch of scripts. I could start an interactive data processing run in the Jupyter terminal in the browser by doing:

    source /cvmfs/cdms.opensciencegrid.org/setup_cdms.sh V03-01
    cd /cvmfs/data/tf/AnimalData/processing/
    python AnimalDataAnalysiscont_blocks.py 20190718102534 20190718102534

The code is nasty, as it has a bunch of locations hard-wired... It also needs an existing directory at /home/jovyan/work/blocks... If the work directory gets reset, it'll error out.
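
A possible stopgap until the hard-wired paths are cleaned up is to recreate the expected work directory before each run, either at the top of the script or in a small wrapper (the path below is the one mentioned above; adjust if it differs):

    import os
    os.makedirs("/home/jovyan/work/blocks", exist_ok=True)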

@zonca If you could give it a try, let me know if you encounter more errors. I'm still running it... Seems like it'll take O(10 minutes).

pibion commented 4 years ago

@ziqinghong if you point me to the data you're using, we can make sure we get that into the data catalog. Once we've got the code updated, that is :)

zonca commented 4 years ago

thanks @ziqinghong, yes, it runs fine. I'll see whether I can use dask to run multiple instances of it in parallel.
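
Roughly the pattern I have in mind, as a sketch only: the worker setup is left out (in practice the Client would point at a dask cluster running in Kubernetes with ~200 single-threaded workers), and the list of dataset IDs is a placeholder containing the one series from the example above.

    import subprocess
    from dask.distributed import Client

    def process_dataset(dataset_id):
        """Run the existing single-threaded script on one dataset."""
        subprocess.run(
            ["python", "AnimalDataAnalysiscont_blocks.py", dataset_id, dataset_id],
            check=True,
        )
        return dataset_id

    client = Client()  # local test client; swap for the Kubernetes cluster
    dataset_ids = ["20190718102534"]
    futures = client.map(process_dataset, dataset_ids)
    print(client.gather(futures))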

zonca commented 4 years ago

actually, it would be best if you could give me a set of 10 (or better, 100) different inputs, so I can try running them in parallel

ziqinghong commented 4 years ago

AWESOME!!!!

zonca commented 3 years ago

there are too many different things in this issue. @pibion, if you are still interested in any of the above, please open a dedicated issue.