I also think the JupyterLab system monitor would be great for users to check how much memory they are using.
@zonca a system monitor would be extremely helpful. In general we'd love to have more monitoring both for users and the cluster so we know what to request on allocations.
Server-side we can install Prometheus and Grafana: https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html
@zonca Prometheus and Grafana look like they'd be perfect. The integration with Slack is especially nice.
@zonca would it be possible to integrate JupyterHub on Jetstream with XSEDE batch resources?
Right now one of our analyzers is going back and forth between interactive analysis and submitting jobs that use on the order of 600 CPU hours. I think this would be a simple allocation to get, but making it easy to submit jobs from the Jupyter environment would be awesome and would significantly increase the likelihood that people use those resources.
which supercomputers are they using? how much data (order of magnitude) needs to move between the interactive and batch environments (input and output)?
@ziqinghong could you comment?
The supercomputer currently used for this is the SLAC cluster. I think the input data is on the order of 100 GB and the output data is an order of magnitude less.
A month's worth of small detector data is 5-10 TB. (Amy, these are continuous non-triggered data, so they're bigger than the usual numbers we quote for SNOLAB.) A simple model is that each batch job processes one dataset (an hour of data). The input is ~30 GB; it gets turned into <1 GB of output.
thanks @ziqinghong, does the processing use MPI? how many nodes does a typical job use, and how long does it take? is the software multithreaded?
@zonca I don't believe the software is multithreaded or using MPI.
I'll let @ziqinghong comment on how many nodes and how long a typical job takes.
We don't usually use MPI. Our jobs are parallelized by splitting up datasets and running identical processing on each of them.
How long a typical job takes depends on how many nodes/jobs we spread the task across. It takes O(500) CPU-hours to process ~10 TB of data. If we spread that across 200 cores (which is consistent with our typical usage), it'll get done in about 3 hours.
do you run single or multi-threaded code? is it Python?
@ziqinghong my understanding is that the code is Python and that it is single-threaded.
so 200 cores single-threaded, you mean 200 nodes, right?
Single-threaded. Not sure if I know the difference between 200 nodes and 200 cores... 200 x single thread.
thanks, for a workload like this one that doesn't use MPI and is not too large, I think we could execute it directly on Jetstream, inside Kubernetes, with dask.
can you prepare a dataset + the code to do this data processing stage, with some documentation on how to execute it, ideally inside a dedicated GitHub repository (the code, with pointers to the data)? Then I can check whether we can execute it with dask.
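for reference, this is roughly the execution pattern I have in mind (just a sketch: the `worker-spec.yml` pod template, the worker image with the CDMS software and CVMFS, and the worker count are assumptions, and the dask-kubernetes API has changed across versions):

```python
# sketch only: start dask workers as Kubernetes pods on Jetstream, then attach a client.
# worker-spec.yml is a hypothetical pod template pointing at an image that has the
# CDMS software available (e.g. with CVMFS mounted).
from dask.distributed import Client
from dask_kubernetes import KubeCluster  # "classic" dask-kubernetes API

cluster = KubeCluster.from_yaml("worker-spec.yml")
cluster.scale(20)         # could grow toward ~200 single-threaded workers to match current usage
client = Client(cluster)
print(client)             # sanity check of scheduler/workers before submitting any work
```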
I copied a little bit of raw data over, and a bunch of scripts. I could start an interactive data processing run in the Jupyter terminal in the browser by doing:

    source /cvmfs/cdms.opensciencegrid.org/setup_cdms.sh V03-01
    cd /cvmfs/data/tf/AnimalData/processing/
    python AnimalDataAnalysiscont_blocks.py 20190718102534 20190718102534
The code is nasty, as it has a bunch of locations hard-wired... It also needs an existing /home/jovyan/work/blocks directory... If the work directory gets reset, it'll error out.
@zonca If you could give it a try, let me know if you encounter more errors. I'm still running it... Seems like it'll take O(10 minutes).
@ziqinghong if you point me to the data you're using, we can make sure we get that into the data catalog. Once we've got the code updated that is :)
thanks @ziqinghong, yes, it runs fine. I'll try whether, using dask, I can run multiple instances of it in parallel.
actually, it would be best if you could give me a set of 10 (or better, 100) different inputs, so I can try running them in parallel.
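something like this is the first thing I'd try (a sketch only: it assumes the environment that starts the workers already has the CDMS setup sourced so the subprocesses inherit it, and the input IDs below are placeholders):

```python
# sketch: fan the existing, unmodified script out over many inputs with dask.
# assumes setup_cdms.sh has already been sourced in the environment running the
# workers, so each subprocess inherits it.
import subprocess
from dask.distributed import Client

SCRIPT_DIR = "/cvmfs/data/tf/AnimalData/processing/"

def run_one(series):
    # one dataset (an hour of data) per task, same invocation as the interactive run
    cmd = ["python", "AnimalDataAnalysiscont_blocks.py", series, series]
    proc = subprocess.run(cmd, cwd=SCRIPT_DIR, capture_output=True, text=True)
    return series, proc.returncode

client = Client()  # local cluster for a first test; a KubeCluster on Jetstream later
series_list = ["20190718102534", "20190718112534"]  # placeholders for the 10-100 inputs
results = client.gather(client.map(run_one, series_list))
print(results)
```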
AWESOME!!!!
there are too many different things in this issue; @pibion, if you are still interested in any of the above, please open a dedicated issue.
This is in no way a priority, but I wanted to record the idea somewhere before I forget it:
It might be nice to have the Theia IDE available in the JupyterHub environment. It looks like there is some support for this: https://jupyter-server-proxy.readthedocs.io/en/latest/convenience/packages/theia.html.
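For the record, a rough sketch of what that wiring might look like through jupyter-server-proxy's config (the "theia start" command is a placeholder; the actual launch command depends on how Theia is installed in the user image):

```python
# sketch for jupyter_notebook_config.py (or jupyter_server_config.py):
# register Theia as a proxied server through jupyter-server-proxy.
# the command below is a placeholder; the real one depends on how Theia is
# installed/built inside the single-user image.
c.ServerProxy.servers = {
    "theia": {
        "command": ["theia", "start", ".", "--hostname=0.0.0.0", "--port={port}"],
        "absolute_url": False,
        "launcher_entry": {"title": "Theia IDE"},
    }
}
```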