Open mrocklin opened 4 years ago
So I'll put in my 2 cents: at the moment in Dask-Jobqueue I am the main bottleneck (when @guillaumeeb was the main person involved he was in a similar position), and I would guesstimate I can devote 4 hours per week max. It would definitely be great to get more people involved, both on the "how do I debug this" and the "how do I work in a convenient way with Dask with my particular cluster constraints" questions! GitHub may not be the best way, I don't know. Who knows, maybe my idea to create teams in #42 may help to structure a few Dask-on-HPC (which is wider than only Dask-Jobqueue) users and get them to talk with each other without me. I think @mrocklin you are super good at this and I'll definitely take any advice from you and follow it blindly on this topic (GitHub or private, whatever you prefer).
A few other comments:
PyTorch stuff in long batch jobs and nobody is using Jupyter on HPC clusters).
A more recent anecdote on the newfangled AI cluster in France called Jean Zay (BTW I started https://github.com/jean-zay-users/jean-zay-doc as a way to get a user-contributed doc on Jean Zay):
sshuttle is a nice tool that lets you work around this, but most people don't know about it (I myself bumped into it completely randomly a few months ago ...). I am very hesitant to talk to Jean Zay user support about that because they may freak out and take some measures to prevent people from using it.
Pangeo also had weekly meetings, which I think helped. Eventually they became a little too frequent for some people, I think. In Dask we switched to monthly, and attendance and engagement have been good.
I'm sorry to hear about the frustration around Jean Zay. My hope is that by creating some more stories of how different HPC centers solve these problems it is easier for less progressive groups to also change. It is maybe easier for Jean Zay to change if they hear that Summit or NERSC has made a very similar change.
I've been wanting to build something like that for some time, but did not find enough bandwidth to put enough together.
My idea was to build this through Pangeo, with means like https://discourse.pangeo.io/c/hpc/8 and https://github.com/pangeo-data/pangeo-for-hpc.
These two places would be perfect to discuss things like
arguing for more support for interactive/Jupyter workflows, dealing with NFS
or
maybe latching on to the Jupyter on HPC thing is a good idea, it seems like there definitely some overlap
or the Jean-Zay story from @lesteve. On this point, I just want to share (I'm still hoping to find some time to do it) how things are deployed at CNES (things like https://discourse.pangeo.io/t/how-about-creating-a-pangeo-hpc-github-repo/150?u=geynard), and hopefully other clusters (like NCAR).
These places would be for high-level integration/deployment documentation or discussions. I'd really like to see things like
creating more stories of how different HPC centers solve these problems it is easier for less progressive groups to also change. It is maybe easier for Jean Zay to change if they hear that Summit or NERSC has made a very similar change.
For the rest, I'll stick with the dask, dask-jobqueue and dask-mpi GitHub issue trackers. I think these are good for more specific questions or problems.
I've no strong feelings about workshops or regular virtual meetings. I often find them useful when I can participate, but I have too many work or private constraints to be able to attend them.
Also cc @wtbarnes who was active on Pangeo for HPC.
I'm glad to see this discussion happening. I agree having a centralized forum to organize / discuss around Dask on HPC would be useful. Since Pangeo has incubated much of this discussion already, I'm happy to suggest up the Pangeo HPC forum: https://discourse.pangeo.io/c/hpc/8
However, I also understand that it may be advantageous to separate this effort from Pangeo.
Based on the posts above, it sounds like cultural issues at HPC centers are part of the challenge. The HPC community is more traditionally academic than the broader open-source community, so I think something that can be effective at pushing for Dask support is peer-reviewed publications in computing journals and presentations at conferences like Supercomputing. To that end, we did recently publish a paper on scaling of Pangeo (basically xarray + dask) in an HPC context: https://link.springer.com/chapter/10.1007/978-3-030-44728-1_12
Odaka, Tina Erica, Anderson Banihirwe, Guillaume Eynard-Bontemps, Aurelien Ponte, Guillaume Maze, Kevin Paul, Jared Baker, and Ryan Abernathey. "The Pangeo Ecosystem: Interactive Computing Tools for the Geosciences: Benchmarking on HPC." In Tools and Techniques for High Performance Computing, pp. 190-204. Springer, Cham, 2019.
What would really get the attention of the HPC folks would be a paper and press release from a national lab that "Dask on HPC solves major problem X using 100_000 cores in 1 hour". But I know this is not Dask's main aim or niche.
There are the Interactive HPC workshops at both the SC and ISC conferences, which are probably good places to push things a little while also getting publications. ISC this year is going online, and the deadline for proposals for that workshop has been extended to the end of this month (though there's been no communication so far about whether this workshop will go online as well).
There's also a relatively new EU project called Fenix that might be a good vehicle for getting Dask functional at more HPC sites. I'm not involved, but I know that this project covers authentication/security with major European sites involved, and JupyterHub is mentioned in a number of its deliverables. Interactive Supercomputing is something of a hot topic (especially with them), so it is probably a useful angle (politically).
@ocaisa I am one of the organizers of both the ISC and SC Interactive HPC workshops this year; I've worked on PyHPC at SC in the past as well. ISC announced its decision about the main conference going online about 5 days ago, but has not yet informed us what the plan is for workshops. As soon as we know we will update the interactive HPC website, but yes, the deadline is the end of this month (extended).
With respect to conferences around HPC, in the US there is also PEARC, this year in Portland (maybe). Vendors also run things like user group meetings for sites that run their systems. These are more exclusive, but if HPC centers can raise the flag for things that "make Dask work better", things can happen. I organized an interactivity BoF for the Cray User Group meeting that was to happen next month in NZ, but that's been pushed off to later in the year.
The cultural aspect. This depends on a lot of factors like what size the HPC center is, what kind of institution it is embedded in, what its mission is, what its funding model is, what its funding source is, and frankly whether management not only executes but has vision. Some places have more discretion or compatible requirements than others, but they all have common core concerns around security, data integrity, access, etc. @rabernat is right that hard numbers and demonstrations help a lot; I don't think these have to come in through papers per se though.
Sorry I forgot to answer
I'm happy to suggest up the Pangeo HPC forum: discourse.pangeo.io/c/hpc/8
I am more than happy to recommend the Pangeo forum to discuss dask-jobqueue things and to get involved there. In fact I kind of forgot about this but I posted once there (and I also get the weekly summary by email from this Discourse).
Thanks for all your insights, this definitely helps to get a better understanding of the bigger picture.
I would also be interested in following Jupyter in HPC activity from time to time, be it only to understand better what is going on and to have some arguments for convincing the Jean Zay people. @rcthomas is there some kind of central place to follow the Jupyter in HPC activity or does that happen more in multiple places: Jupyter github issues, Jupyter discourse, private emails etc ... ?
As an aside, I'm going to try to organize an "EA and HPC" birds of a feather at this year's Supercomputing, which will be in Atlanta this November. (Presuming it, too, doesn't fall victim to the pandemic.)
@ocaisa The ISC HPC Workshop on Interactivity is going to be delayed to 2021. There will be an interactivity workshop at SC20 though, and the Interactive HPC website should get updated soon to include that workshop.
@lesteve the pandemic derailed something that might have been very helpful to you, which was a Jupyter community workshop on security. Eventually that workshop will happen so watch the Jupyter blog for announcements.
I'd say right now the thing that's been sustained is an informal monthly check-in of folks who run JupyterHub or administrate Jupyter at HPC centers. It's titled "batchspawner and friends" but the topics are usually anything specific to Jupyter at HPC centers: what everyone is up to, what we've learned recently, planning meetings or meet-ups, strategizing, coordinating work as best we can when needed. It's been going 6 months now, it's an experiment, but I've found it useful. Looking at dask-jobqueue after looking at batchspawner, it seemed to me there might be a lot to share about there.
I think the right place for online discussion not directly related to github issues is the Jupyter Discourse. There's an HPC topic but the above call seems to get more interaction going.
Eventually that workshop will happen so watch the Jupyter blog for announcements.
Great to hear that!
Looking at dask-jobqueue after looking at batchspawner, it seemed to me there might be a lot to share about there.
Yep, there is definitely some overlap. If you think that's an option, I would be interested in attending one of these meetings to get a better feeling for what this looks like in practice.
cc @jglaser (in case this is of interest 😉)
There are many HPC administrators that support Dask at their institution. They all have variants of the same problems (wrestling with job schedulers, getting fast interconnects to work, arguing for more support for interactive/Jupyter workflows, dealing with NFS). It would be useful if these folks had a place to communicate with each other comfortably, and hopefully generate content to engage other HPC centers.
Today probably the most active place is the dask-jobqueue issue tracker (thanks @lesteve and @guillaumeeb for maintaining that). Are there other community building activities that we might want to try?
A comment from @rcthomas
Are there options that we could do here that would interest people?
cc @lesteve @guillaumeeb @rcthomas @kmpaul @andersy005 @piprrr @rabernat @jhamman @zonca @stuarteberg @rsignell-usgs @pwolfram @d-v-b