hjoliver opened this issue 7 years ago
Although it sounds like a good idea, I have many questions.
There is a pure Python implementation of the Open Grid Forum API called SAGA, which does look like it can do most of what we want, but it is seriously heavyweight at almost 40K lines of code (cf. cylc, which has about 55K lines of internal code). Just the PBS adaptor has >1300 lines of code, with lots of logic to deal with the outputs of `qsub`, `qstat`, etc. from different versions of PBS. It is very impressive, but it is also quite scary. (We only have 78 lines of code in cylc's PBS handler!)
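For reference, this is roughly what submission looks like through SAGA's Python API (a minimal sketch from memory of the `saga-python` package; the host URL, command and file names are illustrative, not taken from our code):

```python
import saga

# Connect a job service to a PBS cluster; the adaptor hides qsub/qstat.
js = saga.job.Service("pbs+ssh://hpc.example.com")  # hypothetical host

jd = saga.job.Description()
jd.executable = "/bin/echo"
jd.arguments = ["hello from SAGA"]
jd.output = "job.out"
jd.error = "job.err"

job = js.create_job(jd)
job.run()    # submits via the PBS adaptor
job.wait()   # blocks until the job finishes
print(job.state, job.exit_code)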
The alternative is the C APIs + Python binding solution. The problem is that it looks very fragmented and has a lot of dependencies. Are we guaranteed that sites using cylc will all have the DRMAA C libraries installed for their batch schedulers? (Or do they come with the batch schedulers as standard these days?)
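For comparison, the Python binding (`drmaa-python`) is a thin wrapper over the vendor's C library, so the calling code is short, but it only works if the site has that scheduler's `libdrmaa` installed and `DRMAA_LIBRARY_PATH` pointing at it. A minimal sketch (command and arguments are illustrative):

```python
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"
    jt.args = ["10"]
    job_id = session.runJob(jt)
    # Block until the job completes, then inspect its exit status.
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(job_id, "exited with status", info.exitStatus)
    session.deleteJobTemplate(jt)
```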
I think the best strategy for cylc is to have a good look at the Open Grid Forum documentation and see if we can align our configuration, naming convention and terminology with theirs. If and when the Open Grid Forum technology and APIs really become out-of-the-box standards for batch schedulers, we will then have no problem migrating cylc to use those APIs.
Yeah, our current batch system support is lightweight and flexible. A downside is that we can't express batch system directives generically, something that would make suite portability easier. I presume this API allows that, but if it is (I've heard rumours) "lowest common denominator" support, that might be a problem for some of our use cases.
Ideally, batch schedulers will move towards aligning their CLI options and job directives with the DRMAA API; then suite portability will be solved without us having to do anything!
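To make the "lowest common denominator" concern above concrete: the DRMAA 1.0 job template only defines a small set of generic attributes, and anything scheduler-specific has to be passed as a raw native specification string, which is exactly the non-portable part. A rough sketch assuming the `drmaa-python` binding (the directive string is illustrative PBS syntax):

```python
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    # Portable attributes defined by the DRMAA 1.0 spec are fairly basic:
    jt.remoteCommand = "my_task.sh"
    jt.jobName = "my_task"
    jt.outputPath = ":/path/to/job.out"
    # Anything beyond that (select statements, memory, GPUs, queues, ...)
    # drops through to a scheduler-specific string:
    jt.nativeSpecification = "-l select=1:ncpus=4:mem=8gb -q myqueue"
    session.runJob(jt)
```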
Some time ago I used PBS Torque and wrote a small Java program for it. It gained a few more users, who asked for DRMAA support. In the end I never finished the implementation, but I remember using Galaxy's API as a reference.
Galaxy is a well-established bioinformatics middleware (I like it for visualisation, but you can use it for workflow orchestration, data management, etc., as well).
It is written in Python and supports DRMAA: https://docs.galaxyproject.org/en/latest/admin/cluster.html
It has been ages since I looked at Galaxy's code, but I remember I enjoyed it and found what I was looking for quickly. So perhaps this file could be useful for whoever implements it: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/drmaa.py
Cheers
I was reading my news feed when I saw this blog post by Dask: Dask on HPC.
There is a researcher in Auckland who uses Dask to process large climate datasets in his own notebook, as Dask makes it easy to break the work down into smaller chunks that are processed in parallel.
I haven't used it myself, but I have noticed a few Python tools coming out with Dask support, e.g. Apache Airflow.
The blog post has a paragraph that could be interesting for this PR:
> Dask interacts natively with our existing job schedulers (SLURM/SGE/LSF/PBS/…) so there is no additional system to set up and manage between users and IT. All the infrastructure that we need is already in place.
I am planning to spend some training time between now and the end of the year learning more about Dask, to understand how it can be used by tools and libraries.
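That native interaction presumably goes through `dask-jobqueue`, which submits ordinary batch jobs that each host Dask workers. A minimal sketch of the PBS case (queue name, resources and worker count are made-up values):

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each "job" here is a normal PBS job that starts Dask worker processes.
cluster = PBSCluster(
    queue="regular",      # hypothetical queue name
    cores=24,
    memory="100GB",
    walltime="01:00:00",
)
cluster.scale(10)  # ask for 10 workers; dask-jobqueue submits the PBS jobs

client = Client(cluster)
# client.submit(...) / Dask arrays etc. now run on the PBS-backed workers.
```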
I believe Met Office post processing people already use Dask inside Cylc suites (presumably to distribute single Python tasks across multiple cores). We should talk about this next week in Exeter.
E.g. Iris is configured to use Dask at the MO. However, my belief is that Dask's primary focus is on workflow and parallelisation at the application or sub-application level, whereas Cylc's focus is on the level above.
Another recent project aimed at this space is the Portable Submission Interface for Jobs (PSI/J) - Python Library.
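Its Python API looks broadly similar in spirit to the DRMAA binding, with per-scheduler executors behind a common job spec. A rough sketch in the style of the psij-python quickstart (the executor name and command are illustrative):

```python
from psij import Job, JobSpec, JobExecutor

# Pick a backend by name; "local" also works for testing without a scheduler.
executor = JobExecutor.get_instance("slurm")

job = Job(JobSpec(executable="/bin/date"))
executor.submit(job)
job.wait()
print(job.status)
```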
Could we replace our custom support for each batch scheduler (and batch scheduler job submission via the shell: `qsub` etc.) with generic use of the DRMAA?

From wikipedia:
Also:
Possible problems?
`at` jobs (but maybe we could write an API back end for them).