cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

DRMAA? #2205

Open hjoliver opened 7 years ago

hjoliver commented 7 years ago

Could we replace our custom support for each batch scheduler (and batch scheduler job submission via the shell: qsub etc.) with generic use of DRMAA?

From wikipedia:

DRMAA or Distributed Resource Management Application API is a high-level Open Grid Forum API specification for the submission and control of jobs to a Distributed Resource Management (DRM) system, such as a Cluster or Grid computing infrastructure. The scope of the API covers all the high level functionality required for applications to submit, control, and monitor jobs on execution resources in the DRM system.

Also:

Possible problems?
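For a sense of what "generic use of DRMAA" could look like, here is a hedged sketch built around the drmaa-python bindings. The `submit` helper is invented for illustration; the template attributes (`remoteCommand`, `args`, `nativeSpecification`) follow the DRMAA Python binding's documented interface, and the whole thing assumes a site-installed DRMAA C library backing the `drmaa` package.

```python
# Hypothetical sketch: generic job submission through a DRMAA session.
# The `submit` helper is invented for illustration, not Cylc code.
def submit(session, command, args=(), native_spec=""):
    """Submit a job via a DRMAA session object and return its job id."""
    jt = session.createJobTemplate()
    jt.remoteCommand = command
    jt.args = list(args)
    if native_spec:
        # Scheduler-specific directives pass through as a "native specification".
        jt.nativeSpecification = native_spec
    try:
        return session.runJob(jt)
    finally:
        # Job templates must be freed explicitly in the DRMAA model.
        session.deleteJobTemplate(jt)

# Real usage would look roughly like:
#
#   import drmaa  # pip install drmaa; needs the scheduler's libdrmaa
#   with drmaa.Session() as s:
#       job_id = submit(s, "/bin/sleep", ["60"],
#                       native_spec="-l walltime=00:05:00")
```

Note that anything not covered by the standard API still has to be smuggled through the native specification string, which hints at the "lowest common denominator" concern below.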

matthewrmshin commented 7 years ago

Although it sounds like a good idea, I have many questions.

There is a pure Python implementation of the Open Grid Forum API called SAGA, which does look like it can do most of what we want, but it looks seriously heavyweight at almost 40K lines of code (cf. cylc, which has about 55K lines of internal code). Just the PBS adaptor has >1300 lines of code, with lots of logic to deal with outputs from qsub, qstat, etc. from different versions of PBS. It is very impressive, but it is also quite scary. (We only have 78 lines of code in cylc's PBS handler!)

The alternative is the C APIs + Python binding solution. The problem is that it looks very fragmented and has a lot of dependencies. Are we guaranteed that sites using cylc will all have the DRMAA C libraries installed for their batch schedulers? (Or do they come with the batch schedulers as standard these days?)

I think the best strategy for cylc is to have a good look at the Open Grid Forum documentation and see if we can align our configuration, naming convention and terminology with theirs. If and when the Open Grid Forum technology and APIs really become out-of-the-box standards for batch schedulers, we will then have no problem migrating cylc to use those APIs.
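One way to read "align our configuration, naming convention and terminology with theirs" is to keep a small translation table from generic, scheduler-neutral resource terms to native directives. A toy sketch, where the table contents, keys, and function names are all made up for illustration:

```python
# Toy sketch: rendering generic resource requests as native directives.
# The generic terms and directive templates here are illustrative only.
GENERIC_TO_NATIVE = {
    "pbs":   {"wall_clock_limit": "-l walltime={}", "memory": "-l mem={}"},
    "slurm": {"wall_clock_limit": "--time={}",      "memory": "--mem={}"},
}

def native_directives(scheduler, generic):
    """Translate {generic_term: value} into a list of native directive strings."""
    table = GENERIC_TO_NATIVE[scheduler]
    return [table[key].format(value) for key, value in sorted(generic.items())]
```

For example, `native_directives("slurm", {"wall_clock_limit": "01:00:00"})` gives `["--time=01:00:00"]`. Aligning on the generic vocabulary now would make a later switch to standard APIs mostly a matter of swapping the rendering layer.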

hjoliver commented 7 years ago

Yeah, our current batch system support is light-weight and flexible. A downside is that we can't express batch system directives generically, which would make suite portability easier. I presume this API allows that, but if it provides (as I've heard rumoured) only "lowest common denominator" support, that might be a problem for some of our use cases.

matthewrmshin commented 7 years ago

Ideally, batch schedulers will move towards implementing their CLI options and job directives in the direction of the DRMAA API, then suite portability will be solved without us having to do anything!

kinow commented 6 years ago

Some time ago I used PBS Torque, and wrote a small Java program. It got a few more users who asked for DRMAA support. In the end I never finished the implementation, but I remember using Galaxy's API as reference.

Galaxy is a well-established bioinformatics middleware (I like it for visualisation, but you can use it for workflow orchestration, data management, etc., as well).

It is written in Python, and supports DRMAA, https://docs.galaxyproject.org/en/latest/admin/cluster.html

It has been ages since I looked at Galaxy's code, but I remember I enjoyed it, and found what I was looking for quickly. So perhaps this file could be useful for whoever implements it: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/drmaa.py

Cheers

kinow commented 5 years ago

Was reading my news feed when I saw this blog post by Dask: Dask on HPC.

There is one researcher in Auckland who uses Dask to parse/process large climate datasets in his own notebook, as Dask helps him to easily break the work down into smaller parallel chunks of processing.
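The chunking idea is easy to show without Dask itself. A pure-Python illustration of what Dask automates, with invented helper names (real Dask code would use `dask.array` or `dask.delayed` and hand the chunks to a scheduler):

```python
# Pure-Python illustration of the idea Dask automates: partition a large
# dataset into independent chunks that could be processed in parallel.
def chunked(seq, size):
    """Split a sequence into fixed-size chunks."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def chunked_sum(data, size=1000):
    # Each partial sum is independent, so a scheduler could farm them
    # out to separate workers and combine the results at the end.
    return sum(sum(chunk) for chunk in chunked(data, size))
```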

I haven't used it, but noticed there are a few Python tools coming out with Dask support, e.g. Apache Airflow.

The blog post has a paragraph that could be interesting for this issue:

Dask interacts natively with our existing job schedulers (SLURM/SGE/LSF/PBS/…) so there is no additional system to set up and manage between users and IT. All the infrastructure that we need is already in place.

Planning to spend some training time until the end of the year to learn more about Dask, and understand how it can be used by tools and libraries.

hjoliver commented 5 years ago

I believe Met Office post processing people already use Dask inside Cylc suites (presumably to distribute single Python tasks across multiple cores). We should talk about this next week in Exeter.

matthewrmshin commented 5 years ago

E.g. Iris is configured to use Dask at MO. However, my belief is that Dask's primary focus is on workflow and parallelisation at the application or sub-application level, whereas Cylc's focus is on the level above.

#1962 (a Python API for configuring suite tasks) will also help us integrate with Dask.

TomekTrzeciak commented 1 year ago

Another recent project aiming at this space is the Portable Submission Interface for Jobs (PSI/J) Python library.
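For comparison with the DRMAA sketches above, here is a hedged sketch of what a PSI/J submission looks like. The `make_spec` helper is invented; the commented usage follows psij-python's documented `Job`/`JobSpec`/`JobExecutor` interface, but treat it as an assumption-laden illustration rather than tested code.

```python
# Hedged sketch of PSI/J-style submission; `make_spec` is an invented
# helper collecting the basic fields a PSI/J JobSpec takes.
def make_spec(executable, arguments=()):
    """Gather basic job-spec fields into a plain dict."""
    return {"executable": executable, "arguments": list(arguments)}

# Real usage would look roughly like:
#
#   from psij import Job, JobSpec, JobExecutor
#   executor = JobExecutor.get_instance("slurm")  # or "pbs", "lsf", "local", ...
#   job = Job(JobSpec(**make_spec("/bin/date")))
#   executor.submit(job)
#   job.wait()
```

Notably, PSI/J keeps the per-scheduler adapters in Python rather than relying on each vendor shipping a DRMAA C library, which sidesteps the dependency concern raised earlier in this thread.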