De-couple file upload/download from scheduler communication methods

chrisjsewell commented 2 years ago

Currently, aiida uses the same Transport plugin (e.g. direct or ssh) for uploading/downloading files to/from the Computer, as it does for communicating with a Scheduler via command executions (e.g. to poll for completed jobs).

There are potential use cases though where we may what separate methods to achieve these tasks; one example being that SLURM now has a REST API (https://slurm.schedmd.com/rest.html), and so maybe you want to upload files with SSH, then use the REST API to control SLURM.

This also came out of the meeting with @zhubonan, regarding a Fireworks scheduler

cc also @csadorf @giovannipizzi, correct me if I'm wrong in my takeaway?

csadorf commented 2 years ago

I think it would generally be a good idea to conceptually separate these two tasks. What is your estimate as to how much refactoring and restructuring that would require? Also, it might be worthwhile to concretize this in an AEP.

zhubonan commented 2 years ago

For the fireworks scheduler, the simple solution for this would be to expose the CalcJobNode to the Scheduler so the job size information can be retrieved this way.

At the moment, the scheduler had to download the submission script from the remote, parse it and submit the job to the launchpad server.

Below is the line for the current implementation:

https://github.com/zhubonan/aiida-fireworks-scheduler/blob/cf70259b557a895e35019ef17e2cfaeae6df4358/aiida_fireworks_scheduler/fwscheduler.py#L171-L206

chrisjsewell commented 2 years ago

What is your estimate as to how much refactoring and restructuring that would require?

Well let's say for now, I think it's plausible, but certainly not trivial. At least a few weeks of labour, so not likely to happen in the short-term 😬

giovannipizzi commented 2 years ago

Thanks @chrisjsewell for starting this issue (and @zhubonan for your presentation on the aiida-fireworks-scheduler!)

Here a few thoughts:

at the moment the design is:
1. the transport does not know anything about AiiDA (in principle, we could export it as a python package external to AiiDA if we wanted to)
2. the scheduler only knows about a transport, but not about AiiDA (and the information e.g. on number of CPUs etc is passed in a common data format (currently as a class wrapping a dict, but it's in principle all JSON-serializable and one could convert the classes - mostly for validation - into a JSON-schema): its' the JobTemplate and the JobInfo in this file
3. I would therefore be hesitant to pass an AiiDA object to the scheduler. On the other hand, I would still agree we provide a way to avoid connecting via SSH just to get back the submission script and parse it. I think this can be achieved just passing an appropriate data structure (not AiiDA-specific) - I'll think to it a bit more and write back as soon as I have a practical idea on how to do it
4. at least for the goals of Bonan, maybe it's not needed to decouple more the transport and scheduler. At the moment the scheduler gets as a parameter the transport, but can just ignore it in principle. And instead reimplement the functionality to manage the queue (e.g. via a REST API). I would think about redesigning the separation only if, e.g., we need to pass in a "standard way" more than one transport to a scheduler (e.g. one to copy files, one to connect, etc.). At the moment I think the approach of Bonan (just "ignore" the transport, except for the retrieval of the job options during submission, where we should change the API) is OK.

giovannipizzi commented 2 years ago

So, here is what I think is a simple way to implement this (at least to solve the issue of @zhubonan where, during the submission, he needs to SSH to the scheduler just to fetch back the script, and he has to parse it back):

add a new "feature" of a scheduler plugin (see e.g. example here), say "submit_via_script", that is True by default (if not defined) to keep the current behaviour. Bonan's plugin would set it to False instead
Put these four lines of the execmanager in an if statement to be executed if the submit_via_script feature is True: https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L331-L334
If the feature is False, just call a new method instead (that should be defined in the base Scheduler class and raising NotImplementedError by default, and to be (re)implemented by those plugin subclasses that set the feature to False). E.g. submit_from_job_template(self, workdir, job_tmpl) (the latter parameter being a JobTemplate instance, see also below - the same currently passed to e.g. _get_submit_script_header). This is defined by us but not an AiiDA ORM class, allowing to keep the AiiDA scheduler implementing transparent to who is calling it, be it AiiDA or some other code.
Note: for a plugin with the feature to False, one can just simply skip implementing the get_submit_script_header and the _get_submit_script_footer, so the script will just contain the mpirun line, the env vars etc, but no specific line with information on the resources (or one could just dump a summary of the JobResource if one wants to keep some logging). Or add only the few things that need to be set in the file itself (e.g. it might be easier to set the environment variables directly in the bash file, similarly to what the direct scheduler does, but leave the rest of the information on resources, wall time, ... for the submit_from_job_template function)
The only question that remains is how to pass the job_tmpl object in the execmananger to the new function submit_from_job_template. Here is what I suggest. Currently, the job_tmpl object is generated in the presubmit method of the CalcJob, and then passed here to the get_submit_script method of the scheduler. As I mentioned, I would keep this call (we need anyway to prepare the part of the script with the mpirun etc.). Then we need to store the content of this somewhere "inside" the CalcJobNode, but luckily this is already done a few lines below: the content is json-serialised and stored in the node repository, in the .aiida/job_tmpl.json file (this has been there from the very beginning of AiiDA). Since this is something that we control and is currently hardcoded, I think it is safe to assume it's always there (we might just want to add a comment in the code, to mention that the filename .aiida/job_tmpl.json should not be changed, as it is then used to parse this information back in the exec manager - so we don't forget about this). So, in conclusion I would replace these lines https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L331-L334 with the following ones (untested, just to give an idea):
```
workdir = calculation.get_remote_workdir()
if scheduler.get_feature('submit_via_script', True):
   submit_script_filename = calculation.get_option('submit_script_filename') 
   job_id = scheduler.submit_from_script(workdir, submit_script_filename)
else:
   # ADD HERE CODE TO: 
   # 1. get the file `.aiida/job_tmpl.json` from the repository of `calculation`
   # 2. serialise this back into a JobTemplate object job_tmpl
   job_id = scheduler.submit_from_job_template(workdir, job_tmpl)
calculation.set_job_id(job_id) 
```
as a final comment, to make this work properly, we need to extend the get_feature method of the scheduler to accept an additional optional parameter with the default value, when the key is not present, instead or raising a KeyError (but this is easy).

If you agree that this is a good approach, and @zhubonan confirms that this would solve his problem (move his code from submit_from_script to the new function submit_from_job_template, and avoid any SSH connection and parsing but get the information he needs directly from the job_tmpl), then it should be very easy to implement (and of course we need to add a few lines of documentation of this new feature). For @zhubonan - the important question is also if the job_tmpl contains all the information you need (but I would be surprised if this is not the case, since the information you parse should come directly from the job_tmpl).

zhubonan commented 2 years ago

@giovannipizzi Thanks, I think what you suggest should work well! - This would also make the plugin code more concise as there is no need to do a round-trip of generating a "fake" script during upload and parse it back in for submit

I would like to add that the firework scheduler still uses the transport object attached for getting the computer and username as the identifiers:

https://github.com/zhubonan/aiida-fireworks-scheduler/blob/cf70259b557a895e35019ef17e2cfaeae6df4358/aiida_fireworks_scheduler/fwscheduler.py#L171-L206

so one won't get jobs of other machines/other accounts in the same machine (it is possible that the user can have two accounts of the same machine, created as two separate Computer nodes). Getting this information does not require the transport being opened. The the proposed change would

E.g. these two lines below should be kept:

https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L328-L329

If we want the scheduler to work without the Transport then I think the Computer itself needs to be passed as an argument as it contains this information.

chrisjsewell commented 2 years ago

Note I'll probably be looking to do this, in conjunction with https://github.com/aiidateam/aiida-firecrest

giovannipizzi commented 2 years ago

I'm adding a comment to remember, when redesigning this, to take (at least) all the following use cases into account:

needs of aiida-fireworks by @zhubonan (and my comment above)
needs of aiida-firecrest developed by @chrisjsewell
needs of aiida-hyperqueue that @mbercx will soon start developing (at least one thing that I see: the option to specify the options to the scheduler not as lines in the submit script, but as command line arguments; I would still suggest to write those in the file to know what has been asked and not lose this information)
needs of the scheduler (for slurm, supporting job arrays) implemented as part of microsoft/aiida-dynamic-workflows developed by @jbweston

giovannipizzi commented 2 years ago

A few things learnt from aiida-hyperqueue (@mbercx edit this comment if I'm forgetting something), see also aiidateam/aiida-hyperqueue#2

we want the submission script to be executable: we could do it in any case, I think (it shouldn't harm in the current cases), or e.g. add a scheduler class attribute, e.g. _EXECUTABLE_SCRIPT, that if present will ask AiiDA to make the script executable
allow for (optionally) wrapping the strings (e.g. for the mpirun command) in double quotes and not in single quotes (being implemented here by @unkcpz : #5250)
(probably?) pass the job_tmpl also to the _get_submit_command function, so that if the options need to be specified on the cmdline rather than in the header of the submission file, this can be easily implemented (however, for easier debugging and provenance, we should suggest to put the directives in the script if possible?)

aiidateam / aiida-core

De-couple file upload/download from scheduler communication methods #5084