Open chrisjsewell opened 2 years ago
I think it would generally be a good idea to conceptually separate these two tasks. What is your estimate as to how much refactoring and restructuring that would require? Also, it might be worthwhile to concretize this in an AEP.
For the fireworks scheduler, the simple solution for this would be to expose the `CalcJobNode` to the `Scheduler` so the job size information can be retrieved this way.
At the moment, the scheduler has to download the submission script from the remote, parse it and submit the job to the launchpad server.
Below is the line for the current implementation:
> What is your estimate as to how much refactoring and restructuring that would require?
Well let's say for now, I think it's plausible, but certainly not trivial. At least a few weeks of labour, so not likely to happen in the short-term 😬
Thanks @chrisjsewell for starting this issue (and @zhubonan for your presentation on the aiida-fireworks-scheduler!)
Here are a few thoughts:
So, here is what I think is a simple way to implement this (at least to solve the issue of @zhubonan where, during the submission, he needs to SSH to the scheduler just to fetch back the script, and he has to parse it back):
- An `if` statement, to be executed if the `submit_via_script` feature is True: https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L331-L334
- A new scheduler method (raising `NotImplementedError` by default, and to be (re)implemented by those plugin subclasses that set the feature to False), e.g. `submit_from_job_template(self, workdir, job_tmpl)` (the latter parameter being a `JobTemplate` instance, see also below - the same currently passed to e.g. `_get_submit_script_header`). This class is defined by us but is not an AiiDA ORM class, which keeps the AiiDA scheduler implementation transparent to whoever is calling it, be it AiiDA or some other code.
- Such plugins can keep `_get_submit_script_header` and `_get_submit_script_footer` minimal, so the script will just contain the `mpirun` line, the env vars etc., but no specific line with information on the resources (or one could just dump a summary of the JobResource if one wants to keep some logging). Or they can add only the few things that need to be set in the file itself (e.g. it might be easier to set the environment variables directly in the bash file, similarly to what the direct scheduler does, but leave the rest of the information on resources, wall time, ... for the `submit_from_job_template` function).
- Passing the `job_tmpl` to `submit_from_job_template`. Here is what I suggest. Currently, the `job_tmpl` object is generated in the `presubmit` method of the CalcJob, and then passed to the `get_submit_script` method of the scheduler. As I mentioned, I would keep this call (we need anyway to prepare the part of the script with the `mpirun` etc.). Then we need to store the content of this somewhere "inside" the CalcJobNode, but luckily this is already done a few lines below: the content is json-serialised and stored in the node repository, in the `.aiida/job_tmpl.json` file (this has been there from the very beginning of AiiDA). Since this is something that we control and is currently hardcoded, I think it is safe to assume it's always there (we might just want to add a comment in the code, to mention that the filename `.aiida/job_tmpl.json` should not be changed, as it is then used to parse this information back in the exec manager - so we don't forget about this).

So, in conclusion I would replace these lines https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L331-L334 with the following ones (untested, just to give an idea):
```python
workdir = calculation.get_remote_workdir()
if scheduler.get_feature('submit_via_script', True):
    submit_script_filename = calculation.get_option('submit_script_filename')
    job_id = scheduler.submit_from_script(workdir, submit_script_filename)
else:
    # ADD HERE CODE TO:
    # 1. get the file `.aiida/job_tmpl.json` from the repository of `calculation`
    # 2. serialise this back into a JobTemplate object job_tmpl
    job_id = scheduler.submit_from_job_template(workdir, job_tmpl)
calculation.set_job_id(job_id)
```
This requires changing the `get_feature` method of the scheduler to accept an additional optional parameter with the default value to return when the key is not present, instead of raising a KeyError (but this is easy).

If you agree that this is a good approach, and @zhubonan confirms that this would solve his problem (move his code from `submit_from_script` to the new function `submit_from_job_template`, and avoid any SSH connection and parsing but get the information he needs directly from the `job_tmpl`), then it should be very easy to implement (and of course we need to add a few lines of documentation of this new feature).
For @zhubonan - the important question is also if the `job_tmpl` contains all the information you need (but I would be surprised if this is not the case, since the information you parse should come directly from the `job_tmpl`).
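To make the proposal above concrete, here is a minimal, self-contained sketch of the dispatch. It does not use aiida-core: `SchedulerSketch`, `JobTemplateSketch`, the two subclasses and the `submit` helper are hypothetical stand-ins, the template fields are chosen only for illustration, and the JSON string plays the role of the content of `.aiida/job_tmpl.json`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class JobTemplateSketch:
    """Hypothetical stand-in for a job template (only a few illustrative fields)."""
    job_name: str
    num_machines: int
    max_wallclock_seconds: int


class SchedulerSketch:
    """Hypothetical stand-in for a scheduler plugin base class."""
    _features = {}

    def get_feature(self, name, default=None):
        # Return `default` for unknown features instead of raising KeyError.
        return self._features.get(name, default)

    def submit_from_script(self, workdir, submit_script_filename):
        raise NotImplementedError

    def submit_from_job_template(self, workdir, job_tmpl):
        raise NotImplementedError


class ScriptScheduler(SchedulerSketch):
    """Classic plugin: does not declare the feature, so the default (True) applies."""

    def submit_from_script(self, workdir, submit_script_filename):
        return f'script:{workdir}/{submit_script_filename}'


class TemplateScheduler(SchedulerSketch):
    """Plugin that opts out of script-based submission."""
    _features = {'submit_via_script': False}

    def submit_from_job_template(self, workdir, job_tmpl):
        return f'template:{job_tmpl.job_name}:{job_tmpl.num_machines}'


def submit(scheduler, workdir, script_name, stored_json):
    """Pick the submission path from the scheduler's declared feature."""
    if scheduler.get_feature('submit_via_script', True):
        return scheduler.submit_from_script(workdir, script_name)
    # Rebuild the template from its stored JSON form: no remote access, no script parsing.
    job_tmpl = JobTemplateSketch(**json.loads(stored_json))
    return scheduler.submit_from_job_template(workdir, job_tmpl)


stored = json.dumps(asdict(JobTemplateSketch('relax', 2, 3600)))
script_job_id = submit(ScriptScheduler(), '/scratch/run1', 'submit.sh', stored)
template_job_id = submit(TemplateScheduler(), '/scratch/run1', 'submit.sh', stored)
```

Existing plugins need no change in this sketch, because a missing `submit_via_script` key falls back to `True`.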
@giovannipizzi Thanks, I think what you suggest should work well! This would also make the plugin code more concise, as there is no need to do a round-trip of generating a "fake" script during upload and parsing it back in for submit.
I would like to add that the fireworks scheduler still uses the attached transport object for getting the computer and username as the identifiers:
so one won't get jobs of other machines/other accounts on the same machine (it is possible that the user has two accounts on the same machine, created as two separate `Computer` nodes). Getting this information does not require the transport to be opened. The proposed change would
E.g. these two lines below should be kept:
If we want the scheduler to work without the `Transport`, then I think the `Computer` itself needs to be passed as an argument, as it contains this information.
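As an illustration of why the computer and username matter as identifiers, here is a small sketch that filters jobs by the (hostname, username) pair without opening any transport. `ComputerInfo` and `QueueSketch` are hypothetical names invented for this example; they are not AiiDA or FireWorks classes, and the in-memory list merely stands in for a launchpad-like server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputerInfo:
    """Hypothetical, minimal view of a Computer: only what the scheduler needs."""
    label: str
    hostname: str
    username: str


class QueueSketch:
    """Hypothetical in-memory job store standing in for a remote job server."""

    def __init__(self):
        self._jobs = []

    def add(self, job_id, computer):
        # Tag every job with the identifiers of the computer it was submitted for.
        self._jobs.append((job_id, computer.hostname, computer.username))

    def jobs_for(self, computer):
        # Only return jobs that match both the machine and the account.
        wanted = (computer.hostname, computer.username)
        return [job_id for job_id, host, user in self._jobs if (host, user) == wanted]


# Two accounts on the same machine, configured as two separate computers.
alice = ComputerInfo('cluster-alice', 'cluster.example.org', 'alice')
bob = ComputerInfo('cluster-bob', 'cluster.example.org', 'bob')

queue = QueueSketch()
queue.add('job-1', alice)
queue.add('job-2', bob)
queue.add('job-3', alice)

alice_jobs = queue.jobs_for(alice)
bob_jobs = queue.jobs_for(bob)
```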
Note I'll probably be looking to do this, in conjunction with https://github.com/aiidateam/aiida-firecrest
I'm adding a comment to remember, when redesigning this, to take (at least) all the following use cases into account:
- `aiida-fireworks` by @zhubonan (and my comment above)
- `aiida-firecrest` developed by @chrisjsewell
- `aiida-hyperqueue` that @mbercx will soon start developing (at least one thing that I see: the option to specify the options to the scheduler not as lines in the submit script, but as command line arguments; I would still suggest writing those in the file, to know what has been asked and not lose this information)

A few things learnt from aiida-hyperqueue (@mbercx edit this comment if I'm forgetting something), see also aiidateam/aiida-hyperqueue#2
- Extend the `_get_submit_command` function, so that if the options need to be specified on the cmdline rather than in the header of the submission file, this can be easily implemented (however, for easier debugging and provenance, we should suggest putting the directives in the script if possible?)
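One possible shape for passing options on the command line while still recording them in the script is sketched below. Everything here is an assumption made for illustration: `build_submit_command` and `options_as_comments` are hypothetical helpers, and `mysubmit` with its `--key=value` flags is a made-up executable, not the command line of any real scheduler.

```python
import shlex


def build_submit_command(submit_script, options, executable='mysubmit'):
    """Build a shell-safe submit command with the options as command line arguments."""
    args = [executable]
    for key, value in options.items():
        args.append(f'--{key}={value}')
    args.append(submit_script)
    # shlex.join quotes every argument that needs it (e.g. values with spaces).
    return shlex.join(args)


def options_as_comments(options):
    """Render the same options as comment lines, to keep a record inside the script."""
    return '\n'.join(f'# {key}: {value}' for key, value in options.items())


options = {'nodes': 2, 'name': 'my job'}
command = build_submit_command('submit.sh', options)
header = options_as_comments(options)
```

Writing the comment block into the script keeps what was requested visible for debugging and provenance, even though the scheduler only reads the command line.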
Currently, aiida uses the same Transport plugin (e.g. direct or ssh) for uploading/downloading files to/from the Computer, as it does for communicating with a Scheduler via command executions (e.g. to poll for completed jobs).
There are potential use cases though where we may want separate methods to achieve these tasks; one example being that SLURM now has a REST API (https://slurm.schedmd.com/rest.html), and so maybe you want to upload files with SSH, then use the REST API to control SLURM.
This also came out of the meeting with @zhubonan, regarding a Fireworks scheduler
cc also @csadorf @giovannipizzi, correct me if I'm wrong in my takeaway?
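The separation described above can be sketched as two independent interfaces, one for moving files and one for talking to the scheduler. This is only an illustration of the idea: `FileTransport`, `SchedulerClient`, the two recording classes and `upload_and_submit` are hypothetical names, and no real SSH or REST call is made.

```python
from typing import List, Protocol, Tuple


class FileTransport(Protocol):
    """Anything that can copy a file to the remote machine (e.g. over SSH)."""

    def put(self, local_path: str, remote_path: str) -> None: ...


class SchedulerClient(Protocol):
    """Anything that can talk to the scheduler (e.g. via commands or a REST API)."""

    def submit(self, remote_script: str) -> str: ...


class RecordingFileTransport:
    """Test double that records uploads instead of opening a connection."""

    def __init__(self) -> None:
        self.uploads: List[Tuple[str, str]] = []

    def put(self, local_path: str, remote_path: str) -> None:
        self.uploads.append((local_path, remote_path))


class RecordingSchedulerClient:
    """Test double that records submissions and hands out sequential job ids."""

    def __init__(self) -> None:
        self.submitted: List[str] = []

    def submit(self, remote_script: str) -> str:
        self.submitted.append(remote_script)
        return f'job-{len(self.submitted)}'


def upload_and_submit(files: FileTransport, scheduler: SchedulerClient,
                      local_script: str, remote_script: str) -> str:
    """Upload through one channel, then submit through a possibly different one."""
    files.put(local_script, remote_script)
    return scheduler.submit(remote_script)


files = RecordingFileTransport()
client = RecordingSchedulerClient()
job_id = upload_and_submit(files, client, 'submit.sh', '/scratch/run1/submit.sh')
```

Because the caller depends only on the two protocols, either side could be swapped independently in this sketch.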