aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
414 stars 185 forks source link

De-couple file upload/download from scheduler communication methods #5084

Open chrisjsewell opened 2 years ago

chrisjsewell commented 2 years ago

Currently, aiida uses the same Transport plugin (e.g. direct or ssh) for uploading/downloading files to/from the Computer, as it does for communicating with a Scheduler via command executions (e.g. to poll for completed jobs).

There are potential use cases though where we may what separate methods to achieve these tasks; one example being that SLURM now has a REST API (https://slurm.schedmd.com/rest.html), and so maybe you want to upload files with SSH, then use the REST API to control SLURM.

This also came out of the meeting with @zhubonan, regarding a Fireworks scheduler

cc also @csadorf @giovannipizzi, correct me if I'm wrong in my takeaway?

csadorf commented 2 years ago

I think it would generally be a good idea to conceptually separate these two tasks. What is your estimate as to how much refactoring and restructuring that would require? Also, it might be worthwhile to concretize this in an AEP.

zhubonan commented 2 years ago

For the fireworks scheduler, the simple solution for this would be to expose the CalcJobNode to the Scheduler so the job size information can be retrieved this way.

At the moment, the scheduler had to download the submission script from the remote, parse it and submit the job to the launchpad server.

Below is the line for the current implementation:

https://github.com/zhubonan/aiida-fireworks-scheduler/blob/cf70259b557a895e35019ef17e2cfaeae6df4358/aiida_fireworks_scheduler/fwscheduler.py#L171-L206

chrisjsewell commented 2 years ago

What is your estimate as to how much refactoring and restructuring that would require?

Well let's say for now, I think it's plausible, but certainly not trivial. At least a few weeks of labour, so not likely to happen in the short-term 😬

giovannipizzi commented 2 years ago

Thanks @chrisjsewell for starting this issue (and @zhubonan for your presentation on the aiida-fireworks-scheduler!)

Here a few thoughts:

giovannipizzi commented 2 years ago

So, here is what I think is a simple way to implement this (at least to solve the issue of @zhubonan where, during the submission, he needs to SSH to the scheduler just to fetch back the script, and he has to parse it back):

If you agree that this is a good approach, and @zhubonan confirms that this would solve his problem (move his code from submit_from_script to the new function submit_from_job_template, and avoid any SSH connection and parsing but get the information he needs directly from the job_tmpl), then it should be very easy to implement (and of course we need to add a few lines of documentation of this new feature). For @zhubonan - the important question is also if the job_tmpl contains all the information you need (but I would be surprised if this is not the case, since the information you parse should come directly from the job_tmpl).

zhubonan commented 2 years ago

@giovannipizzi Thanks, I think what you suggest should work well! - This would also make the plugin code more concise as there is no need to do a round-trip of generating a "fake" script during upload and parse it back in for submit

I would like to add that the firework scheduler still uses the transport object attached for getting the computer and username as the identifiers:

https://github.com/zhubonan/aiida-fireworks-scheduler/blob/cf70259b557a895e35019ef17e2cfaeae6df4358/aiida_fireworks_scheduler/fwscheduler.py#L171-L206

so one won't get jobs of other machines/other accounts in the same machine (it is possible that the user can have two accounts of the same machine, created as two separate Computer nodes). Getting this information does not require the transport being opened. The the proposed change would

E.g. these two lines below should be kept:

https://github.com/aiidateam/aiida-core/blob/287d1385884c1c77a942d4945506c406ee98acde/aiida/engine/daemon/execmanager.py#L328-L329

If we want the scheduler to work without the Transport then I think the Computer itself needs to be passed as an argument as it contains this information.

chrisjsewell commented 2 years ago

Note I'll probably be looking to do this, in conjunction with https://github.com/aiidateam/aiida-firecrest

giovannipizzi commented 2 years ago

I'm adding a comment to remember, when redesigning this, to take (at least) all the following use cases into account:

giovannipizzi commented 2 years ago

A few things learnt from aiida-hyperqueue (@mbercx edit this comment if I'm forgetting something), see also aiidateam/aiida-hyperqueue#2