LuiggiTenorioK opened 10 months ago
changed due date to June 30, 2024
In GitLab by @mcastril on Jan 5, 2024, 19:42
Thank you for the documentation, Luiggi. Regarding the infrastructure cases and action procedures, we have to keep the portability and interoperability of Autosubmit and its API. In any case, the ES environment, EDITO Infra, and the Climate DT one are different enough that we should provide sufficiently general specifications for a system that must be compliant with all three environments.
In GitLab by @kinow on Jan 18, 2024, 10:23
From today's meeting:
In GitLab by @mcastril on Jan 18, 2024, 17:17
I agree with the plan for EDITO. In broad terms, these are the summarized requirements:
- `run`
- `setstatus`
- `stop` (there is an issue in Autosubmit about the implementation of a new command)
- `create`
- `expid`
- `recovery`

`setstatus` can be triggered with a file modification: https://autosubmit.readthedocs.io/en/master/userguide/manage/index.html#how-to-change-the-job-status-without-stopping-autosubmit
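For illustration, this is the general shape of that file-based trigger. The experiments path, file name, and line format below are assumptions; the linked documentation is the authoritative reference:

```python
from pathlib import Path

# All names here are assumptions to verify against the linked documentation.
experiments_root = Path("/appl/autosubmit")   # hypothetical experiments path
expid = "a000"                                # hypothetical experiment id
trigger = experiments_root / expid / "pkl" / f"updated_list_{expid}.txt"

# One "<job_name> <new_status>" pair per line; Autosubmit picks the file up
# on a later iteration without having to be stopped.
trigger.write_text("a000_20200101_fc0_1_SIM COMPLETED\n")
```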
For `stop` and `recovery`, we could use the same approach if we implement the same behavior in Autosubmit (by using files).

For `run`, `create`, or `expid` it is trickier, as Autosubmit is not running to look for and consume the file. One alternative is to deploy a daemon that looks for these files and spawns an Autosubmit process, but then we could end up with the same issue: the API has to start a process under the user's identity.
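For illustration, a rough sketch of what such a daemon could look like; the trigger directory, file naming, and set of actions are hypothetical, not an existing Autosubmit convention:

```python
import subprocess
import time
from pathlib import Path

# Hypothetical location where the API would drop trigger files such as
# "a000.run" or "a000.create"; not an actual Autosubmit convention.
TRIGGER_DIR = Path("/shared/autosubmit/triggers")
# `expid` would need special handling, since it creates a new experiment
# instead of acting on an existing one.
SUPPORTED_ACTIONS = {"run", "create"}


def poll_triggers(interval_seconds: int = 30) -> None:
    """Poll the trigger directory and spawn one Autosubmit process per file."""
    while True:
        for trigger in TRIGGER_DIR.glob("*.*"):
            expid, action = trigger.stem, trigger.suffix.lstrip(".")
            if action in SUPPORTED_ACTIONS:
                # The daemon spawns the process under its own identity, which is
                # exactly the user-identity problem mentioned above.
                subprocess.Popen(["autosubmit", action, expid])
            trigger.unlink()  # consume the trigger file
        time.sleep(interval_seconds)


if __name__ == "__main__":
    poll_triggers()
```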
In GitLab by @kinow on Jan 22, 2024, 10:37
mentioned in issue autosubmitreact#90
Going back to this. I opened an issue (#58) with a design that could handle the `run` and `stop` operations without deploying a daemon, by having a higher-level API that maps the nodes executing those processes.

But, in that design, I assume that the current API will call the Autosubmit command `autosubmit run` using its latest version, in an independent process. This is something that wasn't done before, as Autosubmit wasn't necessarily installed on the same node as the API and they were connected only through the file system.
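To make the question more concrete, this is roughly what I have in mind on the API side, sketched with the standard library only. It assumes Autosubmit is installed in the API environment and that the API user owns the experiment; error handling and authentication are left out:

```python
import subprocess


def launch_experiment(expid: str) -> int:
    """Spawn `autosubmit run <expid>` detached from the API worker process.

    Assumes the `autosubmit` CLI installed next to the API is compatible with
    the experiment (version handling is discussed below).
    """
    process = subprocess.Popen(
        ["autosubmit", "run", expid],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach so the HTTP request can return immediately
    )
    return process.pid  # the API could store this to report status later
```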
@kinow @dbeltrankyl I wanted to ask if this is a feasible strategy or if I'm missing a potential issue by calling Autosubmit CLI commands from the API environment.
In GitLab by @kinow on Jan 23, 2024, 17:47
I am not sure if that would work well. There are potential issues with the Autosubmit version: e.g. we changed the pickle or configuration parsing, and now an experiment needs adjusting before it can be used with the latest version, so we need to know which version of Autosubmit to launch it with. I think we will really need a few sessions at the whiteboard to discuss possible scenarios, like: the user was deleted or left the company; the experiment was archived (maybe we still want to show it in the UI and unarchive it?); how/if it will handle restarting experiments of others; etc.
After using the whiteboard it should be clearer (at least for me) what the limitations are and how this should work.
In GitLab by @mcastril on Jan 31, 2024, 18:56
You are right that there are many aspects to consider.
Maybe we can separate the problem into two parts: the "interactive" endpoints and synchronizing remote environments. The second is interesting for many reasons, not only this one but also our medium-term goal of synchronizing workflows running in independent environments and setting dependencies between their tasks.
The daemon issue, for me, is independent of the higher-level API. The daemon was a way to allow interaction with AS by just writing files. If the API can call Autosubmit commands then the daemon is not needed anyway, but this is independent from the synchronization IMO.
Regarding the Autosubmit version, at least we store this value in the DDBB and in the config, and Autosubmit alerts the user when they intend to run an experiment with a different version. We can port that feature to the GUI/API and then directly run the experiment with `-v` if the user approves the version change.
Right! There are different problems to solve. Adding the interactive endpoints is a must for sure. On the other hand, we have to find a way to handle experiments of different versions in different environments.
For the different versions issue, I think Miguel is right that we can use the `-v` flag to solve it.
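For illustration, a rough sketch of how that check could look on the API side. It assumes the stored version has already been fetched from the DDBB/experiment config, and that `-v` keeps the semantics Miguel described; both are assumptions to confirm:

```python
import subprocess
from importlib.metadata import version


def run_with_version_check(expid: str, stored_version: str, user_approved: bool) -> None:
    """Run an experiment, appending -v only after the user approves a version change.

    `stored_version` is assumed to come from the DDBB / experiment config;
    how it is fetched is left out of this sketch.
    """
    installed = version("autosubmit")  # version available in the API environment
    cmd = ["autosubmit", "run", expid]
    if installed != stored_version:
        if not user_approved:
            # Mirror the CLI behaviour: alert instead of silently changing the version.
            raise RuntimeError(
                f"Experiment {expid} was created with Autosubmit {stored_version}, "
                f"but {installed} is installed; ask the user to confirm."
            )
        cmd.append("-v")  # assumed to accept the version change, as discussed above
    subprocess.run(cmd, check=True)
```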
For the different environments issue, there are two options: daemon-based synchronization and the higher-level API. IMO the synchronization solution is way more complex, especially considering the different versions of the experiments. So a higher-level API will work better for EDITO, as it will only need one service from which the requests will be made (SURF).
This higher-level API solution is inspired by a project used by the most popular workflow manager in bioinformatics (https://galaxyproject.org/), which uses a similar lower-level API called Pulsar to solve the same problem we have.
Our problem is more or less stated here: https://pulsar.readthedocs.io/en/latest/containers.html
(I remember stating similar issues in my Master's thesis)
From today's meeting:
In GitLab by @kinow on Feb 2, 2024, 10:38
Thank you for attaching it here, @LuiggiTenorioK !
@mcastril, should one of us get in touch with EDITO/SURF to schedule a meeting to discuss this? If so, maybe it would be Quentin/Renaud, and Francesco from CMCC, via email, explaining what we want to discuss and asking for the best time/day for the meeting? Thank you
In GitLab by @mcastril on Feb 7, 2024, 12:33
Yes Bruno, thanks for volunteering. Please address Renaud, Quentin and Francesco together, with us in copy.
In GitLab by @kinow on Feb 13, 2024, 11:57
Meeting doodle poll sent.
Summarizing what we discussed in the meeting.
In GitLab by @kinow on Feb 29, 2024, 11:26
Do we have a list of requirements, or of how the SURF GUI will interact with the API? This would be useful to validate the endpoints we will have to implement.
- Use the API to request the status of the experiment that is running. We can build a list of experiments the user is running, and show how long the experiment is taking, resources used...
- When the user defines what they want to submit, at that point they submit the list of tasks and the run of the job starts.
- Users must be able to restart from a certain point in the workflow (`setstatus`).

Q: Can we get the list of the `N` last experiments?
Yes, but the deployment option could define how it works.
Possible endpoints required:
In GitLab by @kinow on Feb 29, 2024, 11:52
Are we going to have a single API instance, shared by all the EDITO-Infra users, or are we going to have one API per user? Or both?
Both ways are possible, but it's on the business side to choose.
We need to choose between Process and Service (in EDITO Infra).
In a Service you can launch multiple tools (GUI + AS API + Surf API + etc).
Not tested, but it should be possible to have dependencies between Services.
In Datalab we have "Projects". We could create the project "Edito ModelLab". You can create an instance with members of the project. You can also share the URL of the project with people outside the project.
If we have instances/containers for the API shared by users, which user could we use to run the API? Would it have access to S3 and to an SSH key to connect to HPC or other EDITO-Infra instances?
This needs to be tested to confirm. It should be possible to share the instance so others can manage it too.
How are the API and GUI instances going to be started? By a user action, or will Kubernetes keep a minimum number of pods running?
In the project we can have services that are always available. At the moment, services older than 2 weeks are killed, but this may change in the future.
Q: Who maintains the infra (e.g. if a service goes down)?
Suggestion: use replication (pods, etc.) to have more resources, and configure the Helm charts, etc., for higher availability.
Q: Can the SURF GUI use the API?
In GitLab by @kinow on Feb 29, 2024, 12:00
Q: Can I deploy to another service/catalogue/env?
At the moment, merge requests go to production. Staging is for EDITO-Infra. The EDITO-Infra team is working to give others access to the playground catalogue/env.
N.B.: The BSC team will use the playground. Then, later, we will ask for it to be moved to the modellab/ai/etc. catalogue.
We need to define how/if we will use a shared file system for Autosubmit experiments. For the demo we used S3, but that's not really an option for Autosubmit (we use NFS at ClimateDT and BSC). Maybe we could allocate a persistent volume to be bound to each Autosubmit container (with enough storage for the experiments? GBs, TBs?)
At the moment this is not doable. But that should be possible under the common modellab project. So all users under that project can target that volume there. We can test that after the modellab is created.
Q: Are we going to give external users access to this database (on the shared Docker volume)?
...
In GitLab by @kinow on Feb 29, 2024, 12:04
Action: we also need to define who the users of the modellab are. At the moment anyone can request an EDITO account. The Catalogue Service view is available to unauthenticated users.
changed due date to December 31, 2024
Summarizing the discussion about this year's goal: we are aiming to add more interactive endpoints to the API that will allow users to trigger actions that modify the state of the experiments. To accomplish this, there are some issues we have to solve:
Define the scope
We need to list the requirements to better structure the changes we want to make. In this particular case, we can list the actions (run experiment, update description, change status, etc.) that we want to include in the API. Then, we can make a formal endpoint definition in OpenAPI with the route, expected request, and response.
Also, this will help us link this work to the effort needed in other tasks (DDBB sync, security, communication with Autosubmit, ...).
[UPDATE] Mapped actions until now:

- `POST /v4/experiments/<expid>?action=run` -> `start`
- `POST /v4/experiments/<expid>?action=stop` -> `stop`
- `PATCH /v4/experiments/<expid>/jobs/<jobid>?status=<newstatus>` -> `setstatus`
- `POST /v4/experiments` -> `expid`
- `POST /v4/experiments/<expid>?action=generate` -> `create`
- `POST /v4/experiments/<expid>?action=restart` -> `recovery`
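For illustration, a sketch of how the `action` dispatch for these routes could look. FastAPI is used here purely as an example and may not match the framework the API actually uses, and `dispatch_autosubmit` is a placeholder for whichever action procedure we pick below:

```python
import subprocess
import uuid

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Illustrative mapping from the `action` query parameter to the Autosubmit
# command behind it (the list above labels the run action `start`).
ACTION_TO_COMMAND = {
    "run": "run",
    "stop": "stop",
    "generate": "create",
    "restart": "recovery",
}


def dispatch_autosubmit(command: str, expid: str) -> str:
    """Placeholder dispatcher: spawn the CLI directly.

    In practice this could instead write a trigger file or forward the request
    to a higher-level API node, as discussed earlier in this thread.
    """
    subprocess.Popen(["autosubmit", command, expid], start_new_session=True)
    return str(uuid.uuid4())  # opaque id the GUI could poll later


@app.post("/v4/experiments/{expid}")
def experiment_action(expid: str, action: str):
    """Dispatch POST /v4/experiments/<expid>?action=... to an Autosubmit command."""
    command = ACTION_TO_COMMAND.get(action)
    if command is None:
        raise HTTPException(status_code=400, detail=f"Unknown action '{action}'")
    dispatch_id = dispatch_autosubmit(command, expid)
    return {"expid": expid, "action": action, "dispatch_id": dispatch_id}
```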
Set some infrastructure cases
There are multiple scenarios in which the API will be installed, such as ES, Climate DT, and EDITO. It is important that we formally define them to better understand the boundaries (security, network, dependencies) we are going to have in each one.
Define the action procedures
As discussed, there are some options for processing the actions we want to include in the API:
@mcastril @kinow