OpenFreeEnergy / alchemiscale-fah

protocols and compute service for using alchemiscale with Folding@Home
MIT License

FAHComputeService for executing FAH-based Protocols via a Folding@Home work server #1

Closed — dotsdl closed this issue 3 months ago

dotsdl commented 1 year ago

Implement a FAH-oriented compute service that utilizes a Folding@Home work server to execute the simulation ProtocolUnits of ProtocolDAGs produced by the FAH-specific protocols implemented in this library.

This compute service should efficiently execute multiple Tasks at once, perhaps with a combination of ProcessPoolExecutors and asyncio, awaiting results from the work server and processing them as they arrive.
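
A rough sketch of the concurrency pattern I have in mind, combining asyncio with a ProcessPoolExecutor (all names and helpers here are placeholders, not a settled API):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def process_results(raw_result: bytes) -> dict:
    # placeholder for CPU-bound result processing of a returned work unit;
    # in the real service this would operate on gufe/alchemiscale objects
    return {"size": len(raw_result)}

async def poll_work_server(task_id: str) -> bytes:
    # stand-in for awaiting the FAH work server; a real implementation would
    # poll its HTTP API until the work unit's results are available
    await asyncio.sleep(1.0)
    return b"..."

async def execute_task(pool: ProcessPoolExecutor, task_id: str) -> dict:
    raw = await poll_work_server(task_id)
    loop = asyncio.get_running_loop()
    # CPU-bound processing happens off the event loop, in the process pool
    return await loop.run_in_executor(pool, process_results, raw)

async def main(task_ids: list[str]) -> None:
    with ProcessPoolExecutor() as pool:
        coros = [execute_task(pool, tid) for tid in task_ids]
        # handle results as they arrive, not in submission order
        for fut in asyncio.as_completed(coros):
            print(await fut)

if __name__ == "__main__":
    asyncio.run(main(["task-0", "task-1", "task-2"]))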

dotsdl commented 1 year ago

Evaluating options for alchemiscale-fah execution.

  1. Each Transformation corresponds to a RUN, each Task corresponds to a CLONE, with extended Tasks only supported in chains corresponding to GENs
    • works directly within the FAH model, offering some efficiency in running many sampling repeats for a given Transformation
    • requires strict constraints on the form DAGs can take, in particular that there can be only one SimulationUnit, since this corresponds to a CLONE and we want 1:1 matching of Tasks to CLONEs to allow extension via GENs
    • restricts extension of a Task to one other Task only (forming a chain); multiple extensions from the same Task (forming a tree) not supportable under this model, since GENs are linear
    • requires some concurrency bookkeeping in compute service to handle RUN/CLONE creation
    • would perform some unorthodox skipping of DAG units for Tasks that correspond to CLONEs on an existing RUN
    • traditional analysis tools running on work server would be able to make sense of the data structured there
  2. Each Task corresponds to a separate RUN, regardless of extension
    • conceptually far simpler
    • does mean that RUNs rapidly proliferate within PROJECTs; will need to ask Joseph if this causes problems for FAH
    • doesn't touch CLONEs or GENs, except to create 0th ones for each RUN
    • requires no weird gymnastics on DAG execution, no coordination of DAG execution within RUNs, etc.
    • allows for full usage of alchemiscale Task model as usual, no restriction on extensions (trees of extensions work)
    • does still restrict on DAG structure to only 1 SimulationUnit
  3. Each SimulationUnit corresponds to a separate RUN
    • like (2) but goes further to allow multiple SimulationUnits in a DAG
    • (2) is the degenerate form of this option for DAGs that feature only one SimulationUnit
    • if the number of RUNs in a project grows too large, we may need to switch over to another, newly created project

Pursuing option (3) for now, since this affords us the greatest flexibility and is conceptually simplest (and possibly simplest in implementation). If we encounter problems, then we can pursue other options.

Thoughts on this @jchodera? Does option (3) present obvious problems to you?

jchodera commented 1 year ago

@dotsdl : You'll want to check with Joseph Coffland on the maximum number of RUNs, CLONEs, and GENs. I suspect these are 16-bit integers, so you are limited to 32767 (or possibly 65535 if they are actually unsigned) of each. This means you may want to choose a variant of Option 3 where you use both RUNs and CLONEs, giving you roughly 2^32 worth of actual tasks. In fact, if you use GENs as well, this gives you roughly 2^48 to play with.
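
As a rough sketch of the index arithmetic this implies (assuming unsigned 16-bit indices, which still needs confirming with Joseph):

MAX_INDEX = 2**16  # assumed per-field limit; confirm actual RUN/CLONE/GEN limits

def flat_to_rcg(task_index: int) -> tuple[int, int, int]:
    # map a flat job number onto (run, clone, gen) indices
    run, rest = divmod(task_index, MAX_INDEX * MAX_INDEX)
    clone, gen = divmod(rest, MAX_INDEX)
    return run, clone, gen

# e.g. flat_to_rcg(2**32) == (1, 0, 0); flat_to_rcg(65537) == (0, 1, 1)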

There is no real need to hew to the traditional conception of CLONEs as being different replicates---each CLONE can still be a completely different (System, State, Integrator) tuple. In principle, each RUN can as well.

We'll still want to have a variety of PROJECTs set up with different system sizes so we don't make volunteers too upset by run-to-run variation. We can start with a single PROJECT for debugging, but then perhaps set up a group of 10 projects of varying size to use.

Happy to chat more online to help navigate the choices here.

jchodera commented 1 year ago

One other consideration may be that it might be inconvenient to end up with 65536 files in each directory if you make everything a RUN. In fact, I'm not even sure ext3 supports that. So it might be useful to spread things out between at least RUNs and CLONEs.

dotsdl commented 1 year ago

Thank you for these insights @jchodera! We may be able to take advantage of CLONEs then to avoid filesystem issues, and also to give ourselves a larger space for jobs within a project. I'll consider this in my approach.

I have several new questions as I'm moving forward in #7:

  1. if we wanted to dynamically/algorithmically create projects on a given work server (perhaps given a range of available PROJECT IDs), how would you do it?
    • what is the range of atom numbers RUNs in a given PROJECT should have given its nominal atom count value?
    • what manual steps are required here, from e.g. Joseph, when a project is created?
    • do we need to populate the project.xml ourselves, or is this done by the adaptive sampling API? This is a question for @jcoffland.
  2. in the openmm-core do nonequilibrium works get written to globals.csv?
  3. is it necessary to restart the fah-work service to get it to see new PROJECTs?
jchodera commented 1 year ago
  1. Project handling:
    • Questions about how to create a new project RESTfully should probably be directed to Joseph. I thought there was a spec document that described all of the available API commands, but I was under the impression it focused on starting and stopping RUNs/CLONEs/GENs. The decision not to allow API-based PROJECT creation may be a safety one---because project IDs appear to be limited to a 16-bit integer, I think we are fundamentally limited to 65536 projects in the lifetime of Folding@home, meaning a rogue script could potentially end FAH forever in a few seconds by using up all available projects. A better approach may be to automate the setup of a limited number of PROJECTs with a script and allow the executor access to those projects, which limits the blast radius. If something goes wrong, the project's RUNS/CLONES/GENS can much more easily be wiped (another Joseph question) and relaunched from the same server.
    • We might want 10 or 20 projects that span the range of system sizes we are interested in, assuming we have roughly the same nonbonded settings for everything. That would hopefully mean that the systems within each project are close enough in performance that the points distribution will not be too crazily broad within a project.
    • For testing, I'd maybe use 1-2 projects
  2. If you're using a nonequilibrium cycling integrator, which stores the accumulated work values in the integrator global values, the nonequilibrium work values are accumulated in globals.csv. I can't recall if this path is hard-coded or not.
  3. Yes, you need to restart fah-work to see a new project that has been added to the filesystem.

It would be good to work with @sukritsingh, who knows a whole lot about how this stuff works and can help resolve questions and debug things.

sukritsingh commented 1 year ago

Happy to help in any way I can! Just building off of John's comments about the questions asked:

  1. I don't think it's noted here but just for the record the RESTful API for setting up projects is documented by Joseph in a google doc. There does appear to be a PUT API endpoint:
    Method: PUT
    Send: ProjectData
    Create a new project or update an existing project's configuration.

    This would probably be the "best" way to programmatically add projects, runs, clones, etc. However, like John said it would be best to check with Joseph if you intend on adding many multiples of projects in case there are integer limitations.

what is the range of atom numbers RUNs in a given PROJECT should have given its nominal atom count value?

  1. Like John said above, the atom numbers for a single PROJECT's RUNs should be as similar as possible. From experience, I've found that for a given protein target system + box, variations in ligand don't affect benchmarking points as much as box size or target size do. This is all very uncharted territory, so some benchmarking may be required!

what manual steps are required here, from e.g. Joseph, when a project is created?

  1. Ultimately, I think the only manual steps needed when using the API would be restarting the work server with sudo service fah-work restart and editing the constraint settings on the Assignment Server (i.e. for internal and beta testing).

I know the API has some endpoints for ASProjectData that allow you to set constraints directly, though I have no experience with them myself; Joseph may have helpful advice here! I think the intention is to allow us to modify project constraints without needing to log into the Assignment Server.

the nonequilibrium work values are accumulated in globals.csv. I can't recall if this path is hard-coded or not.

  1. I think for the moment these are hardcoded in the sense that the work values are stored within the integrator object as a global parameter, and written out as part of the integrator's globals.csv. We'd need to alter the nonequilibrium integrator to have it write out elsewhere.
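
As a rough sketch, pulling the final accumulated work out of a returned globals.csv might look like this (the column name protocol_work is just an assumption; it depends on how the integrator names its global variable):

import csv

def final_protocol_work(globals_csv: str, column: str = "protocol_work") -> float:
    # globals.csv holds one row per reporting interval of the integrator's
    # global variables; the accumulated nonequilibrium work is the last value
    # in the relevant column (column name here is an assumption)
    with open(globals_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    return float(rows[-1][column])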

There was also briefly a discussion about including a way to get custom Reporter or CustomCVForce outputs from the core, which may help here with appropriate changes to the integrator, state, and system, but I believe those are still in development on the OpenMM-core side.

dotsdl commented 1 year ago

Thank you both! This has been very helpful in resolving our approach.

alchemiscale reference terms:

Here is my current proposal for how we will interface the alchemiscale/gufe execution model with the Folding@Home execution model:

Please point out any problems you see with this approach, or any invalid assumptions.

I also have several questions related to the above:

  1. @sukritsingh: when we create a project, what does it mean to set the runs, clones, and gens counts in the project.xml? If we are going to be telling the WS to create new CLONEs for specific RUNs via the API, it doesn't make sense to me what these counts are even for.
  2. @jcoffland: I noticed that the ProjectData model in the adaptive sampling API doc doesn't include fields like send or create-command found in a project.xml. How should we populate these if we create a project using the API?
  3. @jchodera: regarding nonbonded settings, since this influences performance (and therefore volunteer points) alongside atom counts, how best might we encode this information in the project?
    • perhaps an additional file with this info, but what would be most relevant to include there?
  4. @jcoffland: are we actually able to delete old RUNS/CLONES/GENS? What's the cleanest way to do this if so? The WS API allows for file deletion, but doesn't appear to allow for deletion of these entities in an explicit way.
  5. @jcoffland: what would be the proper way to create a new RUN in a PROJECT using the adaptive sampling API? There is only one WS endpoint for RUNs, and it is used to apply a JobAction to them; presumably they must exist first?
  6. @jcoffland: is it possible for CLONEs to have their files populated on creation via the WS API, or must these files be populated by the create-command defined in the project.xml? Is it problematic for us to create a CLONE and then populate files in it?
  7. @jcoffland: is it possible to imperatively create GENs using the adaptive sampling API? Or is GEN creation strictly defined by the next-gen-command in the project.xml?
sukritsingh commented 1 year ago

when we create a project, what does it mean to set the runs, clones, and gens counts in the project.xml?

So the PROJECT, RUN, CLONE, GEN (PRCG) system is a way to track and identify any individual work unit in the FAH ecosystem. When you are setting up a project.xml you are setting up how many possible RUNs you have, how many CLONEs each RUN should have, and how many GENs each CLONE consists of. This is a hierarchical index: each GEN is a single work unit, multiple GENs (i.e. generations) are stitched together to form a single CLONE, and a set of CLONEs makes up a single RUN.

A practical example I often use: if I were running an equilibrium MD simulation dataset, I would set up a single PROJECT, where each RUN corresponds to a unique starting structure/pdb file/configuration of some kind. These are generally all meant to have the same or a similar number of atoms, but may have small variations in conformations, ligand placement, etc. The runs value in project.xml tells the WS how many RUNs there are.

Each CLONE would then be a unique initialization of the corresponding RUN (i.e. unique initial velocities). The number of unique CLONEs per RUN is specified by the clones value in project.xml.

Each CLONE then has a latest GEN, which is the start of a fresh work unit (i.e. a trajectory segment) sent to a client; the client runs the complete trajectory segment, sends it back, and the result becomes the next GEN for the same CLONE. The gens value in project.xml thus specifies how long each CLONE will be.

I think at some point I had given a group meeting presentation on this to the chodera lab, if you are able to find it! I don't think I can link it here publicly at the moment, but I'll check!

If we are going to be telling the WS to create new CLONEs for specific RUNs via the API, it doesn't make sense to me what these counts are even for.

Telling the WS to create new CLONEs means telling it to add more trajectories for a specific RUN (i.e. in a case where you need more statistics). If you just update the project.xml, it will blindly add more CLONEs to all the RUNs, so I'm not entirely sure how to add more CLONEs for a particular RUN except by deleting/migrating the raw data files out of the relevant data folder so that the WS thinks those CLONEs need to be generated.

Some thoughts and clarifications I'm curious about:

use the adaptive sampling API to perform its interactions with the WS and the AS, avoiding the need to restart the WS after PROJECT creation operations

I would double-check whether the API really does not require you to restart the WS (by restart I just mean restarting the WS service, i.e. running sudo service fah-work restart) when adding RUNs or editing CLONEs. I was under the (false or outdated) impression that the API edits project.xml as needed when adding CLONEs, and so ultimately the service would need restarting (although it's a single command, so that may be trivial to solve anyway).

RUNs will be created within PROJECTs as needed, on-demand, without requiring the WS service to restart, when a Transformation is encountered for the first time by the WS

Just making sure we're clear on the terminology translating between FAH and alchemiscale: could you also clarify what a ProtocolUnit would be? In free energy calculation terms, assuming this would be running non-equilibrium cycling for now (since RepEx is harder), a single work unit has usually been a single cycle. Would a single ProtocolUnit then just be a single work unit running a single cycle? That would mean each CLONE only ever has a single GEN.

In the context of a single free energy calculation dataset, I'm imagining CLONEs being just additional switching cycles for a single RUN/project (with each GEN identifying one of the unique cycles, as mentioned above).

upon a CLONE being deleted, its ID within its RUN can be reused

I'm assuming by deleted you mean that you would be removing the files from the data directory entirely? That's the only way I'm aware of that tells the WS that the CLONE ID is empty.

dotsdl commented 1 year ago

@sukritsingh and I met on Friday, 10/27, and discussed many of the points above. We agreed that there may be several functionality gaps in the adaptive sampling API to enable my proposed operation model above. We will seek to discuss directly with @jcoffland to see what is possible and report back here.

dotsdl commented 1 year ago

More detailed notes from meeting with @sukritsingh last week:

  1. We want to completely avoid the need to restart the fah-work service after performing operations against its API. This should in theory be possible: the fah-work service shouldn't need to perform filesystem scans to understand what changed, since it made those changes itself. We don't want the FahAsynchronousComputeService to need to talk to systemd on the host to repeatedly kick another service.
  2. Being able to create CLONEs and GENs imperatively on a per-RUN basis would be useful to @sukritsingh outside of the context of alchemiscale.
  3. Being able to obtain an API cert with an expiry of more than a month would be ideal once we have things up and running, since this is a manual process currently.
hmacdope commented 1 year ago

More detailed notes from meeting with @sukritsingh last week:

  1. We want to completely avoid the need to restart the fah-work service after performing operations against its API. This should in theory be possible: the fah-work service shouldn't need to perform filesystem scans to understand what changed, since it made those changes itself. We don't want the FahAsynchronousComputeService to need to talk to systemd on the host to repeatedly kick another service.

From my understanding the AS can pick up changes in CLONEs, RUNs, and GENs automatically if you edit the corresponding project.xml or filesystem without a service restart, IIRC (could be wrong). Just thought I would add observations from recent projects I set up.

jcoffland commented 1 year ago
  1. @jcoffland: I noticed that the ProjectData model in the adaptive sampling API doc doesn't include fields like send or create-command found in a project.xml. How should we populate these if we create a project using the API?

This is intentional. For security reasons the API is not able to set any options that could be used for arbitrary remote command execution. You can set defaults for all projects per core in the WS' /etc/fah-work/config.xml.

  1. @jcoffland: are we actually able to delete old RUNS/CLONES/GENS? What's the cleanest way to do this if so? The WS API allows for file deletion, but doesn't appear to allow for deletion of these entities in an explicit way.

You can delete an entire project. You can "restart" a CLONE or all the CLONEs in a RUN. This will not delete the files immediately, but they will be replaced. I could add a delete CLONE/RUN API endpoint if you need it. Deleting each file would be tedious.

  1. @jcoffland: what would be the proper way to create a new RUN in a PROJECT using the adaptive sampling API? There is only one WS endpoint for RUNs, and it is used to apply a JobAction to them; presumably they must exist first?

Runs do not need to exist before applying the "create" action. This takes parameters clones and offset: clones is the number of clones to create, and offset is the number of the first clone to create. This will cause create-command to be called for each of these clones, so the files must already exist on the WS.

  1. @jcoffland: is it possible for CLONEs to have their files populated on creation via the WS API, or must these files be populated by the create-command defined in the project.xml? Is it problematic for us to create a CLONE and then populate files in it?

The files needed by create-command must be uploaded to the WS before the clone is created. Once the clone is created it could start assigning right away if the project is activated. There is currently no way to create a job and put it immediately into the "stopped" state.

  1. @jcoffland: is it possible to imperatively create GENs using the adaptive sampling API? Or is GEN creation strictly defined by the next-gen-command in the project.xml?

The next-gen-command is run on any WU that is returned if there are more gens in the trajectory. All the files needed by this command should be in place.

Are you looking for a job queuing system that works something like this?

You could treat the WS this way, but then we are shoehorning a more basic queuing system into F@H's traditional RUN/CLONE/GEN system. Also, downloading, analyzing, and then reuploading the data for each WU is costly. It would be most effective if the bulk of the data analysis could be performed on the WS itself or even on F@H clients.

How often do you need to analyze result data? After every gen?

jcoffland commented 1 year ago

do we need to populate the project.xml ourselves, or is this done by the adaptive sampling API? This is a question for @jcoffland.

project.xml is populated automatically, except that you can/should set defaults via /etc/fah-work/config.xml.

jcoffland commented 1 year ago

when a CLONE finishes, and after its results have been processed, its results can be archived elsewhere and it can be deleted from the WS ... RUN IDs of deleted RUNs can be reused by new RUNs within a PROJECT for newly-encountered Transformations

This will not work. A particular PRCG should exist only once in F@H. A PRCG should only be credited once. Note that there is a max of 2^16 CLONEs x 2^16 RUNs, so you probably want to use GENs.

You could use test-command and return the special value 1004 (stop job) to cause every WU to stop processing and not run next-gen-command. However, this only works correctly in the next WS v10.3.5.

jcoffland commented 1 year ago

I would double check if the API does not require you to restart the WS (by restart that just means restart the WS service to be clear, just running sudo service fah-work restart) when adding RUNs or editing CLONEs.

You do not need to restart the WS when using the API.

Being able to obtain an API cert with an expiry more than a month would be ideal once we have things up and running, since this is a manual process currently.

You can programmatically renew the cert via the API. You just need to do so before it expires. An as yet unwritten API wrapper in Python should do this automatically.

dotsdl commented 1 year ago

Thank you so much for this @jcoffland! This has helped me understand a lot better what is actually possible here.

Yes, we are effectively trying to build a simple job queue, as you describe, using the PRCG system. Would it be possible for the WS adaptive sampling API to expose an alternative scheme to PRCG, such as PJ (PROJECT, JOB)? This would remove the need for our compute service to externally shoehorn this pattern into the PRCG model.

If we did continue with casting this behavior into the PRCG model on our own, it could still be done I think. A few more clarifying questions:

  1. Could we create CLONEs in the $home directory for the PROJECT on a per-RUN basis? Then have the create-command be something like:
    • ln -s -f $home/RUNS/RUN$run/CLONE$clone/state.xml.bz2 $jobdir/state.xml.bz2
    • this would enable the proposal above
  2. send must then be defined in a similar way, assuming we can use $clone like above?
    • depends on guaranteed predictability of CLONE ID, since we'd need to deposit files for a given RUN+CLONE before the CLONE exists; assume it increments by 1 each time within a given RUN?
  3. Is it possible to tell the WS to create a GEN on a per-CLONE basis imperatively? Or is this only defined globally for all RUNs+CLONEs in the project.xml via gens?
    • trying not to rule out use of GENs here, but so far not sure they'd work for the proposal above
  4. We should be good for now sticking to just RUNs+CLONEs and not really using GENs, as this gives max (2^16)^2 WUs in a single PROJECT, or nearly 4.3 billion
    • for reference, this is about 100,000x larger than the total Tasks completed on the alchemiscale.org alchemiscale instance in last 6 months

Misc. comments/questions:

  1. Excellent that we can do cert renewals via the API! I'm building out our Python client implementation based on the one you shared with me here, in this PR. What API endpoint should I hit, and how, to get a refreshed cert? I don't see this detail documented in the adaptive sampling API doc.
  2. Adding RUN and CLONE deletion endpoints would be helpful for ensuring the WS doesn't run out of space as we execute WUs.
  3. If we do have RUN and CLONE deletion, we'll need to have some way of finding out from the WS what the next unused RUN ID is, and probably the same for CLONE ID, since we shouldn't re-use them.
jcoffland commented 1 year ago

I agree with sticking to just runs/clones in this case. Since your code is running on the same server there's no need to upload or download files via the API. The files will be in /var/lib/fah-work/data/PROJ####/RUN###/CLONE###/. You can access this directory directly and read WU results or add and delete any files you like.

Set gens=1. Then to add a job call PUT /api/projects/####/runs/###/clones/###/create. This will cause the WS to call create-command but it doesn't have to do anything. It could just be /bin/true as long as you ensure that the files pointed to by the send variable are in place when the WS packages the WU for sending. So you could just write state.xml to the appropriate job directory before calling the API create endpoint.

By calling GET /api/projects/####/jobs?since=<last time> periodically you can get a list of which jobs have changed.

You should keep track of the RUN and CLONE of the last job created. Then just increment these values as you create new jobs.
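
As a rough sketch, the job-creation loop might then look something like this in Python (assuming requests with the client certificate for auth; the endpoint paths are as above, but exact payloads and the since format may differ):

import os
import requests

WS = "https://ws.example.org"                  # hypothetical work server URL
CERT = ("api-cert.pem", "api-key.pem")         # client certificate + key
DATA = "/var/lib/fah-work/data"

def create_job(project: int, run: int, clone: int, files: dict[str, bytes]) -> None:
    # 1. put the input files in place before the WS packages the WU for sending
    jobdir = os.path.join(DATA, f"PROJ{project}", f"RUN{run}", f"CLONE{clone}")
    os.makedirs(jobdir, exist_ok=True)
    for name, content in files.items():
        with open(os.path.join(jobdir, name), "wb") as f:
            f.write(content)

    # 2. tell the WS to create the job; create-command can be a no-op (/bin/true)
    r = requests.put(
        f"{WS}/api/projects/{project}/runs/{run}/clones/{clone}/create",
        cert=CERT,
    )
    r.raise_for_status()

def changed_jobs(project: int, since: str) -> list:
    # poll for jobs that have changed since the given timestamp
    r = requests.get(f"{WS}/api/projects/{project}/jobs",
                     params={"since": since}, cert=CERT)
    r.raise_for_status()
    return r.json()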

Your /etc/fah-work/config.xml might look like this:

<config>
  ...
  <core type="0x23">
    <gens v="1"/>
    <create-command v="/bin/true"/>
    <send>
      core.xml
      system.xml
      integrator.xml
      state.xml
    </send>
  </core>
</config>

Any core 0x23 project will then have these defaults. You can sym-link all the files into the job directory before calling the create API endpoint.

What API endpoint should I hit, and how, to get a refreshed cert? I don't see this detail documented in the adaptive sampling API doc.

You need to create a CSR (Certificate Signing Request) just as you do initially, then, using the credentials you've already acquired to access the WS/AS APIs, submit the CSR to the AS API endpoint /api/auth/csr as a JSON object:

{"csr": "<CSR content here>"}

The response should look like this:

{
   "certificate": "<New certificate>",
   "as-cert": "<AS certificate>"
}

Then in subsequent calls to the AS or WS use the new certificate.
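
As a rough sketch in Python (assuming requests and the cryptography package for generating the CSR; I've assumed the CSR is submitted with a POST):

import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

AS = "https://assign.example.org"          # hypothetical assignment server URL
CERT = ("api-cert.pem", "api-key.pem")     # current (still-valid) credentials

# generate a fresh key and CSR
key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "my-ws-client")]))
    .sign(key, hashes.SHA256())
)
csr_pem = csr.public_bytes(serialization.Encoding.PEM).decode()

# submit the CSR using the existing credentials, before they expire
# (HTTP method assumed to be POST here)
r = requests.post(f"{AS}/api/auth/csr", json={"csr": csr_pem}, cert=CERT)
r.raise_for_status()
reply = r.json()

# store the new certificate and key for subsequent AS/WS calls
with open("api-cert-new.pem", "w") as f:
    f.write(reply["certificate"])
with open("api-key-new.pem", "wb") as f:
    f.write(key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),
    ))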

dotsdl commented 11 months ago

Thanks @jcoffland for this! I've taken your feedback and created an updated proposal based on the first one given above; I'm sharing it here for visibility.

I'm working to implement this now, and will follow up here if I hit any snags. @jchodera and @sukritsingh: if you see any issues with this, please let me know.


alchemiscale reference terms:

Here is my updated proposal for how we will interface the alchemiscale/gufe execution model with the Folding@Home execution model:

Defaults defined in /etc/fah-work/config.xml:

<config>
  ...
  <core type="0x23">
    <gens v="1"/>
    <create-command v="/bin/true"/>
    <send>
      $home/RUNS/RUN$run/CLONE$clone/core.xml
      $home/RUNS/RUN$run/CLONE$clone/system.xml.bz2
      $home/RUNS/RUN$run/CLONE$clone/integrator.xml.bz2
      $home/RUNS/RUN$run/CLONE$clone/state.xml.bz2
    </send>
  </core>
</config>
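
As a rough sketch, staging inputs to match those send paths before calling the create endpoint might look like this (the project home path and helper name are just placeholders):

import bz2
import os
import shutil

def stage_clone_inputs(project_home: str, run: int, clone: int,
                       core_xml: str, system_xml: str,
                       integrator_xml: str, state_xml: str) -> str:
    # write the files named in <send> into $home/RUNS/RUN$run/CLONE$clone/
    clonedir = os.path.join(project_home, "RUNS", f"RUN{run}", f"CLONE{clone}")
    os.makedirs(clonedir, exist_ok=True)

    # core.xml is sent uncompressed; the others are bzip2-compressed
    shutil.copy(core_xml, os.path.join(clonedir, "core.xml"))
    for src, dest in [(system_xml, "system.xml.bz2"),
                      (integrator_xml, "integrator.xml.bz2"),
                      (state_xml, "state.xml.bz2")]:
        with open(src, "rb") as fin, bz2.open(os.path.join(clonedir, dest), "wb") as fout:
            fout.write(fin.read())

    return clonedir

# after staging, call PUT /api/projects/<project>/runs/<run>/clones/<clone>/create
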
dotsdl commented 8 months ago

@jcoffland I'm currently getting 404: {"error":"Job P12600:R0:C0 not found"} when I try to perform a PUT to https://mskcc2.foldingathome.org/api/projects/12600/runs/0/clones/0/create.

Any insights as to why this might occur?

The PROJECT exists, and I don't think it's necessary to explicitly create a RUN first (there is no adaptive sampling endpoint for creating a RUN, as far as I can tell)? Sukrit (CC'd) mentioned that this may have something to do with the WS not creating some state for itself until a FAH client connects to it requesting work. Is this the case?

dotsdl commented 8 months ago

@jcoffland regarding points: in our usage pattern detailed above, each FAH RUN corresponds to an alchemiscale Transformation, and these can vary not only in number of atoms and nonbonded parameters (which we account for by choosing a PROJECT of closest "effort"), but also in run length. This means that within a PROJECT, run lengths of work units may vary widely between RUNs, even if memory requirements and per-step compute are similar.

Is there a mechanism (like a multiplier?) that can be applied per-RUN on the base credit for the PROJECT? If so, any recommendation on how we might apply it based on run-length of the work units in a RUN?

jcoffland commented 8 months ago

From looking at the code, there appear to be two ways in which you can get that message:

1) If you do a GET instead of a PUT
2) If the argument clone is in the request, e.g. /?clone=x

I think the first option is the most likely. Make sure you're really doing a PUT request.

dotsdl commented 7 months ago

@jcoffland thanks for this! Just checked: we are indeed doing a PUT request, and we are not providing any query parameters in our request.

Can you give guidance on what to try next? Also, if you're able to perform the PUT to the endpoint above, do you see the same thing?

jcoffland commented 7 months ago

It could be a version issue. The latest WS release is v10.3.4; you were running v10.3.1. I pushed an upgrade to your machine.