SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1

Directory and path management #75

Open mrshirts opened 3 years ago

mrshirts commented 3 years ago

Suggestion: each running instance of the simulation should have its own directory.

Many simulation programs generate intermediate files that have the same names each time the program is invoked. Having multiple instances of the simulation running in the same directory is likely to cause problems with the simulations interfering with each other.

Users should not need to know what the paths of those directories are; this would be something that scale-ms manages automatically.
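For illustration, a minimal sketch of the idea (the helper below is hypothetical, not part of the scale-ms API): each instance runs in its own automatically created directory, so identically named intermediate files from different instances cannot collide.

```python
# Minimal sketch (hypothetical helper, not the scale-ms API): each simulation
# instance gets its own automatically created working directory, so
# identically named intermediate files from different instances cannot collide.
import subprocess
import tempfile

def run_instance(executable, args, instance_id):
    # Unique per-instance directory; the user never needs to know the path.
    workdir = tempfile.mkdtemp(prefix=f"sim_{instance_id}_")
    subprocess.run([executable, *args], cwd=workdir, check=True)
    return workdir
```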

FYI, Signac from the Glotzer group is worth looking at for inspiration as software that manages large-scale simulation ensembles (no adaptive control there, just keeping track of lots of files from lots of program instances): https://docs.signac.io/en/latest/

peterkasson commented 3 years ago

I think each instance should have its own sandbox, but this shouldn't be persistent or externally accessible by default: declared outputs are returned outside of the sandbox. (Having a complex externally accessible tree of results files is something we did with Copernicus, and it was a real pain from the user's point of view.) Does that make sense?
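As a rough sketch of that ephemeral-sandbox idea (hypothetical names, not the scale-ms API): the program runs in a throwaway directory, only the declared outputs are copied out, and everything else is discarded with the sandbox.

```python
# Rough sketch of an ephemeral sandbox (hypothetical names, not the scale-ms
# API): the program runs in a throwaway directory, only the declared outputs
# are copied out, and everything else is discarded with the sandbox.
import pathlib
import shutil
import subprocess
import tempfile

def run_in_sandbox(executable, args, declared_outputs, results_dir):
    results_dir = pathlib.Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as sandbox:
        subprocess.run([executable, *args], cwd=sandbox, check=True)
        # Only the explicitly declared outputs survive the sandbox.
        return [shutil.copy2(pathlib.Path(sandbox) / name, results_dir)
                for name in declared_outputs]
```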


mrshirts commented 3 years ago

I think each instance should have its own sandbox, but this shouldn't be by default persistent or externally accessible:

I'm totally fine with that. The link to Signac was just a reference to an existing and relatively robust implementation (again, they don't have adaptivity, so it's much simpler).

declared outputs are returned outside of the sandbox.

How would this work for handling the input files for the next generation, which don't really need to exist outside of the sandboxes? This is coming up as I write a script that generates the next "generation" of input files by looping over them, since the modified structure files (i.e., they are not direct outputs; they are operated on) are in some sense created before the "modify_input" function is run.

Though maybe any such files should actually be created by the "modify_input" function, which would have access to sandboxes for the next instances.
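A hypothetical sketch of that arrangement (illustrative names only, not the scalems or gmxapi API): the "modify_input"-like step receives the sandboxes allocated for the next instances and writes the new input files directly into them, so those files never need to exist outside a sandbox.

```python
# Hypothetical sketch (illustrative names only, not the scalems or gmxapi
# API): the "modify_input"-like step writes the next generation's input
# files directly into sandboxes allocated for the next instances.
import pathlib

def modify_input(previous_outputs, edits, next_sandboxes):
    new_inputs = []
    for prev, sandbox in zip(previous_outputs, next_sandboxes):
        sandbox = pathlib.Path(sandbox)
        sandbox.mkdir(parents=True, exist_ok=True)
        text = pathlib.Path(prev).read_text()
        # Apply simple text substitutions as a stand-in for real input editing.
        for placeholder, value in edits.items():
            text = text.replace(placeholder, str(value))
        target = sandbox / "input.dat"
        target.write_text(text)
        new_inputs.append(target)
    return new_inputs
```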

eirrgang commented 3 years ago

Ah, thanks. I thought this was already tracked, but I see the closest documented thing is https://github.com/SCALE-MS/scale-ms/issues/15

Yes, I'm pleased with the way Signac turned out, and the data management scheme for gmxapi/scalems is conceptually similar.

See also https://github.com/kassonlab/gmxapi/milestone/4

eirrgang commented 3 years ago

declared outputs are returned outside of the sandbox.

In the most general case, there is an extra copy of the data that is requested to be present outside of the sandbox: one copy owned by the user and one "owned" by scalems. For non-local execution, there is a remote copy in a sandbox and a local copy not in a sandbox.

eirrgang commented 3 years ago

Though maybe any such files should actually be created by the "modify_input" function, which would have access to sandboxes for the next instances.

Yes.

mrshirts commented 3 years ago

So, it seems to be coming around to:

  1. Each fresh instance of the execution has its own fresh directory.
  2. modify_input knows about the fresh directories, so any new input files get written there.
  3. SCALE-MS only knows about files generated by the executable that are explicitly identified by the users.
  4. Files that scale-ms doesn't know about just get left in the directory and cannot be used by the workflow, which is fine; the user knows which files are important.

One other request:

It should be possible to inspect the whole tree of files that are actually produced afterwards, but that can be separate from the workflow language. It would be mostly for debugging, to see what happened.
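A small sketch of what such an after-the-fact inspection could look like, assuming the produced sandboxes are archived under some root directory (the layout is an assumption, not something scale-ms prescribes):

```python
# Walk an assumed archive of per-instance sandboxes and report what each
# instance left behind, for debugging only.
import os

def report_tree(archive_root):
    # Print size and path for every file found under the archive root.
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            print(f"{os.path.getsize(path):>12d}  {path}")
```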

eirrgang commented 3 years ago

For Tasks, rp.raptor.Worker forks itself for each request. (The callable registered with Worker.register_mode() executes in a fresh fork of the Worker task.) The callable is free to chdir. There may be some interplay with the environment isolation work in progress in RP, but it is probably safe to let the callable be responsible for creating its working directory as well. Task directories are to be named deterministically: unique at least with respect to the known deterministic aspects of the task, but maximally recoverable across interrupted execution. As such, we may not require the Worker or callable to calculate its own UUID; we will likely embed it at Master.request_cb() or on the client side, before submitting the task. This way, we maintain a record of expected remote directories before they are even created, so that we can best recover workflow state after an interruption.
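To illustrate the naming idea only (this is not the actual RP or scalems scheme): a working directory name can be derived deterministically from the known, deterministic fields of a task description, so the client can record the expected path before the task is ever created remotely.

```python
# Illustration only (not the actual RP or scalems scheme): derive a task's
# working directory name deterministically from the deterministic fields of
# its description, so the expected path can be recorded client-side up front.
import hashlib
import json

def task_workdir_name(task_description: dict) -> str:
    # Stable serialization of the deterministic task fields.
    digest = hashlib.sha256(
        json.dumps(task_description, sort_keys=True).encode()
    ).hexdigest()
    return f"task.{digest[:16]}"
```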

eirrgang commented 1 year ago

I'm going to "resolve" this issue by adding some metadata tracking for the site at which filesystem paths are useful. As much as possible, I'll add some utility functions and documentation for determining how to access remote files in terms of the RP resource definition.

Additional functionality for automated data staging will be submitted under separate issues.

Note: for the most part, tasks are executed in separate clean working directories. Some updates may be needed to manage working directories for raptor tasks, but @andre-merzky is working on some related infrastructure, so we'll revisit that in a few days. The near-term access will continue to be the directory archive delivered from scalems.call._Subprocess, but I hope that will be refined a bit. scalems.write() is a bit further down the line, but will probably be available by the end of the sprint.
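A hedged sketch of the kind of path metadata described here (invented names, not the scalems implementation): each path is recorded together with the execution site at which it is meaningful, so later tooling can tell a locally readable path from one that needs staging.

```python
# Hedged sketch of path-plus-site metadata (invented names, not the scalems
# implementation): record the execution site at which a filesystem path is
# meaningful.
import dataclasses

@dataclasses.dataclass(frozen=True)
class SitedPath:
    site: str   # e.g. an RP resource label such as "local.localhost"
    path: str   # filesystem path valid at that site

    def is_local(self, current_site: str) -> bool:
        return self.site == current_site
```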

andre-merzky commented 1 year ago

Note: for the most part, tasks are executed in separate clean working directories. Some updates may be needed to manage working directories for raptor tasks

This should be resolved by now: raptor tasks can request their own sandbox.