AgnostiqHQ / covalent

Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
https://www.covalent.xyz
Apache License 2.0
766 stars 91 forks source link

Allow for the creation of unique subfolders in the current working directory to avoid file overwriting #1628

Closed Andrew-S-Rosen closed 1 year ago

Andrew-S-Rosen commented 1 year ago

What should we add?

There needs to be a way in which files can be written out to the current working directory without them potentially overwriting each other between calculations. The solution is to have unique subfolders for each Electron, as described below.

Motivation

In many quantum chemistry codes, files are often written out to the filesystem at runtime. Typically, these files are hard-coded to be written to the current working directory and are both read from and written to throughout the course of the calculation. Therefore, if one launches a quantum chemistry code from within an Electron, these files will be overwriting each other if multiple calculations are going on simultaneously. It is also impossible to preserve the provenance of where these files originated from.

On a personal note, I would love to get my computational chemistry and materials science colleagues on board with Covalent, but this is currently a major dealbreaker in terms of adoption (hopefully not for long though!).

The Problem

In the toy example below, the files are both written out to the same working directory, which will cause file loss and result in unexpected errors in more complex examples where the file I/O cannot be explicitly controlled by the user. For instance, with the SLURM plugin, everything is written out to the same remote_workdir folder (see #1619 for where this currently is for local executors; there is a bug).

import covalent as ct
import os

@ct.electron
def job(val1, val2):
    with open("job.txt", "w") as w:
        w.write(str(val1 + val2))
    return "Done!"

@ct.lattice
def workflow(val1, val2, val3, val4):
    job1 = job(val1, val2)
    job2 = job(val3, val4)
    return "Done!"

dispatch_id = ct.dispatch(workflow)(1, 2, 3, 4)
result = ct.get_result(dispatch_id)
print(result)

Suggestion

My recommendation is that there should be a keyword argument somewhere of the type create_unique_workdir: bool. If set to True, then wherever the default working directory is (which may depend on the executor), Covalent would automatically create subfolders of the form <dispatchID>/node_<nodeID> in the current working directory. Each calculation node would cd into its respective <dispatchID>/node_<nodeID> directory to avoid file overwriting concerns by any external program writing to the current working directory during runtime.

Note that this is distinct from the results_dir, which is where the .pkl and .log files go and may not be on the same filesystem. For instance, in the SLURM executor, the current working directory is set by the remote_workdir. The same could be done for the local executors except that it would be a local workdir as the base directory.

In addition to preventing file overwriting, this has the added benefit of ensuring that the files written out at runtime can be linked back to their corresponding Electrons for reproducibility purposes.

Note

This issue should be addressed after #1619 is closed. This issue was originally discussed in #1592, which I split into two separate issues.

Describe alternatives you've considered.

Always use absolute paths for file I/O

For full context, see here. This route is not sufficient because there are many use cases where the user cannot control where the files are written out to at runtime. That is usually specified by the external code that is being run and is often the current working directory. Also, if Covalent ever adds a feature where the executor is dynamically selected, the user may not know in advance which filesystem the calculation will be run on.

File transfer mechanisms and call backs/befores

For full context, see here. Unfortunately, this route isn't sufficient for the problem. We want to make sure that if there are multiple ongoing calculations that write files out to the current working directory during runtime that they do not overwrite one another. Many computational chemistry codes are hard-coded to rely on being able to write and read input and output files in the current working directory throughout the calculation itself. Transferring files before or after the calculation finishes is a separate issue that should be left to the user.

Using a decorator to change the working directory

For full context, see here. The relevant code snippet from @santoshkumarradha is below:

import covalent as ct
from pathlib import Path

import os
from functools import wraps

def change_dir_and_execute(directory):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_dir = os.getcwd()
            try:
                os.chdir(directory)
                result = func(*args, **kwargs)
            finally:
                os.chdir(current_dir)
            return result
        return wrapper
    return decorator

path=Path(".").absolute()

@ct.electron
@change_dir_and_execute(path)
def job(val1, val2,file):
    with open(file, "w") as w:
        w.write(str(val1 + val2))
    return Path(".").absolute()

@ct.lattice
def workflow(file, val1, val2, val3, val4):
    job1 = job(val1, val2,file)
    return job1

file="example.txt"
dispatch_id = ct.dispatch(workflow)(file, 1, 2, 3, 4)
result = ct.get_result(dispatch_id, wait=True)
print(result)

While this may be possible, it is not concise or clean. One of the major benefits of Covalent as a whole is that there is minimal friction to go from writing a function to running a complex workflow. Given the many foreseeable use cases where significant I/O is written out to the current working directory at runtime (without this being something that can be changed), always needing this verbose approach is less than ideal.

santoshkumarradha commented 1 year ago

@cjao looping you in, the suggestion seems nice, on top of my head I dont see any pitfalls (but I feel like I am missing something), can you comment your thoughts here ?

cjao commented 1 year ago

@arosen93 This is a good suggestion and would be easy to implement at least for the SLURM plugin.

Will you need to access those files afterwards or are they only used to record intermediate states during a computation?

Andrew-S-Rosen commented 1 year ago

Glad to hear the idea is worthwhile!

When possible, the files should be accessible afterwards in case the user wants to share them or debug something (in a manual mode, independent of covalent). They will often persist on the filesystem. However, if a given executor doesn't allow for this, then that's okay. The critical part is that the calculations are not running in the same working directory.

santoshkumarradha commented 1 year ago

@arosen93 This is a good suggestion and would be easy to implement at least for the SLURM plugin.

Will you need to access those files afterwards or are they only used to record intermediate states during a computation?

Thats actually a good call, Making this executor dependent feature. Point that was bothering me is making this universal as I was not sure how remote file systems for cloud executors like AWS would work.

Awesome, I will add this in slurm issues.

Andrew-S-Rosen commented 1 year ago

Thanks, folks!

I think the only other thing to mention here is that I think we should try to make this somewhat consistent between executors when it is possible to control. But that shouldn't be hard. I can try to add the analagous feature to the local executor once #1619 is fixed (but certainly feel free to give it a go, dear reader, if you are feeling inspired).

cjao commented 1 year ago

The files should be accessible afterwards in case the user wants to share them or debug something (in a manual mode, independent of covalent). They will often persist on the filesystem.

I wonder @santoshkumarradha if we can make it easier for users to correlate tasks with their corresponding executor scratch space.

Andrew-S-Rosen commented 1 year ago

See https://github.com/AgnostiqHQ/covalent-slurm-plugin/pull/57 for the PR in the Slurm plugin, which other executors should adopt where possible for consistency.

madhur-tandon commented 1 year ago

@kessler-frost this can be closed too since there are separate issues for slurm and ssh plugins... @arosen93 confirmed this too I guess...

Andrew-S-Rosen commented 1 year ago

Yup, confirmed. This can be closed since everything in "main Covalent" is solved, and the respective plugins have issues.