DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Write to mount #211

Closed: rivershah closed this issue 3 years ago

rivershah commented 3 years ago

This is less a specific issue than a discussion topic (GitHub is launching a Discussions feature shortly, which is where this item belongs). I was able to mount a path and read from it successfully with both the local and cloud providers. However, I was unable to write to the disk mount, like so:

d_path = os.path.join(os.environ["DISK_MOUNT"], "test.csv")
df = pd.DataFrame([[0, 0], [1, 1]], index=["i0", "i1"], columns=["c0", "c1"])
df.to_csv(d_path)  # fails: the mount is a read-only file system

After specifying an output path, I was able to write to that location instead, for example:

d_path = os.path.join(os.environ["OUTPUT_PATH"], "test.csv")
df = pd.DataFrame([[0, 0], [1, 1]], index=["i0", "i1"], columns=["c0", "c1"])
df.to_csv(d_path)  # succeeds: the output path is writable

My question is: why is writing to disk mounts prohibited (read-only file system)? Is this a matter of good practice, or is it due to gcsfuse or some other limitation or constraint? Thanks for any explanation that helps me understand this better.

mbookman commented 3 years ago

If I understand correctly, you are asking about the --mount option, which mounts either a persistent disk (created from an image) or a GCS bucket (using gcsfuse).

The intention of --mount in both cases is to make large read-only resource sets available. The top-level README says:

Mounting "resource data"

If you have one of the following:

  1. A large set of resource files, your code only reads a subset of those files, and the decision of which files to read is determined at runtime, or
  2. A large input file over which your code makes a single read pass or only needs to read a small range of bytes,

then you may find it more efficient at runtime to access this resource data via mounting a Google Cloud Storage bucket read-only or mounting a persistent disk created from a Compute Engine Image read-only.

Please review that documentation, along with

https://github.com/DataBiosphere/dsub/blob/master/docs/input_output.md
https://github.com/DataBiosphere/dsub/blob/master/docs/providers/README.md

Lastly, as a general comment: while gcsfuse allows mounting buckets read/write, it is easy to get into trouble using gcsfuse to write objects. We would need a strong use case before we would consider adding read/write bucket mounts as an option in dsub. I recommend reading:

https://github.com/GoogleCloudPlatform/gcsfuse#performance
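
As a rough sketch of how the two options fit together in one job: --mount exposes read-only resource data, while --output-recursive names a writable destination. The variable names DISK_MOUNT and OUTPUT_PATH come from this thread. Everything else here is an assumption for illustration: the local paths, the script name task.py, the choice of the local provider, and the helper build_dsub_command, which only assembles the command line and does not run dsub.

```python
import shlex


def build_dsub_command(mount_uri, output_uri, logging_uri, script):
    """Assemble a dsub command line (hypothetical helper; it does not run dsub)."""
    return [
        "dsub",
        "--provider", "local",
        "--logging", logging_uri,
        # Read-only resource data, exposed to the task as $DISK_MOUNT.
        "--mount", f"DISK_MOUNT={mount_uri}",
        # Writable output directory, exposed to the task as $OUTPUT_PATH.
        "--output-recursive", f"OUTPUT_PATH={output_uri}",
        "--script", script,
        "--wait",
    ]


# All paths below are placeholders, not values taken from this thread.
cmd = build_dsub_command(
    mount_uri="file:///tmp/dsub-demo/resources",
    output_uri="/tmp/dsub-demo/output/",
    logging_uri="/tmp/dsub-demo/logging/",
    script="task.py",
)
print(shlex.join(cmd))
```

Inside the task, files under $DISK_MOUNT would then be read-only, and anything written under $OUTPUT_PATH would be copied to the output location when the task finishes.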

rivershah commented 3 years ago

@mbookman Yes, the question was indeed about the --mount option. Thanks for the answer and the documentation links. Some very straightforward refactors on my end now put the output files in the desired locations. I will keep the --mount option strictly for read-only resources.
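
A minimal sketch of that kind of refactor, assuming the job exposes a read-only mount as DISK_MOUNT and a writable recursive output as OUTPUT_PATH (the names used earlier in this thread). It uses the standard-library csv module rather than pandas so that it stays self-contained. The function summarize, the file name, and the temporary directories standing in for the two locations are all hypothetical.

```python
import csv
import os
import tempfile


def summarize(mount_dir, output_dir, name="test.csv"):
    """Read a CSV from the read-only mount and write a copy with row totals to the output dir."""
    src = os.path.join(mount_dir, name)
    dst = os.path.join(output_dir, name)
    os.makedirs(output_dir, exist_ok=True)
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader)
        writer.writerow(header + ["total"])
        for row in reader:
            # The first column is the row label; the remaining columns are integers.
            writer.writerow(row + [sum(int(v) for v in row[1:])])
    return dst


# Stand-ins for $DISK_MOUNT and $OUTPUT_PATH so the sketch runs outside dsub.
mount_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
with open(os.path.join(mount_dir, "test.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["", "c0", "c1"], ["i0", 0, 0], ["i1", 1, 1]])

out_path = summarize(mount_dir, output_dir)
with open(out_path, newline="") as f:
    print(list(csv.reader(f)))
```

In a real task the two directories would instead be os.environ["DISK_MOUNT"] and os.environ["OUTPUT_PATH"]; nothing is ever written under the mount.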