DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Add support to attach read only disk #132

Open hnawar opened 6 years ago

hnawar commented 6 years ago

I'm trying to run multiple pipelines that all read the same data (~20 TB). Copying this to all containers seems unreasonable, and the best way I could think of is to put it on a shared read-only PD.

Is there a way to attach a read-only data disk?

mbookman commented 6 years ago

Hi @hnawar.

There is presently no support for attaching a disk read-only, although that is a very reasonable request and something that could now be done with the google-v2 provider (it was not possible with the google provider).

For now I would suggest experimenting with putting the resources into a GCS bucket and mounting the bucket with gcsfuse. The --mount parameter was just added in the most recent release, 0.2.1.

Usage is documented here:

https://github.com/DataBiosphere/dsub/blob/c968683f309577ca86fdd0c05fdf618a938c6088/README.md#mounting-buckets
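Following the README above, a minimal invocation might look like this (project, bucket, and mount names are placeholders):

```shell
# Sketch with placeholder names: mount a GCS bucket read-only via gcsfuse.
# The bucket contents appear under the path held in the RESOURCES
# environment variable inside the task container.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --mount RESOURCES=gs://my-shared-data-bucket \
  --command 'ls -l "${RESOURCES}"' \
  --wait
```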

hnawar commented 6 years ago

We tried gcsfuse, but the performance was underwhelming. It would be interesting to add an option to specify the name of a read-only disk and the path to mount it at.

mbookman commented 6 years ago

Thanks @hnawar. Looking deeper into the Pipelines v2 disk support, I'm not sure that this is in fact supported.

https://cloud.google.com/genomics/reference/rest/Shared.Types/Disk https://cloud.google.com/genomics/reference/rest/Shared.Types/Action#Mount

The Action#Mount type implies the ability to mount disks read-only into an action, but the Disk resource does not appear to support a way to mount an existing PD to the VM first.

Will verify this with the Cloud Health team.

mbookman commented 6 years ago

I checked with the Cloud Health team and the recommended approach here is to create a Compute Engine Image and create the disk from that image.
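Assuming the data has already been loaded onto a persistent disk, that image-creation step might look like the following (disk and image names are placeholders):

```shell
# Sketch with placeholder names: capture the data disk as a reusable
# Compute Engine Image, from which fresh disks can later be created.
gcloud compute images create my-resources-image \
  --project my-project \
  --source-disk my-resources-disk \
  --source-disk-zone us-central1-a
```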

Next step will be to wire through a new --mount option. I think it would look something like:

--mount RESOURCES=https://www.googleapis.com/compute/v1/<image-path>

As an example, using one of the public ubuntu images:

--mount RESOURCES=https://www.googleapis.com/compute/v1/projects/eip-images/global/images/ubuntu-1404-lts-drawfork-v20181102

We would key off of https://www.googleapis.com/compute to detect a request to mount a GCE image, in the same way that we key off of gs:// to detect mounting a GCS bucket. Implicit here is that we would request creation of a new disk, which would be mounted readOnly into the user-action container.

mbookman commented 5 years ago

Experimental support for mounting a PD built from a Compute Engine Image has been added in release 0.2.4, specifically with change https://github.com/DataBiosphere/dsub/pull/139/commits/0c4a93a59dc5e00100e1e4edae761ee7e761bddd.

Let us know how this goes.
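Per the README, the image URL is passed to --mount together with a disk size in GB. A sketch with placeholder names:

```shell
# Sketch: create a 200 GB disk from the image and mount it read-only.
# The task sees the disk contents under the RESOURCES environment variable.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/my-project/global/images/my-resources-image 200" \
  --command 'ls -l "${RESOURCES}"' \
  --wait
```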

hnawar commented 5 years ago

Thanks. I just resumed working on this after the break. I want to add that a key factor is cost: the same job will run for thousands of data points, each of which needs to read the same 20 TB. This means a 20 TB PD will be created for every job, which will limit the number of jobs running in parallel due to the PD quota and also add significant cost to the whole process.

michaeleekk commented 2 years ago

I came across this issue while searching. I'm trying to run AlphaFold with dsub, but the 3 TB database disk increases the cost by quite a lot. I'm wondering whether mounting the disk read-only and sharing it across multiple instances might be supported in the near future. My dsub command right now looks like this:

dsub --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --logging gs://$BUCKET/logs \
  --image=$IMAGE \
  --script=alphafold.sh \
  --mount DB="${IMAGE_URL} 3000" \
  --machine-type n1-standard-16 \
  --boot-disk-size 100 \
  --subnetwork ${SUBNET_NAME} \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 2 \
  --preemptible \
  --zones ${ZONE_NAMES} \
  --tasks batch_tasks.tsv 9

I tried to put a disk URI, https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-disk, into the IMAGE_URL, but it didn't work. The log said it's not a supported resource.

And speed matters, as it did for @hnawar, so mounting a bucket won't be an option.

mbookman commented 2 years ago

Hi @michaeleekk !

The way to do this (as supported by the Life Sciences API) is to create a disk with the resource file(s) and then create a GCE Image from that disk.

In dsub, you can then have a new disk created from the Image as described here:

https://github.com/DataBiosphere/dsub#mounting-resource-data

Creating Disks from Images should be much faster than pulling all of the data from GCS a file at a time. Please give this a try and let us know how it goes.

michaeleekk commented 2 years ago

@mbookman Thanks for the reply.

I tried that method, but that way each instance gets its own individual 3 TB disk attached, and the cost of each run becomes expensive.

That's why I was asking if there is a way to share a disk between multiple instances, as mentioned on this page.

hnawar commented 2 years ago

That is supported by the Life Sciences API by using Volume instead of Disk: https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta?hl=en#google.cloud.lifesciences.v2beta.ExistingDisk. But it is not supported by dsub at the moment.

I'm not very familiar with the code, but it's worth evaluating how much effort would be needed to switch.

michaeleekk commented 2 years ago

I just had a peek, and (I might be wrong, but) adding another mount type here, such as an ExistingDiskMountParam, plus handling that new class and parsing the URI properly, might work. dsub seems to be a wrapper around the Life Sciences API, so it's theoretically workable, I guess.

mbookman commented 2 years ago

Thanks for the pointer @hnawar! I had not seen that ExistingDisk support was added (late 2020). We'll look at extending the --mount flag to take advantage of this capability.

mbookman commented 2 years ago

Release https://github.com/DataBiosphere/dsub/releases/tag/v0.4.7 adds support for ExistingDisk by extending the URL formats recognized by the --mount flag.

Here's the change: https://github.com/DataBiosphere/dsub/commit/2d0b808def65bc6100e4da81d9f82e241bbfb8c9

Please take a look and let us know how it goes for you. This has (obviously) been a long-standing feature request, and we're pretty interested to hear how much computational time it saves you.
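For reference, attaching an existing disk with v0.4.7 should look something like the following sketch (project, zone, and disk names are placeholders; the disk is attached read-only, so it can be shared across concurrently running VMs):

```shell
# Sketch with placeholder names: mount an existing zonal persistent disk
# read-only using the zone-scoped disk URL recognized since v0.4.7.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones us-west1-b \
  --logging gs://my-bucket/logs \
  --mount DB="https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-disk" \
  --command 'ls -l "${DB}"' \
  --wait
```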