hnawar opened this issue 6 years ago

I'm trying to run multiple pipelines that all read the same data (~20TB). Copying this to all containers seems unreasonable, and the best way I could think of is to put the data on a shared read-only PD.

Is there a way to attach a read-only data disk?
Hi @hnawar.
There is no support presently for attaching a disk read-only, although that is a very reasonable request and something that can be done now with the google-v2 provider (this was not possible with the google provider).
For now, I would suggest experimenting with putting the resources into a GCS bucket and mounting the bucket with gcsfuse. The --mount parameter was just added in the most recent release, 0.2.1. How to use it is documented here:
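In the meantime, here is a rough sketch of a bucket mount (not taken from the docs; the project, bucket, and command are placeholders):

```
# Hypothetical sketch: mount a GCS bucket via dsub's --mount flag (0.2.1+).
# Inside the container, ${RESOURCES} points at the gcsfuse mount point.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --mount RESOURCES=gs://my-bucket/reference-data \
  --command 'ls -l "${RESOURCES}"' \
  --wait
```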
We tried GCSFuse, but the performance was underwhelming. It would be interesting to add the option to specify the name of a read-only disk and a path to mount it.
Thanks @hnawar. Looking deeper into the Pipelines v2 disk support, I'm not sure that this is in fact supported.

https://cloud.google.com/genomics/reference/rest/Shared.Types/Disk
https://cloud.google.com/genomics/reference/rest/Shared.Types/Action#Mount

The Mount type implies the ability to mount disks read-only into an action, but the Disk resource does not appear to support a way to attach an existing PD to the VM first.

I will verify this with the Cloud Health team.
I checked with the Cloud Health team and the recommended approach here is to create a Compute Engine Image and create the disk from that image.
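For anyone following along, the image-building step might look roughly like this sketch (not an official recipe; all names, sizes, and zones are placeholders):

```
# Hypothetical sketch: stage reference data onto a disk, then turn that
# disk into a reusable Compute Engine Image. All names are placeholders.

# 1. Create a disk and attach it to a temporary staging VM.
gcloud compute disks create staging-disk --size 200GB --zone us-central1-a
gcloud compute instances create staging-vm --zone us-central1-a
gcloud compute instances attach-disk staging-vm \
  --disk staging-disk --zone us-central1-a

# 2. On the VM: format and mount the disk, then copy the data onto it,
#    e.g. gsutil -m rsync -r gs://my-bucket/resources /mnt/disk

# 3. Detach the disk and build an image from it.
gcloud compute instances detach-disk staging-vm \
  --disk staging-disk --zone us-central1-a
gcloud compute images create my-resources-image \
  --source-disk staging-disk --source-disk-zone us-central1-a
```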
Next step will be to wire through a new --mount option. I think it would look something like:
--mount RESOURCES=https://www.googleapis.com/compute/v1/<image-path>
As an example, using one of the public Ubuntu images:
--mount RESOURCES=https://www.googleapis.com/compute/v1/projects/eip-images/global/images/ubuntu-1404-lts-drawfork-v20181102
We would key off of https://www.googleapis.com/compute to detect the request to mount a GCE Image, in the same way that we key off of gs:// to detect mounting a GCS bucket. Implicit here is that we would request creation of a new disk, which would be mounted readOnly into the user-action container.
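Put together, an invocation under this proposal might look like the sketch below (the project, image, and the trailing disk size in GB are placeholder assumptions):

```
# Hypothetical sketch of the proposed image mount; inside the container,
# ${RESOURCES} would point at the mount path of the newly created disk.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/my-project/global/images/my-resources-image 200" \
  --command 'ls -l "${RESOURCES}"' \
  --wait
```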
Experimental support for mounting a PD built from a Compute Engine Image has been added in release 0.2.4, specifically with change https://github.com/DataBiosphere/dsub/pull/139/commits/0c4a93a59dc5e00100e1e4edae761ee7e761bddd.
Let us know how this goes.
Thanks. I just resumed working on this after the break. I just want to add that a key factor is cost: the same job will run for thousands of data points, and each will need to read the same 20TB. This means that for every job a 20TB PD will be created, which will limit the number of jobs running in parallel due to the PD quota and also add significant cost to the whole process.
I came across this issue when I did a little searching. I'm trying to run AlphaFold with dsub, but the 3TB database disk is increasing the cost by quite a lot. I'm wondering whether mounting a disk read-only and sharing it across multiple instances might happen in the near future? My dsub command right now looks like this:
```
dsub --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --logging gs://$BUCKET/logs \
  --image=$IMAGE \
  --script=alphafold.sh \
  --mount DB="${IMAGE_URL} 3000" \
  --machine-type n1-standard-16 \
  --boot-disk-size 100 \
  --subnetwork ${SUBNET_NAME} \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 2 \
  --preemptible \
  --zones ${ZONE_NAMES} \
  --tasks batch_tasks.tsv 9
```
I tried putting a disk URI, https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-disk, into IMAGE_URL, but it didn't work. The log said it's not a supported resource.
And speed matters here, as it did for @hnawar, so mounting a bucket won't be an option.
Hi @michaeleekk!
The way to do this (as is supported by the Life Sciences API) is to create a disk with the resource file(s) and then create a GCE Image from that disk.
In dsub, you can then have a new disk created from the Image, as described here:
https://github.com/DataBiosphere/dsub#mounting-resource-data
Creating Disks from Images should be much faster than pulling all of the data from GCS a file at a time. Please give this a try and let us know how it goes.
@mbookman Thanks for the reply.
I tried that method. But this way, each instance gets an individual 3TB disk attached, so the cost of each run becomes expensive.
That's why I was asking if there is a way to share a disk between multiple instances, as mentioned on this page.
That is supported by the Life Sciences API by using Volume instead of Disk: https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta?hl=en#google.cloud.lifesciences.v2beta.ExistingDisk. But it is not supported by dsub at the moment. I'm not very familiar with the code, but it's worth evaluating how much effort is needed to switch from Disk to Volume.
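For context, the appeal of ExistingDisk is that the large disk would be created only once, up front, and then attached read-only by every VM. A hypothetical setup sketch (the image, disk name, size, and zone are placeholders):

```
# Hypothetical sketch: create one disk from the reference-data image so
# that many VMs can later attach it read-only. Names are placeholders.
gcloud compute disks create my-shared-db-disk \
  --image my-db-image \
  --size 3000GB \
  --zone us-west1-b \
  --project my-project
```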
I just had a peek, and I might be wrong, but adding another mount type here, something like ExistingDiskMountParam, plus handling the new class and parsing the URI more properly, might work. dsub seems to be a wrapper around the Life Sciences API, so it's theoretically workable, I guess.
Thanks for the pointer, @hnawar! I had not seen that ExistingDisk support was added (late 2020). We'll look at extending the --mount flag to take advantage of the capability.
Release https://github.com/DataBiosphere/dsub/releases/tag/v0.4.7 adds support for ExistingDisk by extending the URL formats recognized by the --mount flag.
Here's the change: https://github.com/DataBiosphere/dsub/commit/2d0b808def65bc6100e4da81d9f82e241bbfb8c9
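If the recognized format mirrors the disk URI quoted earlier in this thread, usage would presumably look like the sketch below (all names are placeholders; see the release notes for the exact format):

```
# Hypothetical sketch: mount an existing PD read-only via the extended
# --mount URL formats in dsub v0.4.7. All names are placeholders.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --logging gs://my-bucket/logs \
  --image "${IMAGE}" \
  --script alphafold.sh \
  --mount DB="https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-disk" \
  --zones us-west1-b \
  --tasks batch_tasks.tsv
```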
Please take a look and let us know how it goes for you. This has (obviously) been a long-standing feature request, and we're pretty interested to hear how much computational time it saves you.