DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Feature request: Allow using an existing instance template on gcp #236

Open infalmo opened 2 years ago

mbookman commented 2 years ago

Thanks for the feature request, @infinitepr0!

Can you describe more of the motivation behind the request?

The Cloud Life Sciences API that dsub uses to run tasks does not allow for the specification of an instance template:

https://cloud.google.com/life-sciences/docs/reference/rest/v2beta/projects.locations.pipelines/run#VirtualMachine

So we'd need to bubble up the request for the feature in dsub to the Google team supporting the API. The more you are able to articulate the value of the feature and what capabilities you are currently missing, the better the chance that they can resource an update to the API.

Thanks!

slagelwa commented 1 year ago

Might make it easier/simpler to submit jobs?

E.g. one might be able to replace this:

dsub \
    --provider google-cls-v2 \
    --network projects/XXXXX/global/networks/XXXXX-shared \
    --subnetwork projects/YYYYY/regions/us-west1/subnetworks/YYYYY-west1 \
    --service-account runner@xxxxx.iam.gserviceaccount.com \
    --region us-west1 \
    --use-private-address \
    --min-ram 32 \
    --min-cores 8 \
    --boot-disk-size 10 \
    --disk-size 1500 \
    --project myproject \
    --image us.gcr.io/myproject/bcl2fastq2:2.20.0 \
    --logging gs://mybucket/logging/ \
    --input-recursive INPUT_PATH=gs://mybucket/run \
    --output-recursive OUTPUT_PATH=gs://mybucket/fastq \
    --command 'bcl2fastq \
         --runfolder-dir /mnt/data/input/run \
         --output-dir /mnt/data/output/fastq \
         --sample-sheet /mnt/data/input/SampleSheet.csv' \
    --wait

with this?

dsub \
    --provider google-cls-v2 \
    --template convert \
    --project myproject \
    --image us.gcr.io/myproject/bcl2fastq2:2.20.0 \
    --logging gs://mybucket/logging/ \
    --input-recursive INPUT_PATH=gs://mybucket/run \
    --output-recursive OUTPUT_PATH=gs://mybucket/fastq \
    --command 'bcl2fastq \
         --runfolder-dir /mnt/data/input/run \
         --output-dir /mnt/data/output/fastq \
         --sample-sheet /mnt/data/input/SampleSheet.csv' \
    --wait

Granted, if it were something you ran frequently, most people would probably just put it into a script with a few parameters and use that. But with a template, if any of your machine or networking parameters need to change, you just update the template instead of having to hunt down all your scripts.
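
For illustration, the script approach mentioned above might look roughly like the sketch below. It keeps the machine and networking flags from the first example in one shell function, so they only need to change in one place. The names `dsub_convert` and `DSUB_BIN` are hypothetical and not part of dsub, and the flag values are the same placeholders used in the example.

```shell
# Hypothetical wrapper (not part of dsub): holds the shared machine and
# networking flags in one place and passes any job-specific flags through.
# DSUB_BIN is an assumed override hook; set it to "echo" for a dry run.
dsub_convert() {
  "${DSUB_BIN:-dsub}" \
    --network projects/XXXXX/global/networks/XXXXX-shared \
    --subnetwork projects/YYYYY/regions/us-west1/subnetworks/YYYYY-west1 \
    --service-account runner@xxxxx.iam.gserviceaccount.com \
    --region us-west1 \
    --use-private-address \
    --min-ram 32 \
    --min-cores 8 \
    --boot-disk-size 10 \
    --disk-size 1500 \
    "$@"
}
```

A job would then be submitted with something like `dsub_convert --provider google-cls-v2 --project myproject ...`, with the remaining job-specific flags as in the commands above. This only centralizes flags within one script file; it does not give the cross-script sharing that a real instance template would.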

mbookman commented 1 year ago

FWIW, we are in the process of adding support for the new Google Batch API.

One feature of the API is the ability to create a job from a Compute Engine instance template.

Once we have feature parity and stability of dsub with the Batch provider, we'll explore some of the new capabilities that the new API enables.

slagelwa commented 1 year ago

I hadn't heard of the Google Batch API. Looks like a replacement for Life Sciences? Do you think there are going to be any limitations to using Batch over Life Sciences?