DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Wrong machine type when using google-batch provider #283

Closed: nithinjoshy closed this issue 6 months ago

nithinjoshy commented 10 months ago

I am trying to switch dsub to the google-batch provider, but the worker's machine type is always "e2-highcpu-2" instead of the one I pass as an argument. I have used the call below for about a year without the "--provider google-batch" line and it has worked, which makes me think google-batch is the problem.

dsub \
    --provider google-batch \
    --project ${PROJECT} \
    --zones "us-central1-a" \
    --logging gs://${DSUB_BUCKET}/logs \
    --env ID=xxx \
    --env PROJECTID="xxx" \
    --input SCRIPT=gs://${BUCKET}/xxx.py \
    --input XXXX_SCRIPT=gs://${BUCKET}/xxx.py \
    --input REF=${REF} \
    --input-recursive XXXDIR=${inputurl} \
    --output-recursive OUTPUT_PATH=gs://${OUTPUT_BUCKET}/xxx \
    --output-recursive XXX_PATH=gs://${XXX_BUCKET}/xxx \
    --image ${XXX_IMAGE} \
    --script xxx.sh \
    --disk-size 1000 \
    --name "xxxx" \
    --machine-type n1-highmem-16 \
    --boot-disk-size 30

I have also tried replacing "--machine-type" with the following two lines, but I still get an "e2-highcpu-2" machine, which is not sufficient for my needs.

    --min-ram 80 \
    --min-core 8 \

My jobs fail with the error below. I suspect the cause is that the machine has too little memory for the program I am running, which is why I am asking about the machine type.

Job state is set from RUNNING to FAILED for job projects/xxxxxxxxxx/locations/us-central1/jobs/xxxxx. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-f/instances/xxxxxxxxx due to Batch no longer receives VM updates with exit code 50002.

Is there a known reason why the machine is always e2-highcpu-2, and is there a way to get a different machine type?

I apologize in advance if this is already documented somewhere or if this is the wrong place to ask.

wnojopra commented 10 months ago

Hi @nithinjoshy! We'll need to update the google-batch provider in a future release. The code for specifying a custom machine is here, in the google_v2_base provider, which unfortunately isn't currently shared with the google-batch provider. The Batch provider's instance policy code will also need to be updated to accept this machine type, based on this documentation.
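To make the custom-machine idea concrete, here is a minimal sketch of how a minimum core count and minimum RAM could be turned into a Compute Engine custom machine type name. It is not taken from dsub: the function name `custom_machine_type` is hypothetical, and it assumes the documented N1 custom-machine rules (1 or an even number of vCPUs, memory in multiples of 256 MB, and between 0.9 GB and 6.5 GB of memory per vCPU). The actual provider logic may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not dsub code): map a minimum core count and a minimum
# RAM size in GB to an N1-style name of the form custom-<vCPUs>-<memory MB>.
custom_machine_type() {
  local cores=$1
  local ram_mb=$(( $2 * 1024 ))

  # vCPU count must be 1 or an even number: round odd counts up.
  if (( cores > 1 && cores % 2 == 1 )); then
    cores=$(( cores + 1 ))
  fi

  # Total memory must be a multiple of 256 MB: round up.
  ram_mb=$(( (ram_mb + 255) / 256 * 256 ))

  # At most 6.5 GB (6656 MB) per vCPU: add cores until the ratio fits.
  while (( ram_mb > cores * 6656 )); do
    if (( cores == 1 )); then
      cores=2
    else
      cores=$(( cores + 2 ))
    fi
  done

  # At least 0.9 GB (921.6 MB) per vCPU: raise memory to that floor,
  # again rounded up to a multiple of 256 MB.
  local min_mb=$(( (cores * 9216 + 9) / 10 ))
  if (( ram_mb < min_mb )); then
    ram_mb=$(( (min_mb + 255) / 256 * 256 ))
  fi

  echo "custom-${cores}-${ram_mb}"
}

# The request from this issue: 8 cores and 80 GB of RAM.
custom_machine_type 8 80
```

Under these assumptions, the 8-core, 80 GB request from this issue comes out as `custom-14-81920`: 80 GB spread over 8 vCPUs exceeds 6.5 GB per vCPU, so the sketch adds cores rather than reducing memory.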

lm-jkominek commented 9 months ago

@wnojopra Just curious if there are any updates on allowing custom machine types with Google Batch?

mccstan commented 8 months ago

Hello, I am contributing to this and I have an ongoing PR: https://github.com/DataBiosphere/dsub/pull/285

lm-jkominek commented 8 months ago

Thank you @mccstan, appreciate this so much! I was actually starting to look into building my own wrapper around gcloud batch to get this done, but it's awesome to see this being fixed within dsub :)

mccstan commented 8 months ago

@wnojopra My PR is ready for review.

wnojopra commented 6 months ago

Hi @nithinjoshy and @lm-jkominek :

We just put out release 0.4.11, which includes:

When you get the chance, can you verify if it resolves your issues?
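Before retesting, it helps to confirm that the installed dsub is at least 0.4.11. The sketch below is one way to compare dotted version strings in shell. The helper name `version_at_least` is hypothetical, the installed version is typed in by hand (for example, copied from `pip show dsub`), and the comparison assumes a `sort` that supports `-V`, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on `sort -V` (version sort) placing the smaller version first.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the installed version is filled in by hand here.
installed="0.4.11"

if version_at_least "$installed" "0.4.11"; then
  echo "dsub $installed is at or above 0.4.11; retest with --provider google-batch"
else
  echo "dsub $installed predates 0.4.11; upgrade before retesting"
fi
```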

lm-jkominek commented 6 months ago

@wnojopra, thank you for this, much apprish! And sure thing, I will take a look later this week to see how it performs in the wild

lm-jkominek commented 6 months ago

@wnojopra, I ran 0.4.11 with google-batch and it provisioned the resources I asked for, so I can confirm that it works, at least for me, yay! One issue I noticed, though: all the VMs spun up in the us-central1 region instead of us-east1, which I specified via --regions, so something may be amiss in that corner? I submitted the jobs with the command I normally use with google-cls-v2; does batch use a different parameter for that?

wnojopra commented 6 months ago

Hi @lm-jkominek, great to hear that everything works except the regions. I've filed #289 to track the region issue. It seems like a quick fix for the next release.