Closed: pgm closed this pull request 3 months ago.
First and foremost, thank you so much for the PR @pgm!
So I had raised this issue with the Batch API team about two months ago, and the conclusion was that it was a UI issue.
The expectation is that when you specify `--min-ram 10`, dsub will specify a custom machine type that has at least 10 GB of RAM. That is what the `custom-2-10240` machine type indicates.
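For anyone following along, that name is just GCE's custom machine type naming, `custom-<vCPUs>-<memory MiB>`. Roughly this mapping (a hypothetical illustration only, not dsub's actual code):

```python
# Hypothetical illustration of the naming above (not dsub's actual code):
# GCE custom machine types are named custom-<vCPU count>-<memory in MiB>.
def custom_machine_type(cores: int, ram_gb: float) -> str:
    memory_mib = int(ram_gb * 1024)  # 10 GB -> 10240
    return f"custom-{cores}-{memory_mib}"

print(custom_machine_type(2, 10))  # custom-2-10240
```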
So now we get to your screenshot. Despite the machine having 10 GB of RAM, the Google Cloud UI says tasks are capped at 2 GB of RAM and 2 CPU cores. I asked the Batch API team about this, and they said that the per-task resource requirements are treated as an intent, which Batch uses to calculate how many tasks can fit on a VM. But tasks are free to use all resources once they are on the VM. I even tested it myself in this comment and verified that the task had all the memory available despite the UI saying otherwise.
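If anyone wants to sanity-check that from inside a task, something like the following works (a minimal sketch that just reads `/proc/meminfo` on Linux); it reports the VM's full memory rather than the 2 GB figure shown in the UI:

```python
# Minimal sketch: report the memory the VM actually exposes to the task.
# On a custom-2-10240 VM this prints close to the full 10 GB (minus kernel
# reservations), not the 2 GB per-task figure shown in the Cloud console.
def mem_total_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # /proc/meminfo reports kB
                return kib / (1024 * 1024)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

print(f"MemTotal: {mem_total_gib():.2f} GiB")
```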
So all that being said, are you noticing with this change that your job is no longer killed from OOM? If so, this would be a very recent change in the Batch API's behaviour that I'd like to bring up with them.
Ah, I'll close this PR because it appears you're right.
I commented out my change and tried the job submission without `compute_resource` being set, and it still worked fine.
Looking back on the sequence of events that led me to make this PR, I think the issue may have been that I was originally testing with an old version of dsub. With an older version, it definitely failed with signs of memory exhaustion.
Perhaps when I tried with the newest version, I didn't actually confirm that the job was still hitting the same out-of-memory condition; I may have just assumed it was when I saw that memory usage wasn't increasing in the Google console view of the tasks.
Regardless, seems like the newest version is working fine without my change.
Sorry for the confusion.
Appreciate the response, @pgm. You're not the first one to be confused by what the UI is showing there, and this is good feedback for that team.
I tried switching from the Pipelines API provider to the google-batch provider for a batch job that we've run regularly for years. However, when submitted to the Batch API, the job was consistently getting killed, apparently by the OOM killer.
The job was submitted with `--min-ram 10`, and looking at the cloud console I can see that an appropriate machine type was allocated, but the memory reported for the tasks is 1.95 GB. Looking at the code, I noticed that `compute_resource` was not being set, and according to the Batch API docs there are per-task memory and CPU caps which default to 2 GB and 2 CPUs if not specified otherwise.
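For reference, this is roughly what explicitly setting that per-task field looks like with the `google-cloud-batch` Python client (a minimal standalone sketch; the project, region, image, and job id are placeholders, and dsub's actual submission code differs):

```python
from google.cloud import batch_v1

# Explicit per-task request; without this, the Batch API falls back to its
# per-task defaults (shown in the console as roughly 2 GB RAM / 2 CPUs).
compute_resource = batch_v1.ComputeResource(
    cpu_milli=2000,    # 2 vCPUs, expressed in milli-CPUs
    memory_mib=10240,  # 10 GiB, matching --min-ram 10
)

task_spec = batch_v1.TaskSpec(
    runnables=[
        batch_v1.Runnable(
            container=batch_v1.Runnable.Container(
                image_uri="ubuntu",
                commands=["bash", "-c", "free -g"],  # print visible memory
            )
        )
    ],
    compute_resource=compute_resource,
)

# Pin the VM shape to the custom machine type discussed above.
allocation_policy = batch_v1.AllocationPolicy(
    instances=[
        batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
            policy=batch_v1.AllocationPolicy.InstancePolicy(
                machine_type="custom-2-10240",
            )
        )
    ]
)

job = batch_v1.Job(
    task_groups=[batch_v1.TaskGroup(task_spec=task_spec, task_count=1)],
    allocation_policy=allocation_policy,
)

client = batch_v1.BatchServiceClient()
created = client.create_job(
    request=batch_v1.CreateJobRequest(
        parent="projects/my-project/locations/us-central1",  # placeholders
        job_id="min-ram-demo",
        job=job,
    )
)
print(created.name)
```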
This PR adds a change which uses the values from `job_resources` to populate the `compute_resource` field on the job submission. It seems to work for me, and I thought it'd be useful to merge it into the mainline for anyone else who might encounter this (especially as the Pipelines API is being shut down and the Batch API is what Google recommends migrating to).
Please let me know if there are additional changes that you would like in order to get this merged in.