DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Google Batch provider needs to support --mount #290

Closed: FreshAirTonight closed this issue 3 months ago

FreshAirTonight commented 4 months ago

Thank you for rolling out the new version with updated support for Batch. It appears that the GS bucket mount is not supported yet, or am I missing something?

wnojopra commented 4 months ago

Hi @FreshAirTonight! Thanks for asking - no, `--mount` is not yet supported for the google-batch provider. It will likely be part of the next release. I will report back here when it's available.

FreshAirTonight commented 4 months ago

Hi @wnojopra, thank you for the answer. The reason I am asking is that I am interested in the new machine types for the Nvidia L4, which are available with google-batch but not with google-cls-v2. Is there any plan to add support for the G2 machine types to google-cls-v2?

wnojopra commented 4 months ago

It's unlikely we'll get that to work in google-cls-v2 for two reasons:

1. The Lifesciences API (google-cls-v2 provider) is going away in favor of the Batch API (google-batch provider).

2. I don't think there's anything dsub can do to support additional machine types. That sounds like something the Lifesciences API controls.

I'm actually not familiar with the Nvidia L4 being unavailable with the Lifesciences API. Is there any documentation or anything that helps explain?

FreshAirTonight commented 4 months ago

> It's unlikely we'll get that to work in google-cls-v2 for two reasons:
>
> 1. The Lifesciences API (google-cls-v2 provider) is going away in favor of the Batch API (google-batch provider).
>
> 2. I don't think there's anything dsub can do to support additional machine types. That sounds like something the Lifesciences API controls.
>
> I'm actually not familiar with the Nvidia L4 being unavailable with the Lifesciences API. Is there any documentation or anything that helps explain?

I tried the Nvidia L4 with the Lifesciences API, and here is the error message I got:

`Error: validating pipeline: unsupported accelerator: "nvidia-l4"`

But I understand that there is not much motivation to make any change to the expiring API. Thank you for your explanation.

lm-jkominek commented 4 months ago

@FreshAirTonight, I ran into this recently as well. @wnojopra is 100% correct that the Lifesciences API is deprecated and only has about a year before EOL, so we should all be moving to Batch anyway. I don't think there is anything that can be done within dsub to work around it if the underlying API doesn't support it.

The API deprecation date was in July last year, which was barely 4 months after the L4s were released (March) and only 2 months after the L4s and their dedicated G2 accelerator-optimized VMs reached general availability on GCP. I couldn't find any specific docs on them being supported or not, but I think it is fair to assume that support was simply never added to the API, since they already knew it would be going away anyway...

wnojopra commented 3 months ago

Hi @FreshAirTonight

We released v0.4.12 yesterday, which includes support for mounting GCS buckets with the google-batch provider.

When you get the chance, can you verify if it resolves your issues?
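For anyone landing here, a minimal sketch of a `--mount` submission with the google-batch provider (the project, region, and bucket names below are placeholders, not values from this thread; the mounted bucket's path is exposed to the task via the `MYBUCKET` environment variable):

```shell
# Hypothetical example: mount a GCS bucket read-only into the task container
# and list its contents. Requires a GCP project with the Batch API enabled.
dsub \
  --provider google-batch \
  --project my-project \
  --regions us-central1 \
  --logging gs://my-bucket/logs/ \
  --mount MYBUCKET=gs://my-bucket \
  --command 'ls "${MYBUCKET}"'
```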

FreshAirTonight commented 3 months ago

> We released v0.4.12 yesterday, which includes support for mounting GCS buckets with the google-batch provider.
>
> When you get the chance, can you verify if it resolves your issues?

@wnojopra Thank you very much for this release! It has addressed the main issues I had. The mount option works, and I can access the L4 machine types with Google Batch.

I noticed two minor issues in my tests:

(1) dstat and ddel threw error messages like the following:

```
  File "/home/${username}/anaconda3/lib/python3.9/site-packages/dsub/providers/google_batch_operations.py", line 93, in get_create_time
    return _pad_timestamps(op.create_time.rfc3339())
AttributeError: rfc3339
```

There seems to be an issue with the timestamp formatting.

(2) On the Google Batch web interface, "Memory per task" and "Cores per task" show only "1.95 GB" and "2 vCPU", even though the machine type has a much higher specification. This happens regardless of whether I use "MIN_CORES=8 MIN_RAM=32" in my job submission.

(screenshot: gcp_batch)
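For what it's worth, a defensive sketch of the kind of fallback that would sidestep the AttributeError, assuming `create_time` sometimes arrives as a plain datetime rather than a `DatetimeWithNanoseconds` (`to_rfc3339` is a hypothetical helper for illustration, not dsub code):

```python
import datetime

def to_rfc3339(ts):
    # DatetimeWithNanoseconds (from proto-plus / google-api-core) exposes an
    # rfc3339() method; a plain datetime does not, which is what raises
    # "AttributeError: rfc3339".
    if hasattr(ts, "rfc3339"):
        return ts.rfc3339()
    # Fallback: format a timezone-aware datetime as RFC 3339 with a 'Z' suffix.
    return ts.astimezone(datetime.timezone.utc).isoformat().replace("+00:00", "Z")

print(to_rfc3339(datetime.datetime(2024, 1, 2, 3, 4, 5, tzinfo=datetime.timezone.utc)))
# → 2024-01-02T03:04:05Z
```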

wnojopra commented 3 months ago

That's great, thanks @FreshAirTonight ! On your issues:

1) I've been trying to reproduce this one but haven't seen it so far. What I have been able to figure out is that the create_time field is a proto.datetime_helpers.DatetimeWithNanoseconds, which has an rfc3339 method. But I'm not sure which dependency this comes from. Could you please run pip list and show me the output? I'd be interested to see what versions you're running and how they differ from mine. In particular, these 3 might be the culprits (these are the versions I have):

```
$ pip list | grep proto
googleapis-common-protos 1.63.0
proto-plus               1.23.0
protobuf                 4.25.3
```

2) I noticed the exact same thing, and raised the issue with the Batch API team.

They say the per-task resource requirements are treated as intention, which Batch uses to calculate how many tasks could fit into a VM. **But tasks are free to use all resources once they are on the VM.**

I was able to confirm this with my own testing. I submitted a task with n2-standard-4 machine type, checked /proc/meminfo, and saw ~15 GiB of memory available. 
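If pip list is ambiguous (e.g. several environments or a shadowed package copy), a stdlib-only sketch for checking the installed versions from inside the same interpreter that runs dsub:

```python
from importlib import metadata

# Report the versions of the three suspect packages in the current environment.
for pkg in ("protobuf", "proto-plus", "googleapis-common-protos"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```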

FreshAirTonight commented 3 months ago

@wnojopra I got the same versions on the three packages you mentioned:

```
$ pip list | grep proto
googleapis-common-protos      1.63.0
proto-plus                    1.23.0
protobuf                      4.25.3
```

I realized that the issue was caused by another version of protobuf. I have purged that package and now the issue is gone.

lm-jkominek commented 3 months ago

> 2. I noticed the exact same thing, and raised the issue with the Batch API team.
>
>    They say the per-task resource requirements are treated as intention, which Batch uses to calculate how many tasks could fit into a VM. **But tasks are free to use all resources once they are on the VM.**
>
>    I was able to confirm this with my own testing. I submitted a task with an n2-standard-4 machine type, checked /proc/meminfo, and saw ~15 GiB of memory available.

@wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but Batch will actually honor the per-job resource requirements on the backend?

FreshAirTonight commented 3 months ago

> 2. I noticed the exact same thing, and raised the issue with the Batch API team.
>
>    They say the per-task resource requirements are treated as intention, which Batch uses to calculate how many tasks could fit into a VM. **But tasks are free to use all resources once they are on the VM.**
>
>    I was able to confirm this with my own testing. I submitted a task with an n2-standard-4 machine type, checked /proc/meminfo, and saw ~15 GiB of memory available.

> @wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but Batch will actually honor the per-job resource requirements on the backend?

I can't say for all use cases, but this was true in my case with the G2 machine types. My jobs would have failed if only 1.95 GB of memory (the figure shown in the Batch web interface) had actually been available to them. Frankly, the Google Batch web interface needs some improvement to avoid confusing its users.

wnojopra commented 3 months ago

> I realized that the issue was caused by another version of protobuf. I have purged that package and now the issue is gone.

That's great to hear! I'll close off this issue now. I'll also check to see if I can require a specific version of protobuf for the next release.

> @wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but Batch will actually honor the per-job resource requirements on the backend?

dsub currently submits one Batch Job for each dsub task. And the Batch team has confirmed with me that tasks are free to use all resources once they are on the VM. So I believe the answer to your question is effectively, yes.
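For anyone who wants to double-check this on their own jobs (assuming a Linux guest image), something like the following inside the task's `--command` shows the memory actually visible to the task, regardless of what the UI reports per task:

```shell
# Print the total memory visible to the task; on an n2-standard-4 this should
# show the full VM allocation (~15 GiB usable), not the per-task UI figure.
grep MemTotal /proc/meminfo
```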

wnojopra commented 3 months ago

One last comment here for others who may have issues with google-batch and protobuf: I was able to reproduce an issue with protobuf 3.18.0. With 3.19.0, things seem to be working fine again.

I do see that many newer versions have been released since then. In the next release of dsub, we will require 3.19.0 <= protobuf < 5.26.0.