DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Use gcloud storage instead of gsutil #246

Open carbocation opened 2 years ago

carbocation commented 2 years ago

It seems that gcloud storage will be substantially faster than gsutil for localization/delocalization. It would make sense either to apply the shim or to transition to using gcloud storage in place of gsutil in dsub.
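
For reference, my understanding of the shim option is that it's enabled through the boto configuration so that existing gsutil calls delegate to gcloud storage under the hood; roughly like this (illustrative only, and the exact setting may depend on the SDK version):

  # ~/.boto (or whatever file BOTO_CONFIG points at)
  [GSUtil]
  use_gcloud_storage = True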

mbookman commented 2 years ago

Thanks for the pointer @carbocation! We will take a look.

carbocation commented 2 years ago

So far in my tests, gcloud storage has been a successful drop-in replacement for gsutil (including the various tasks like ls, cat, cp, and rm, as well as flags like -J and -n). The only occasionally tricky bit (other than making sure the host machine used to launch dsub is upgraded and can use gcloud storage) has been making sure that the Docker image has a recent enough version of the Google Cloud tools to do the same.
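
To make the "drop-in" part concrete, the substitutions I've been making are essentially of this shape (bucket and object names below are placeholders):

  $ gsutil ls gs://my-bucket/prefix/     ->  $ gcloud storage ls gs://my-bucket/prefix/
  $ gsutil cat gs://my-bucket/obj.txt    ->  $ gcloud storage cat gs://my-bucket/obj.txt
  $ gsutil cp gs://my-bucket/obj.txt .   ->  $ gcloud storage cp gs://my-bucket/obj.txt .
  $ gsutil rm gs://my-bucket/obj.txt     ->  $ gcloud storage rm gs://my-bucket/obj.txt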

mbookman commented 2 years ago

Overall, gcloud storage looks pretty good. The performance improvements are real and minimal code changes are needed to get them. That's pretty exciting.

That said, I have twice (in only a limited number of total tests) had downloads fail with errors like:

ERROR: Source hash fjoXWA== does not match destination hash ELZtmQ== for object ./NA12878.cg.bam_.gstmp.

I've filed a bug for this and will post back here what I learn.

FWIW, the test was to pull down 11 files (1.2 TB) to a GCE VM:

$ gcloud storage cp gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/* .
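
As an aside, one way to sanity-check a suspect download while the bug is being investigated should be to compare the cloud-side and local hashes by hand with gcloud storage hash (a sketch only; the paths below are placeholders, and I'm assuming gcloud storage hash accepts both cloud and local paths):

  $ gcloud storage hash gs://bucket/path/NA12878.cg.bam
  $ gcloud storage hash ./NA12878.cg.bam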

carbocation commented 2 years ago

Hearing about the error you encountered, I ran another test and got an "exciting" result after about 8,000 ~1 MB files (out of ~80,000) had been copied:

Pausing command execution:

This command requires the `gcloud-crc32c` component to be installed. Would you like to install the `gcloud-crc32c` component to continue command execution? (Y/n)?  
Copying gs://bucket/path/to/file.xml to file://./file.xml
Pausing command execution:

This command requires the `gcloud-crc32c` component to be installed. Would you like to install the `gcloud-crc32c` component to continue command execution? (Y/n)? 

I was not watching this happen, and it then proceeded:

For the latest full release notes, please visit:
  https://cloud.google.com/sdk/release_notes

Do you want to continue (Y/n)?  
╔════════════════════════════════════════════════════════════╗
╠═ Creating update staging area                             ═╣
⠏ Completed files 8357 | 6.5GiB | 99.2MiB/s                                                                                                                                                              

Your current Google Cloud CLI version is: 405.0.0
Installing components from version: 405.0.0

┌───────────────────────────────────────────────────┐
│        These components will be installed.        │
Copying gs://bucket/path/to/anotherfile.xml to file://./anotherfile.xml
├───────────────────────────────┬─────────┬─────────┤
│              Name             │ Version │   Size  │
├───────────────────────────────┼─────────┼─────────┤
│ Google Cloud CRC32C Hash Tool │   1.0.0 │ 1.2 MiB │
└───────────────────────────────┴─────────┴─────────┘

And ultimately it failed:

⠼WARNING: Post processing failed.  Run `gcloud info --show-log` to view the failures.

==> Start a new shell for the changes to take effect.

Update done!

And is hanging:

⠧ Completed files 8608 | 6.7GiB | 99.2MiB/s

(The leftmost character is a spinner that keeps animating.)

And this was a very exciting failure mode indeed, because gcloud is no longer installed at its usual location on $PATH:

$ gcloud info --show-log
bash: /home/james/applications/google-cloud-sdk/bin/gcloud: No such file or directory
james@host:/mnt/storage 
$ which gcloud

So I guess this is just to say that it may require special care (e.g., making sure the Google Cloud CRC32C Hash Tool is already installed) to avoid unexpected behavior...
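
One mitigation that seems likely to help (an assumption on my part; I haven't verified it end to end) is to install the component up front and non-interactively on whatever VM or image will be doing the copies, so the transfer never pauses on that prompt:

  # pre-install the CRC32C component so gcloud storage cp doesn't prompt mid-copy
  $ gcloud components install gcloud-crc32c --quiet

(This assumes the SDK was installed from the standalone installer; a distro-packaged gcloud manages components differently.)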

carbocation commented 2 years ago

After reinstalling gcloud so I could finally finish my download*, I was also able to get a couple of hash mismatches, though only in roughly 1 out of every ~25,000 files for me:

ERROR: Source hash kfmcAw== does not match destination hash ebdNqw== for object ./file.xml_.gstmp.

Maybe not quite ready for prime time.

mbookman commented 2 years ago

Hi @carbocation !

Wanted to give an update on this. The Cloud team was able to root cause the problem; it was a fairly straight-forward failure case that needed a retry. The report is that the fix is targeted for Cloud SDK 410.0.0. You can keep an eye out for releases here:

https://cloud.google.com/sdk/docs/release-notes

We'll give dsub integration another pass when we see that version drop.

-Matt

carbocation commented 2 years ago

[....] The Cloud team was able to root cause the problem; it was a fairly straight-forward failure case that needed a retry. The report is that the fix is targeted for Cloud SDK 410.0.0. [...]

Thanks for that update! 410.0.0 just came out and I don't see a mention of gcloud storage in the release notes, so I am guessing this fix didn't make it into 410. I'll keep my eyes peeled for when the fix eventually does make its way in.
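
In the meantime, the quick check I'm running on each machine is just:

  $ gcloud version            # shows the installed Google Cloud CLI version
  $ gcloud components update  # upgrades to the latest release once it's out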

mbookman commented 2 years ago

The report from Google engineering is that while they missed adding a release note update, the code fix is indeed in 410.0.0.