Speed up cloud storage interaction

dchaley commented 3 weeks ago

We're currently using the Python library (via smart_open) for file transfer.

Unfortunately it seems to be very slow. For a 1.2 GB file, it takes 77s to write:

2024-06-20 18:58:20.708 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz
2024-06-20 18:58:20.708 PDT
Saved output in 76.97 s

and 27s to read it back:

2024-06-20 19:01:57.229 PDT
Loading raw predictions
2024-06-20 19:01:57.229 PDT
Loaded raw predictions in 27.0 s

When we test in Cloud Shell (in europe-west4-a, no less), it takes ~6s to download using gcloud storage cp:

$ gcloud storage cp gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz .
Copying gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz to file://./raw_predictions.npz
  Completed files 1/1 | 1.2GiB/1.2GiB | 232.1MiB/s                                                              

Average throughput: 208.1MiB/s

and 13s to write:

Copying file://raw_predictions.npz to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions-2.npz
  Completed files 26/1 | 1.2GiB/1.2GiB | 112.0MiB/s                                                             

Average throughput: 95.4MiB/s

Measurement difference:

Operation	python (central <-> central)	gcloud storage (central <-> europe)	Delta (time saved)
Read	27s	6s	21s (78%)
Write	77s	13s	64s (83%)
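
As a quick check of the arithmetic in the table, here is a small sketch. The function names are made up for illustration, and the 1.2 GiB size is taken from the gcloud output above.

```python
def time_saved(slow_s: float, fast_s: float) -> tuple[float, float]:
    """Return (seconds saved, percent saved) going from slow_s to fast_s."""
    delta = slow_s - fast_s
    return delta, 100.0 * delta / slow_s


def throughput_mib_s(size_gib: float, seconds: float) -> float:
    """Effective throughput in MiB/s for size_gib GiB moved in `seconds`."""
    return size_gib * 1024 / seconds


if __name__ == "__main__":
    # Timings from the table: (operation, python seconds, gcloud seconds).
    for label, slow, fast in [("Read", 27, 6), ("Write", 77, 13)]:
        delta, pct = time_saved(slow, fast)
        print(f"{label}: {delta:.0f}s saved ({pct:.0f}%), "
              f"python effective rate ~{throughput_mib_s(1.2, slow):.0f} MiB/s")
```

The effective rate for the Python path includes everything inside the timed step, not only the network transfer.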

Conclusion: rework the scripts to NOT use smart_open (which uses the Python library).

Instead, for reads:

1. Download remote file to disk by calling out to gcloud storage cp.
2. Load into memory.
3. Delete the file.

For writes:

1. Write memory to disk file.
2. Upload by calling out to gcloud storage cp.
3. Delete disk file.
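
The read and write steps above could be sketched roughly as follows. This is a minimal sketch, not the code that ended up in the repo: `read_remote`, `write_remote`, and the injectable `run` parameter are hypothetical names, and it assumes `gcloud` is on the PATH and already authenticated.

```python
import os
import subprocess
import tempfile
from typing import Callable, Sequence

# Hypothetical runner type: anything that executes a command list.
Runner = Callable[[Sequence[str]], object]


def _run_checked(cmd: Sequence[str]) -> object:
    # check=True raises CalledProcessError if gcloud exits non-zero.
    return subprocess.run(list(cmd), check=True)


def read_remote(uri: str, load: Callable[[str], object],
                run: Runner = _run_checked):
    """Download `uri` to a temp file, load it into memory, delete the file."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(uri))
        run(["gcloud", "storage", "cp", uri, local])
        # `load` must read everything before returning: the temp
        # directory (and the file in it) is removed when this block exits.
        return load(local)


def write_remote(uri: str, save: Callable[[str], None],
                 run: Runner = _run_checked) -> None:
    """Write to a temp file via `save(path)`, upload it, delete the file."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(uri))
        save(local)
        run(["gcloud", "storage", "cp", local, uri])
```

Passing the runner as a parameter is only there so the flow can be exercised without gcloud installed. Note that for an .npz archive, `load` would need to pull the arrays out before returning, since `np.load` reads archive members lazily.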

dchaley commented 2 weeks ago

Failed: need to add gcloud to docker image.

FileNotFoundError: [Errno 2] No such file or directory: 'gcloud'
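
One way to surface this earlier, sketched under the assumption that the job can check its environment at startup, is to look for the binary before doing any work. `require_tool` is a hypothetical helper name.

```python
import shutil


def require_tool(name: str = "gcloud") -> str:
    """Return the absolute path of `name`, or raise a descriptive error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' not found on PATH; it needs to be installed in the image"
        )
    return path
```

Calling `require_tool()` at the top of the script would fail before the prediction step runs, instead of after it.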

dchaley commented 2 weeks ago

Simply installing gcloud from apt didn't work:

TypeError: Descriptors cannot be created directly.
2024-06-25 00:36:23.081 PDT
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
2024-06-25 00:36:23.081 PDT
If you cannot immediately regenerate your protos, some other possible workarounds are:
2024-06-25 00:36:23.081 PDT
 1. Downgrade the protobuf package to 3.20.x or lower.
2024-06-25 00:36:23.081 PDT
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
2024-06-25 00:36:23.081 PDT
2024-06-25 00:36:23.081 PDT
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Consider following this method: https://github.com/tonymet/gcloud-lite/blob/master/Dockerfile

Looks like we can install gcloud a bit more selectively, and force it to rely on system python vs using its own distribution.

dchaley commented 2 weeks ago

The earlier comparison was apples to oranges. It included compression in the Python API timing, whereas the gcloud storage cp timing was just moving the files.
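
To keep the stages apart in future measurements, each one can be timed on its own. A minimal sketch, with `timed` as a hypothetical helper and sleeps standing in for the real compression and upload steps:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with timed("compress", timings):
        time.sleep(0.01)  # placeholder for the compression step
    with timed("transfer", timings):
        time.sleep(0.01)  # placeholder for the upload step
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.2f} s")
```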

It turns out that numpy's compression is quite slow. Using pigz, a parallel implementation of gzip, is a LOT faster.
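
A rough sketch of the compression step, assuming the array data has already been written to disk uncompressed. `gzip_file` is a hypothetical name, and the fallback to Python's gzip module is only there so the sketch runs where pigz is not installed.

```python
import gzip
import os
import shutil
import subprocess


def gzip_file(path: str) -> str:
    """Compress `path` to `path + '.gz'`, remove the original, return the new path."""
    if shutil.which("pigz"):
        # pigz replaces the input file with path.gz and uses multiple
        # cores by default; -f overwrites an existing output file.
        subprocess.run(["pigz", "-f", path], check=True)
    else:
        # Fallback: Python's single-threaded gzip module.
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
    return path + ".gz"
```

Decompression on the read side would be the mirror image (`pigz -d`), followed by the usual load.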

I'll do proper benchmarking in a separate post. For now I want to close this out, since performance has improved a bunch.

BEFORE

2024-06-20 18:58:20.708 PDT
Loading preprocessed image
2024-06-20 18:58:20.708 PDT
Loaded preprocessed image in 9.18 s
2024-06-20 18:58:20.708 PDT
Running prediction
2024-06-20 18:58:20.708 PDT
Ran prediction in 102.89 s; success: True
2024-06-20 18:58:20.708 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz
2024-06-20 18:58:20.708 PDT
Saved output in 76.97 s
2024-06-20 19:01:57.229 PDT
Loading raw predictions
2024-06-20 19:01:57.229 PDT
Loaded raw predictions in 27.0 s

AFTER

2024-06-26 21:48:33.427 PDT
Loading preprocessed image
2024-06-26 21:48:33.427 PDT
Loaded preprocessed image in 9.07 s
2024-06-26 21:48:33.427 PDT
Running prediction
2024-06-26 21:48:33.427 PDT
Ran prediction in 102.57 s; success: True
2024-06-26 21:48:33.427 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/j3b0a3762-b972-47da-be23-36fee6afda07/raw_predictions.npz.gz
2024-06-26 21:48:33.427 PDT
Saved output in 23.34 s
2024-06-20 19:01:57.229 PDT
Loading raw predictions
2024-06-26 21:51:59.625 PDT
Loaded raw predictions in 16.94 s

For ~1.2GB of raw predictions (~140M px):

Measure	Before	After
Load preprocessed image	9.18 s	9.07 s
Save raw predictions	76.97 s	23.34 s
Load raw predictions	27 s	16.94 s

dchaley commented 2 weeks ago

Done for now.

dchaley / deepcell-imaging

Speed up cloud storage interaction #248