dchaley / deepcell-imaging

Tools & guidance to scale DeepCell imaging on Google Cloud Batch

Speed up cloud storage interaction #248

Closed dchaley closed 2 weeks ago

dchaley commented 3 weeks ago

We're currently using smart_open (which uses the Python library) for file transfer to and from cloud storage.

Unfortunately it seems to be very slow: for a 1.2 GB file, it takes 77s to write it:

2024-06-20 18:58:20.708 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz
2024-06-20 18:58:20.708 PDT
Saved output in 76.97 s

and 27s to read it back:

2024-06-20 19:01:57.229 PDT
Loading raw predictions
2024-06-20 19:01:57.229 PDT
Loaded raw predictions in 27.0 s

When we test in cloud shell, in europe-west4-a no less, it's ~6s to download using gcloud storage cp:

$ gcloud storage cp gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz .
Copying gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz to file://./raw_predictions.npz
  Completed files 1/1 | 1.2GiB/1.2GiB | 232.1MiB/s                                                              

Average throughput: 208.1MiB/s

and 13s to write:

Copying file://raw_predictions.npz to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions-2.npz
  Completed files 26/1 | 1.2GiB/1.2GiB | 112.0MiB/s                                                             

Average throughput: 95.4MiB/s

Measurement difference:

| Method | python (central <-> central) | gcloud storage (central <-> europe) | Delta |
| --- | --- | --- | --- |
| Read | 27s | 6s | 21s (78%) |
| Write | 77s | 13s | 64s (83%) |
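
The Delta column is the time saved relative to the Python timing. A quick sketch of that arithmetic, plus the effective throughput it implies, assuming the file is exactly 1.2 GiB (the logs only give the rounded size):

```python
# Timings in seconds, taken from the measurements above.
python_s = {"read": 27.0, "write": 77.0}
gcloud_s = {"read": 6.0, "write": 13.0}

SIZE_MIB = 1.2 * 1024  # assumption: the file is exactly 1.2 GiB


def delta(op):
    """Return (seconds saved, percent saved) relative to the Python timing."""
    saved = python_s[op] - gcloud_s[op]
    return saved, 100 * saved / python_s[op]


def throughput_mib_s(seconds):
    """Effective throughput in MiB/s for moving the whole file."""
    return SIZE_MIB / seconds


for op in ("read", "write"):
    saved, pct = delta(op)
    print(f"{op}: {saved:.0f}s saved ({pct:.0f}%), "
          f"python path at {throughput_mib_s(python_s[op]):.0f} MiB/s")
```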

Conclusion: rework the scripts to NOT use smart_open (which uses the Python library).

Instead, for reads:

  1. Download remote file to disk by calling out to gcloud storage cp.
  2. Load into memory.
  3. Delete the file.

For writes:

  1. Write memory to disk file.
  2. Upload by calling out to gcloud storage cp.
  3. Delete disk file.
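
The two procedures above could be sketched roughly as follows. This is a minimal sketch, not the repo's actual code: the helper names (`gcloud_cp`, `read_remote`, `write_remote`) and the `load`/`dump` callbacks are hypothetical, and the `run` parameter exists only so the subprocess call can be swapped out in tests.

```python
import os
import subprocess
import tempfile


def gcloud_cp(src, dst, run=subprocess.run):
    """Copy src to dst with `gcloud storage cp`; raises if the command fails."""
    run(["gcloud", "storage", "cp", src, dst], check=True)


def read_remote(uri, load, run=subprocess.run):
    """Download uri to a temp file, load it into memory, then delete the file."""
    # The temp directory (and the downloaded file) is removed when the
    # with block exits, which covers step 3.
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(uri))
        gcloud_cp(uri, local, run=run)  # step 1: download to disk
        with open(local, "rb") as f:
            return load(f)  # step 2: load into memory


def write_remote(uri, dump, run=subprocess.run):
    """Write data to a temp file via dump(f), upload it, then delete the file."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(uri))
        with open(local, "wb") as f:
            dump(f)  # step 1: write memory to disk
        gcloud_cp(local, uri, run=run)  # step 2: upload
    # step 3: the temp directory has been deleted at this point
```

Note that `load` has to fully materialize the data before returning, since the file is gone afterwards. With numpy, for example, `np.load` on an `.npz` reads arrays lazily, so the callback would need to pull the arrays out itself.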

dchaley commented 2 weeks ago

Failed: need to add gcloud to docker image.

FileNotFoundError: [Errno 2] No such file or directory: 'gcloud'
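
A small guard like the one below would surface this at startup with a clearer message than the bare `FileNotFoundError` from `subprocess`. It's a sketch; `require_executable` is a hypothetical helper name, not something in the repo.

```python
import shutil


def require_executable(name):
    """Return the full path to `name`, or raise with a hint if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' not found on PATH; is it installed in the container image?"
        )
    return path
```

Calling `require_executable("gcloud")` once before any transfers would fail fast instead of partway through a job.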

dchaley commented 2 weeks ago

Simply installing gcloud from apt didn't work:

TypeError: Descriptors cannot be created directly.
2024-06-25 00:36:23.081 PDT
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
2024-06-25 00:36:23.081 PDT
If you cannot immediately regenerate your protos, some other possible workarounds are:
2024-06-25 00:36:23.081 PDT
 1. Downgrade the protobuf package to 3.20.x or lower.
2024-06-25 00:36:23.081 PDT
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
2024-06-25 00:36:23.081 PDT
2024-06-25 00:36:23.081 PDT
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Consider following this method: https://github.com/tonymet/gcloud-lite/blob/master/Dockerfile

Looks like we can install gcloud a bit more selectively, and force it to rely on system python vs using its own distribution.
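
One way to point gcloud at a specific interpreter is the Cloud SDK's `CLOUDSDK_PYTHON` environment variable. A minimal sketch, assuming gcloud is invoked from Python via `subprocess` and that `/usr/bin/python3` is the system interpreter in the image. Both are assumptions, the helper name `gcloud_env` is hypothetical, and this has not been verified to fix the protobuf error above.

```python
import os


def gcloud_env(python_path="/usr/bin/python3", base=None):
    """Return a copy of the environment that tells gcloud which Python to use."""
    env = dict(os.environ if base is None else base)
    env["CLOUDSDK_PYTHON"] = python_path  # read by the gcloud launcher script
    return env
```

This would be passed as `subprocess.run([...], env=gcloud_env(), check=True)`, so the setting applies only to the gcloud child process and not to the job's own Python environment.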

dchaley commented 2 weeks ago

The earlier comparison was apples to oranges. It included compression in the Python API timing, whereas the gcloud storage cp timing was just moving the files.

It turns out that numpy's compression is quite slow. Using pigz, a parallel implementation of gzip, is a LOT faster.
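
Roughly, the idea is to write the file uncompressed and then compress it as a separate step. A sketch of that compression step, assuming the data was already saved to disk without compression (e.g. with `np.savez` rather than `np.savez_compressed`; the `.npz.gz` name in the AFTER log suggests this, but it is an assumption). The `compress_file` helper and its stdlib fallback are illustrative, not the repo's code.

```python
import gzip
import os
import shutil
import subprocess


def compress_file(path):
    """Compress path to path + '.gz', removing the original; returns the new path."""
    if shutil.which("pigz"):
        # pigz replaces the input file with <path>.gz, like gzip does;
        # -f overwrites an existing .gz from a previous run.
        subprocess.run(["pigz", "-f", path], check=True)
    else:
        # Single-threaded fallback so the sketch still runs without pigz.
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
    return path + ".gz"
```

pigz output is ordinary gzip, so the reading side can decompress with pigz, gzip, or Python's `gzip` module.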

I'll do proper benchmarking in a post; for now I want to close this out, as performance has improved a bunch.

BEFORE

2024-06-20 18:58:20.708 PDT
Loading preprocessed image
2024-06-20 18:58:20.708 PDT
Loaded preprocessed image in 9.18 s
2024-06-20 18:58:20.708 PDT
Running prediction
2024-06-20 18:58:20.708 PDT
Ran prediction in 102.89 s; success: True
2024-06-20 18:58:20.708 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz
2024-06-20 18:58:20.708 PDT
Saved output in 76.97 s
2024-06-20 19:01:57.229 PDT
Loading raw predictions
2024-06-20 19:01:57.229 PDT
Loaded raw predictions in 27.0 s

AFTER

2024-06-26 21:48:33.427 PDT
Loading preprocessed image
2024-06-26 21:48:33.427 PDT
Loaded preprocessed image in 9.07 s
2024-06-26 21:48:33.427 PDT
Running prediction
2024-06-26 21:48:33.427 PDT
Ran prediction in 102.57 s; success: True
2024-06-26 21:48:33.427 PDT
Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/j3b0a3762-b972-47da-be23-36fee6afda07/raw_predictions.npz.gz
2024-06-26 21:48:33.427 PDT
Saved output in 23.34 s
2024-06-26 21:51:59.625 PDT
Loading raw predictions
2024-06-26 21:51:59.625 PDT
Loaded raw predictions in 16.94 s

For ~1.2GB of raw predictions (~140M px)

| Measure | Before | After |
| --- | --- | --- |
| Load preprocessed image | 9.18 s | 9.07 s |
| Save raw predictions | 76.97 s | 23.34 s |
| Load raw predictions | 27 s | 16.94 s |

dchaley commented 2 weeks ago

Done for now.