Failed: we need to add gcloud to the Docker image.

```
FileNotFoundError: [Errno 2] No such file or directory: 'gcloud'
```
Simply installing gcloud from apt didn't work:
```
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
```
Consider following this method: https://github.com/tonymet/gcloud-lite/blob/master/Dockerfile
Looks like we can install gcloud more selectively, and force it to rely on the system Python instead of its own bundled distribution.
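Roughly what that could look like, a sketch in the spirit of gcloud-lite rather than its exact Dockerfile. The base image, paths, and the unversioned tarball URL are assumptions; pin an SDK version in practice:

```dockerfile
FROM python:3.10-slim

# Point gcloud at the image's Python instead of letting it install its
# own bundled interpreter (assumes python3 lives here in this base image).
ENV CLOUDSDK_PYTHON=/usr/local/bin/python3

# Install from the release tarball rather than apt; the tarball's
# bin/gcloud runs in place without install.sh.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/* \
 && curl -fsSL https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz \
    | tar -xzf - -C /opt

ENV PATH="/opt/google-cloud-sdk/bin:${PATH}"
```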
The earlier comparison was apples to oranges. It included compression in the Python API timing, whereas the `gcloud storage cp` timing was just moving the files.
It turns out that numpy's compression is quite slow. Using pigz, a parallel implementation of gzip, is a LOT faster.
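Here's a minimal sketch of that save/load path, assuming the arrays fit in memory and `pigz` is on the PATH; the helper names are hypothetical, not the actual job code:

```python
import io
import subprocess

import numpy as np


def save_npz_pigz(arrays: dict, path: str) -> None:
    """Serialize uncompressed with np.savez, then compress via pigz.

    numpy's savez_compressed runs single-threaded zlib; pigz spreads
    the same deflate work across all cores.
    """
    buf = io.BytesIO()
    np.savez(buf, **arrays)  # uncompressed .npz bytes
    with open(path, "wb") as out:  # e.g. raw_predictions.npz.gz
        subprocess.run(["pigz", "-c"], input=buf.getvalue(), stdout=out, check=True)


def load_npz_pigz(path: str) -> dict:
    """Decompress with pigz, then hand the bytes to np.load."""
    raw = subprocess.run(["pigz", "-dc", path], capture_output=True, check=True).stdout
    with np.load(io.BytesIO(raw)) as f:
        return {k: f[k] for k in f.files}
```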
I'll do proper benchmarking in a post; for now I want to close this out, as performance has improved a bunch.
BEFORE
```
2024-06-20 18:58:20.708 PDT  Loading preprocessed image
2024-06-20 18:58:20.708 PDT  Loaded preprocessed image in 9.18 s
2024-06-20 18:58:20.708 PDT  Running prediction
2024-06-20 18:58:20.708 PDT  Ran prediction in 102.89 s; success: True
2024-06-20 18:58:20.708 PDT  Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/jb14640a2-7fe9-47fa-9fb9-c5bac1dd0f5f/raw_predictions.npz
2024-06-20 18:58:20.708 PDT  Saved output in 76.97 s
2024-06-20 19:01:57.229 PDT  Loading raw predictions
2024-06-20 19:01:57.229 PDT  Loaded raw predictions in 27.0 s
```
AFTER
```
2024-06-26 21:48:33.427 PDT  Loading preprocessed image
2024-06-26 21:48:33.427 PDT  Loaded preprocessed image in 9.07 s
2024-06-26 21:48:33.427 PDT  Running prediction
2024-06-26 21:48:33.427 PDT  Ran prediction in 102.57 s; success: True
2024-06-26 21:48:33.427 PDT  Saving raw predictions output to gs://deepcell-batch-jobs_us-central1/job-runs/j3b0a3762-b972-47da-be23-36fee6afda07/raw_predictions.npz.gz
2024-06-26 21:48:33.427 PDT  Saved output in 23.34 s
2024-06-20 19:01:57.229 PDT  Loading raw predictions
2024-06-26 21:51:59.625 PDT  Loaded raw predictions in 16.94 s
```
For ~1.2 GB of raw predictions (~140M px):

| Measure | Before | After |
|---|---|---|
| Load preprocessed image | 9.18 s | 9.07 s |
| Save raw predictions | 76.97 s | 23.34 s |
| Load raw predictions | 27.0 s | 16.94 s |
Done for now.
We're currently using the Python library for file transfer.
Unfortunately it seems to be very slow: for a 1.2 GB file, it takes 77 s to write, and 27 s to read it back.

When we test in Cloud Shell, in europe-west4-a no less, it's ~6 s to download using `gcloud storage cp`, and 13 s to write.

Conclusion: rework the scripts to NOT use smart_open (which uses the Python library). Instead, use `gcloud storage cp` for both reads and writes.
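A minimal sketch of what that rework could look like, shelling out to `gcloud storage cp` instead of going through smart_open. The helper names (`download_npz`, `upload_npz`) and the temp-file round trip are assumptions for illustration:

```python
import subprocess
import tempfile

import numpy as np


def download_npz(gcs_uri: str) -> dict:
    """Fetch an .npz from GCS via `gcloud storage cp`, then load it."""
    with tempfile.NamedTemporaryFile(suffix=".npz") as tmp:
        subprocess.run(["gcloud", "storage", "cp", gcs_uri, tmp.name], check=True)
        with np.load(tmp.name) as f:
            return {k: f[k] for k in f.files}


def upload_npz(arrays: dict, gcs_uri: str) -> None:
    """Write arrays to a local .npz, then push it with `gcloud storage cp`."""
    with tempfile.NamedTemporaryFile(suffix=".npz") as tmp:
        np.savez(tmp.name, **arrays)
        subprocess.run(["gcloud", "storage", "cp", tmp.name, gcs_uri], check=True)
```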