drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License

Lots of MD5 logs when downloading files from GCS #371

Closed aosh-ab7e closed 1 year ago

aosh-ab7e commented 1 year ago

Code:

from cloudpathlib import CloudPath

CloudPath("gcs://<path redacted>").download_to("data")

Error (for every file in the cloud directory):

No MD5 checksum was returned from the service while downloading https://storage.googleapis.com/download/storage/v1/b/<redacted>
(which happens for composite objects), so client-side content integrity
checking is not being performed.

pjbull commented 1 year ago

First, let's check if this is specific to cloudpathlib. What happens if you use the Google SDK directly? Something like:

from google.cloud.storage import Client as StorageClient

client = StorageClient()
bucket = client.bucket("my-bucket")
blob = bucket.get_blob("my-key")
blob.download_to_filename("my-file")

Do you see the same messages when you do this?

Second, can you provide more context on the log message? Is this actually an error or just a warning/info log statement? Does the file exist locally on disk after and does it have the expected content?

From what I can tell with some googling, this seems to be just an info message that GCS spits out if there aren't serverside MD5s available for comparison, and you should be able to quiet these by changing your logging settings.

aosh-ab7e commented 1 year ago

> First, let's check if this is specific to cloudpathlib.

I'll test this in the near future, thanks.

> Second, can you provide more context on the log message?

This is an info-level log that comes from _get_expected_checksum in google.resumable_media._helpers. (I've fixed the issue title.)

> Does the file exist locally on disk after and does it have the expected content?

Yes, it works fine, but a lot of these log messages are emitted.

I'm sorry, I'm a total noob and can't tell which is wrong: whether Google implemented something incorrectly, or whether cloudpathlib is by any chance using the GCS SDK incorrectly.

pjbull commented 1 year ago

We don't touch any loggers/logging at all, so I don't think it is from cloudpathlib.

I suspect it is some combination of bucket settings and your local configuration variables for Google Cloud Storage.

Here's an example of how you can change the levels of different loggers, which may help: https://betterstack.com/community/questions/how-to-disable-logging-from-python-request-library/
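
As a rough sketch of that approach applied to this case: the snippet below raises the level of a single logger so that info-level records from it are dropped. It assumes the message is emitted by a logger named after the module mentioned earlier in this thread (google.resumable_media._helpers); that name may differ between versions of the Google libraries, so treat it as an assumption and check the logger name in your own log output. The helper quiet_logger is hypothetical, not part of cloudpathlib or the Google SDK.

```python
import logging

# Assumption: the "No MD5 checksum was returned" message comes from a logger
# named after the module reported in this thread. Verify the name in your
# own environment; it may vary by library version.
NOISY_LOGGER = "google.resumable_media._helpers"


def quiet_logger(name: str = NOISY_LOGGER, level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of one named logger so lower-level records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Call this once, before starting the download.
quiet = quiet_logger()
```

Setting the level on the specific logger, rather than on the root logger, leaves info-level output from the rest of the application untouched.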

I'm going to close this for now, but if you identify a cloudpathlib issue, you can reopen.