fog / fog-google

Fog for Google Cloud Platform
MIT License
98 stars 146 forks source link

[storage] File `#body` can download the file twice? #541

Closed 10io closed 3 years ago

10io commented 3 years ago

Hello! 👋

I've been using this adapter to access files on GCP Storage and hit an issue. If I head the file and then I access #body, the file seems to be downloaded twice. This is ok for small files but can quickly become a problem for bigger files.

Here is how to reproduce this:

storage = ::Fog::Storage.new(credentials) # `credentials` is the standard hash with the parameters to connect to Google Storage
directory = storage.directories.new(key: directory) # `directory` is the bucket name
f = directory.files.head(path) # `path` is the path for the file within the bucket
f.body

Here is the log (truncated) from google-api-client when f.body is executed:

Sending HTTP get https://storage.googleapis.com/storage/[object_path]
200
#<HTTP::Message:0x00007fd79d55d170 >
Success - #<Google::Apis::StorageV1::Object:0x00007fd7ac088ff0 >

Success - #<File:[temporary_folder]/fog-google-storage-temp20210720-69390-1uveobl>

Sending HTTP get https://storage.googleapis.com/storage/[object_path]
200
#<HTTP::Message:0x00007fd79d514330 >
Success - #<Google::Apis::StorageV1::Object:0x00007fd79d504980 >

Success - #<File:[temporary_folder]/fog-google-storage-temp20210720-69390-zrp5w8>

I see two temporary files being created. I put some additional logs around those files and I can confirm that both have the same size and the same content. To me, the file really seems to be downloaded twice. 😿 In addition, those Sending HTTP get lines seems to indicate that the API is hit twice.

From my code inspection, the problem seems to come from https://github.com/fog/fog-google/blob/98a00e1bbace13a65b42f1b83c1911ac01ce71c0/lib/fog/storage/google_json/models/file.rb#L48 If I get this correctly, collection.get(identity) and file.body both download the file. Is this expected?

I don't fully grasp that line of code: I don't get why we need collection.get(identity). I assume that it has something to do with a Fog::Storage::GoogleJSON::File coming from a metadata request vs a Fog::Storage::GoogleJSON::File coming from a "full" get (with the alt=media query parameter).

stanhu commented 3 years ago

Is a missing HEAD request the actual source of the duplicate downloads? https://github.com/fog/fog-google/issues/512

stanhu commented 3 years ago

To answer my own question, @10io looks correct. #512 does not appear to be the issue, and I also see a double request here. This implementation of body seems confusing.

stanhu commented 3 years ago

I don't fully grasp that line of code: I don't get why we need collection.get(identity). I assume that it has something to do with a Fog::Storage::GoogleJSON::File coming from a metadata request vs a Fog::Storage::GoogleJSON::File coming from a "full" get (with the alt=media query parameter).

I think this happens because you can get a Fog::Storage::GoogleJSON::File without the body because it can come from the list bucket API (https://cloud.google.com/storage/docs/listing-objects#rest-list-objects). This appears to happen if you call directories.files.to_a. The collection.get(identity) is a lazy load for the full attributes of the object.

Take a look at https://github.com/fog/fog-google/pull/545.

10io commented 3 years ago

Thanks @stanhu for the PR! 🙏

I had the exact same question as you: why do we need the last_modified field?

To me, either the body is loaded or not depending if the current file is a lightweight one (metadata only) or not. If not, simply load the body. I felt that I was missing a piece with that last_modified field.

From what I see, fog-aws is using a similar logic: https://github.com/fog/fog-aws/blob/d7ec997c7b02b2f433828ebf33679671c2e3005c/lib/fog/aws/models/storage/file.rb#L107-L117. The last_modified field is used. Was it copied over here? 🤔

stanhu commented 3 years ago

@10io It's probably copied from there. Could you file an issue/PR for fog-aws?

10io commented 2 years ago

@10io It's probably copied from there. Could you file an issue/PR for fog-aws?

@stanhu I apologize for the late response.

I definitely will, I'm planning to do it this week.

Thanks! 🙏