fog / fog-google

Fog for Google Cloud Platform
MIT License
99 stars 146 forks source link

GCS get_object fails with charcode conversion error #533

Closed mttkay closed 3 years ago

mttkay commented 3 years ago

Broken out of this discussion: https://github.com/fog/fog-google/pull/530/files#r645936386

With fog-google 1.14.0 we are seeing the following error when using the File.copy method and the target object is a binary file (in this case, a Java JAR file):

Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
/opt/gitlab/embedded/lib/ruby/2.7.0/delegate.rb:343:in `write'
/opt/gitlab/embedded/lib/ruby/2.7.0/delegate.rb:343:in `block in delegating_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/download.rb:97:in `block in execute_once'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1093:in `block in follow_redirect'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1250:in `block in do_get_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:894:in `read_body_length'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:587:in `get_body'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1247:in `do_get_block'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1019:in `block in do_request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1133:in `protect_keep_alive_disconnected'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1014:in `do_request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:1104:in `follow_redirect'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:854:in `request'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/httpclient-2.8.3/lib/httpclient.rb:743:in `get'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/download.rb:77:in `execute_once'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:113:in `block (2 levels) in execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:61:in `block in retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `times'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:110:in `block in execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:61:in `block in retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `times'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/retriable-3.1.2/lib/retriable.rb:56:in `retriable'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/http_command.rb:102:in `execute'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/lib/google/apis/core/base_service.rb:366:in `execute_or_queue_command'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/google-api-client-0.50.0/generated/google/apis/storage_v1/service.rb:1914:in `get_object'
/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/fog-google-1.14.0/lib/fog/storage/google_json/requests/get_object.rb:58:in `get_object'

Apparently somewhere in the communication between client -> fog-google/json-storage -> google-api-ruby-client -> GCS there is a problem with how byte streams are interpreted.

Possibly related problem: googleapis/google-api-ruby-client#74

I believe what is happening here is that somewhere, that binary byte stream is being interpreted as a UTF-8 string, which it is not:

[13:47:19] ~/Downloads hexdump broken_file.jar | grep c4
0000000 4b50 0403 000a 0000 0000 4bc4 52d7 0000
0000050 494e 4546 5453 4d2e f346 cc4d 4ccb 2d4b
00000d0 000a 0000 0000 4bc4 52d7 0000 0000 0000
0000100 0000 c400 d74b 0052 0000 0000 0000 0000
0000140 0000 4bc4 52d7 0000 0000 0000 0000 0000
0000280 4cd6 d822 8d50 2b14 df43 6270 c42c df2a
0000320 02fc 4b50 0403 000a 0000 0008 4bc4 52d7
0000370 7265 6974 7365 7653 cd4f 2d4b 2c4a 4d49
00003e0 0314 000a 0000 0000 4bc4 52d7 0000 0000
0000460 c400 d74b 0052 0000 0000 0000 0000 0000
0000490 2f6e 4b50 0201 0314 000a 0000 0000 4bc4
00004e0 502f 014b 1402 0a03 0000 0000 c400 d74b
00005a0 4bc4 52d7 0b31 d1ba 0066 0000 006a 0000

Note how there are multiple sequences containing c4XX multi-byte sequences, which denote certain characters with diacritics in Unicode: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256

Request logs of the failing request:

= Request

GET /storage/v1/b/mk-omnibus-upload-test/o/lost_and_found%2F79%2F02%2F7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451%2Fpackages%2F1%2Ffiles%2F1%2Fmy-app-1.0.jar?alt=media&alt=media HTTP/1.1
User-Agent: fog/1.14.0 google-api-ruby-client/0.50.0 Linux/5.11.0-7614-generic (gzip)
X-Goog-Api-Client: gl-ruby/2.7.2 gdcl/0.50.0
authorization: Bearer <snip>
Accept: */*
Accept-Encoding: gzip,deflate
Date: Wed, 23 Jun 2021 13:23:13 GMT
Host: storage.googleapis.com

= Response

HTTP/1.1 200 OK
X-GUploader-UploadID: ABg5-Uz9vhhuXVa8qQ07HuqtS-6XzUCaj-APwTyXLpZnN7jFud0MU4YwmWmu0RAYERhUORO0dA1dbUCi78STRKoxV3Q
ETag: COTDl/jsrfECEAE=
X-Goog-Hash: crc32c=TqHrHA==,md5=ugIGHxsWpmeASfUTe3PX+A==
Content-Type: application/octet-stream
X-Goog-Generation: 1624454593896932
X-Goog-Metageneration: 1
X-Goog-Storage-Class: STANDARD
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Content-Disposition: attachment
Date: Wed, 23 Jun 2021 13:23:14 GMT
Vary: Origin
Vary: X-Origin
Content-Length: 1550
Server: UploadServer
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

PK
�K�R    META-INF/PK
K-*��ϳR0�3��r,J��,K-�V�+�$x���RKRSt�*��
                                      �3R|�R�����y��J3sJ���%9�I�E�yy � �]��l�]@��
y�x�PK
�K�RMETA-INF/maven/PK
�K�R!META-INF/maven/com.mycompany.app/PK
�K�R(META-INF/maven/com.mycompany.app/my-app/PK
�J�R��q%W
         /META-INF/maven/com.mycompany.app/my-app/pom.xml�Q�O�0��W���ڡ�H]b��
                                                                            �J�-e����H[6��n)�
                                                                                             ���;4�{���{��T�$��f9׷��9���p�%N���d
                                                                                                                                � ��S֚~�˲��]A��GOv�a�rm0O�! CONNECTION CLOSED
D, [2021-06-23T13:23:14.031026 #36838] DEBUG -- : Error - #<Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8>
mttkay commented 3 years ago

Looking at https://cloud.google.com/storage/docs/json_api/v1/objects/get#response, I noticed this paragraph:

By default, this responds with an object resource in the response body. If you provide the URL parameter alt=media, then it will respond with the object data in the response body.

And in the request URL, we indeed send ?alt=media (twice for some reason.)

So GCS will send back a binary file, and it looks like we then try to parse this as JSON?

mttkay commented 3 years ago

I verified this isn't happening in fog-google 1.13. Here is how that request looks like:

D, [2021-06-23T13:50:13.269963 #1476] DEBUG -- : Sending HTTP get https://storage.googleapis.com/storage/v1/b/mk-omnibus-upload-test/o/lost_and_found%2F79%2F02%2F7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451%2Fpackages%2F1%2Ffiles%2F1%2Fmy-app-1.0.jar?
D, [2021-06-23T13:50:13.323003 #1476] DEBUG -- : 200

It looks like request logs look completely different, not sure why.

mttkay commented 3 years ago

I just noticed that in 1.14 the copy_object implementation has changed w.r.t. option handling: https://github.com/fog/fog-google/pull/520/files

I wonder if it's related.

mttkay commented 3 years ago

I wonder if it's related.

Maybe not related. I tested with those changes reverted and the problem persists.

stanhu commented 3 years ago

@mttkay Could you try reverting e83c58efa1c7031ad82997607d739b672b9b3be1? I wonder if we just need to open Tempfile with binmode: true.

stanhu commented 3 years ago

Ok, I reproduced the problem. Indeed, we need to call buf.binmode to activate the binary stream.

stanhu commented 3 years ago

https://github.com/fog/fog-google/pull/534 solves this issue.

mttkay commented 3 years ago

534 solves this issue.

Yes I can confirm this fixes the issue. Thanks @stanhu !

geemus commented 3 years ago

Fix released in v1.15.0.

mttkay commented 3 years ago

Fix released in v1.15.0.

Thank you @geemus ! :bow: