GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.
Apache License 2.0

Issue with use_gcloud_storage shim in cp local to bucket upload #1712

Closed suchita-mehta-clootrack closed 1 year ago

suchita-mehta-clootrack commented 1 year ago

This issue shows up in a local-to-bucket upload where the destination directory does not yet exist in the bucket.

Consider a source directory with the following contents:

file_1.csv
folder1/file_2.csv

Using the following gsutil command:

gsutil -m cp -R //* gs:///non_existent_folder

the source hierarchy is not preserved under non_existent_folder. The files are uploaded flattened:

gs:///non_existent_folder/file_1.csv
gs:///non_existent_folder/file_2.csv

Running the same command with the use_gcloud_storage=True shim:

gsutil -o GSUtil:use_gcloud_storage=True -m cp -R //* gs:///non_existent_folder

uploads the files with the hierarchy intact:

gs:///non_existent_folder/file_1.csv
gs:///non_existent_folder/folder1/file_2.csv

Ideally these behavior differences should not exist, since the shim is supposed to provide an easy transition from gsutil commands to gcloud storage commands. Has anyone faced this issue before, or is there a fix for it?

thomasmaclean commented 1 year ago

I wasn't able to reproduce this behavior:

$ tree test
test
├── file_1.csv
└── folder1
    └── file_2.csv

2 directories, 2 files
$ gsutil -m cp -R test/* gs://my-bucket/test0001
Copying file://test/file_1.csv [Content-Type=text/csv]...                      
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
\ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done  40.0 KiB/s ETA 00:00:00           
Operation completed over 2 objects/23.7 KiB.
$ gsutil ls gs://my-bucket/test0001
gs://my-bucket/test0001/file_1.csv
gs://my-bucket/test0001/folder1/
$ gsutil -o GSUtil:use_gcloud_storage=True -m cp -R  test/* gs://my-bucket/test0002
Copying file://test/file_1.csv to gs://my-bucket/test0002/file_1.csv
Copying file://test/folder1/file_2.csv to gs://my-bucket/test0002/folder1/file_2.csv
  Completed files 2/2 | 23.7kiB/23.7kiB

Average throughput: 4.3MiB/s
$ gsutil ls gs://my-bucket/test0002
gs://my-bucket/test0002/file_1.csv
gs://my-bucket/test0002/folder1/

You can see the behavior is identical in both cases. The source layout in your original report is a bit ambiguous: it's not clear whether file_2.csv is inside folder1. It appears to have been inside folder1 in the shim example but not in the first one.

suchita-mehta-clootrack commented 1 year ago

The tree structure you have assumed is correct. Can you please confirm that the directory test0001 didn't exist in the bucket before you performed this operation?

thomasmaclean commented 1 year ago

I can confirm that was true in both cases, yes. Was there another detail I missed?

suchita-mehta-clootrack commented 1 year ago

Can you try the same with quotes around your source and destination?

gsutil -m cp -R 'test/*' 'gs://my-bucket/test0001'

Here are some details I may have left out:

This may not be relevant, but I am using a Nearline storage bucket in us-central1 (Iowa).

Also, the actual operation where I hit this issue targeted a destination three levels deep in the bucket, so your destination would become gs://my-bucket/test0001/test0002/test0003, where test0003 does not exist and test0001/test0002 already exists in the bucket.

thomasmaclean commented 1 year ago

For now, I'm just looking at the gsutil behavior (without the shim), since the first example, where files were flattened out of their directories, seemed a bit strange.

$ gsutil -m cp -R test/* gs://my-bucket/test0001/test0002
Copying file://test/file_1.csv [Content-Type=text/csv]...
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done
Operation completed over 2 objects/23.7 KiB.
$ gsutil ls gs://my-bucket/test0001/test0002
gs://my-bucket/test0001/test0002/file_1.csv
gs://my-bucket/test0001/test0002/folder1/
$ gsutil -m cp -R test/* gs://my-bucket/test0001/test0002/test0003
Copying file://test/file_1.csv [Content-Type=text/csv]...
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done
Operation completed over 2 objects/23.7 KiB.
$ gsutil ls gs://my-bucket/test0001/test0002/test0003
gs://my-bucket/test0001/test0002/test0003/file_1.csv
gs://my-bucket/test0001/test0002/test0003/folder1/

With the quotes there's no difference:

$ gsutil -m cp -R 'test/*' 'gs://my-bucket/test0001/test0002/test0004'
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
Copying file://test/file_1.csv [Content-Type=text/csv]...
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done
Operation completed over 2 objects/23.7 KiB.
$ gsutil ls gs://my-bucket/test0001/test0002/test0004
gs://my-bucket/test0001/test0002/test0004/file_1.csv
gs://my-bucket/test0001/test0002/test0004/folder1/

It's worth mentioning that double star (**) has the effect of flattening out directory structures when copying nested files, and that behavior works with and without the shim:

$ gsutil -m cp -R 'test/**' 'gs://my-bucket/test0001/test0002/test0005'
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
Copying file://test/file_1.csv [Content-Type=text/csv]...
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done
Operation completed over 2 objects/23.7 KiB. 
$ gsutil ls gs://my-bucket/test0001/test0002/test0005
gs://my-bucket/test0001/test0002/test0005/file_1.csv
gs://my-bucket/test0001/test0002/test0005/file_2.csv
$ gsutil -m cp -R test/** gs://my-bucket/test0001/test0002/test0006
Copying file://test/file_1.csv [Content-Type=text/csv]...
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...  
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done
Operation completed over 2 objects/23.7 KiB. 
$ gsutil ls gs://my-bucket/test0001/test0002/test0006
gs://my-bucket/test0001/test0002/test0006/file_1.csv
gs://my-bucket/test0001/test0002/test0006/folder1/
$ gsutil -o GSUtil:use_gcloud_storage=True -m cp -R 'test/**' 'gs://my-bucket/test0001/test0002/test0007'
Copying file://test/folder1/file_2.csv to gs://my-bucket/test0001/test0002/test0007/file_2.csv
Copying file://test/file_1.csv to gs://my-bucket/test0001/test0002/test0007/file_1.csv
  Completed files 2/2 | 23.7kiB/23.7kiB

Average throughput: 10.2MiB/s
$ gsutil ls gs://my-bucket/test0001/test0002/test0007
gs://my-bucket/test0001/test0002/test0007/file_1.csv
gs://my-bucket/test0001/test0002/test0007/file_2.csv
$ gsutil -o GSUtil:use_gcloud_storage=True -m cp -R test/** gs://my-bucket/test0001/test0002/test0008
Copying file://test/file_1.csv to gs://my-bucket/test0001/test0002/test0008/file_1.csv
Copying file://test/folder1/file_2.csv to gs://my-bucket/test0001/test0002/test0008/folder1/file_2.csv
  Completed files 2/2 | 23.7kiB/23.7kiB

Average throughput: 822.4kiB/s
$ gsutil ls gs://my-bucket/test0001/test0002/test0008
gs://my-bucket/test0001/test0002/test0008/file_1.csv
gs://my-bucket/test0001/test0002/test0008/folder1/

Note that, as demonstrated, you need to quote ** for it to work. Otherwise the shell expands the pattern before gsutil ever sees it, and the result matches the single * case.
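
The effect of quoting can be checked locally without touching a bucket. The following is a minimal sketch, assuming a POSIX-style shell with bash's globstar option left at its default (off); the scratch directory is hypothetical and only mirrors the test/ layout used in this thread.

```shell
# Recreate the layout from the thread in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p test/folder1
touch test/file_1.csv test/folder1/file_2.csv

# Unquoted: the shell expands the pattern before the command runs.
# With globstar off, ** matches the same entries as a single *.
unquoted=$(printf '%s\n' test/**)

# Quoted: the pattern is passed through literally, so a tool such as
# gsutil would receive it and apply its own wildcard handling.
quoted=$(printf '%s\n' 'test/**')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

In the unquoted case the command receives the top-level entries of test/ as separate arguments, which is consistent with the test0006 and test0008 listings above keeping folder1/ intact.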

suchita-mehta-clootrack commented 1 year ago

So the command only has one trailing *. I tried the same operation on another bucket and could not replicate this behavior. The one difference I noticed is that the storage class of that bucket was Standard, while the bucket where the flat hierarchy was created is Nearline.

thomasmaclean commented 1 year ago

Interesting... I've never seen a difference based on storage type, but I can try that as well.

$ gsutil mb -c nearline gs://my-nearline-test
Creating gs://my-nearline-test/...
$ gsutil cp test.txt gs://my-nearline-test/test0001/test0002/
Copying file://test.txt [Content-Type=text/plain]...
- [1 files][ 11.8 KiB/ 11.8 KiB]                                                
Operation completed over 1 objects/11.8 KiB.
$ gsutil -m cp -R 'test/*' 'gs://my-nearline-test/test0001/test0002/test0003'
Copying file://test/folder1/file_2.csv [Content-Type=text/csv]...
Copying file://test/file_1.csv [Content-Type=text/csv]...
/ [2/2 files][ 23.7 KiB/ 23.7 KiB] 100% Done                                    
Operation completed over 2 objects/23.7 KiB.
$ gsutil ls 'gs://my-nearline-test/test0001/test0002/test0003'
gs://my-nearline-test/test0001/test0002/test0003/file_1.csv
gs://my-nearline-test/test0001/test0002/test0003/folder1/

There doesn't seem to be any change in behavior. The shim behavior is identical as well:

$ gsutil -o GSUtil:use_gcloud_storage=True -m cp -R 'test/*' 'gs://my-nearline-test/test0001/test0002/test0004'
Copying file://test/folder1/file_2.csv to gs://my-nearline-test/test0001/test0002/test0004/folder1/file_2.csv
Copying file://test/file_1.csv to gs://my-nearline-test/test0001/test0002/test0004/file_1.csv
  Completed files 2/2 | 23.7kiB/23.7kiB

Average throughput: 12.6MiB/s
$ gsutil ls 'gs://my-nearline-test/test0001/test0002/test0004'
gs://my-nearline-test/test0001/test0002/test0004/file_1.csv
gs://my-nearline-test/test0001/test0002/test0004/folder1/

suchita-mehta-clootrack commented 1 year ago

I left out another important detail: I am running this command on gcloud version 212.0.0. Of course the shim can't be run there; this is just the behaviour of the gsutil command.

thomasmaclean commented 1 year ago

Can you give me the output of gsutil version -l? If gsutil is ignoring the shim setting, that could just mean you're running into an edge case of gsutil behavior rather than this having anything to do with the shim.
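
When comparing installs, it can help to pull only the version fields out of that report. This is a minimal sketch, assuming the output uses `key: value` lines like the ones quoted in this thread; `report_versions` is a hypothetical helper name, and the sample values are the ones reported in this issue.

```shell
# Hypothetical helper: read `gsutil version -l` style output on stdin and
# print only the gsutil, boto, and python version lines.
report_versions() {
  grep -E '^(gsutil|boto|python) version:'
}

# Sample input using the values reported in this thread.
sample='gsutil version: 4.33
boto version: 2.48.0
python version: 2.7.17 (default, Mar 8 2023, 18:40:28) [GCC 7.5.0]
OS: Linux 5.4.0-1106-gcp'

versions=$(printf '%s\n' "$sample" | report_versions)
printf '%s\n' "$versions"

# Against a real install (assumes gsutil is on PATH):
#   gsutil version -l | report_versions
```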

suchita-mehta-clootrack commented 1 year ago

gsutil version: 4.33
checksum: 4d4290f4916b57b9412f338cbed076b0 (OK)
boto version: 2.48.0
python version: 2.7.17 (default, Mar 8 2023, 18:40:28) [GCC 7.5.0]
OS: Linux 5.4.0-1106-gcp
multiprocessing available: True
using cloud sdk: True
pass cloud sdk credentials to gsutil: False
config path(s): /etc/boto.cfg
gsutil path: /usr/lib/google-cloud-sdk/platform/gsutil/gsutil
compiled crcmod: True
installed via package manager: False
editable install: False

You are right, this may be an issue with the older gsutil version. Can you confirm whether you can replicate it on this gsutil version?

thomasmaclean commented 1 year ago

I think a better question is whether you can replicate the issue on the latest version. If you can, please reopen the ticket and let us know.