brandyn opened 8 years ago
Example patch (two files) which (always!) keeps compressed files compressed, with a .gz extension:
*** copy_helper.py.orig 2016-08-13 15:30:16.946067339 -0700
--- copy_helper.py 2016-08-13 15:30:28.642134159 -0700
***************
*** 2521,2527 ****
os.unlink(file_name)
raise
! if need_to_unzip or server_gzip:
# Log that we're uncompressing if the file is big enough that
# decompressing would make it look like the transfer "stalled" at the end.
if bytes_transferred > TEN_MIB:
--- 2521,2527 ----
os.unlink(file_name)
raise
! if (need_to_unzip or server_gzip) and not final_file_name.endswith(".gz"): # -bjw
# Log that we're uncompressing if the file is big enough that
# decompressing would make it look like the transfer "stalled" at the end.
if bytes_transferred > TEN_MIB:
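The first hunk's effect can be sketched as a standalone predicate (the function name is hypothetical; need_to_unzip, server_gzip, and final_file_name are the variables used in copy_helper.py):

```python
def should_decompress(need_to_unzip, server_gzip, final_file_name):
    # Mirrors the patched condition: only decompress when the final
    # destination file is not already being written with a .gz suffix.
    return (need_to_unzip or server_gzip) and not final_file_name.endswith(".gz")
```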
*** cp.py.orig 2016-08-13 15:24:30.443979379 -0700
--- cp.py 2016-08-13 15:26:54.568877017 -0700
***************
*** 48,53 ****
--- 48,55 ----
from gslib.util import NO_MAX
from gslib.util import RemoveCRLFFromString
from gslib.util import StdinIterator
+ from gslib.util import ObjectIsGzipEncoded # bjw
+ from gslib.storage_url import StorageUrlFromString # bjw
_SYNOPSIS = """
gsutil cp [OPTION]... src_url dst_url
***************
*** 856,861 ****
--- 858,868 ----
else:
src_obj_metadata = None
+ # bjw:
+ if src_obj_metadata and self.exp_dst_url.IsFileUrl() and ObjectIsGzipEncoded(src_obj_metadata):
+ if not dst_url.url_string.endswith(".gz"):
+ dst_url = StorageUrlFromString(dst_url.url_string + ".gz")
+
elapsed_time = bytes_transferred = 0
try:
if copy_helper_opts.use_manifest:
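The second hunk's renaming step reduces to simple string handling; a minimal sketch on plain strings (StorageUrlFromString is the real gslib helper, but this hypothetical standalone version shows just the rename):

```python
def gz_local_name(url_string):
    # Append .gz to the local destination unless it already carries
    # that suffix, matching the patched block in cp.py.
    if url_string.endswith(".gz"):
        return url_string
    return url_string + ".gz"
```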
Not saying this is the way/place to do this, but the two fairly small changes above do "fix" the problem for me (I am now able to cp -r -n my entire dev bucket, and it skips all the files I already have, including compressed ones, which are stored locally with a .gz extension). Obviously it should at least be behind a flag (I almost wonder if -Z couldn't be used to mean this when downloading, since that flag is currently only considered during uploads).
Note that the above fix will do the wrong thing if a .gz file is double-compressed, so it's not exactly an elegant solution.
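One way to detect that double-compression case at download time would be to peek at the decompressed payload for the gzip magic bytes. A minimal standard-library sketch (the helper name is hypothetical, not part of gsutil):

```python
import gzip


def payload_is_gzip(path):
    # If a gzip file's decompressed payload itself begins with the
    # gzip magic bytes (0x1f 0x8b), the file is double-compressed.
    with gzip.open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```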
What are the content-type and content-encoding of objects in this bucket?
We can probably get rid of the size check as a heuristic since we always use temporary files now (so the final file will never be in a partial state).
As for rsync, it doesn't work well with content-encoding: gzip files. I understand that you don't have control over the encoding of GooglePlay developer reports, but prior discussions around rsync and transcoding have identified a lot of sharp edges. So I think cp -n is the right solution for your use case.
This problem still exists, even when copying with cp -n. It makes using the gsutil suite very difficult for any files that have been uploaded with gsutil -z.
A typical, trivial use case:
Attempt to cp -r -n or rsync a GooglePlay developer reports bucket to local disk. (This is a bucket created and maintained by GooglePlay, so there is no option to change content storage/encoding.)
Any compressed files will be re-copied every time, despite cp -n or rsync use.
In the cp -n case, I am guessing it is due to this line in copy_helper.py:
which notices that the (uncompressed) destination is not the same size as the (compressed) source, and so assumes a partial prior download.
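The skip decision being described can be sketched as follows (a hedged reconstruction of the heuristic, not gsutil's actual code):

```python
import os


def noclobber_would_skip(local_path, remote_object_size):
    # cp -n skips a download only when a local file exists and its size
    # equals the remote object's (compressed) size; a locally decompressed
    # copy almost never matches, so it is re-fetched on every run.
    return (os.path.exists(local_path)
            and os.path.getsize(local_path) == remote_object_size)
```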
Possible solution: provide a flag which automatically adds a .gz extension to the target name and does not decompress the target file locally. This might cleanly extend the existing no-clobber logic to compressed files.
[Note also that the current logic technically has a (rare) bug: if a partial download happens to decompress to the same size as the full compressed file, it will not retry, due to a false no-clobber match.]