GoogleCloudPlatform / gsutil

A command line tool for interacting with cloud storage services.

Downloading (content type) compressed files breaks both cp -n and rsync #379

Open brandyn opened 8 years ago

brandyn commented 8 years ago

A common, trivial use case:

Attempt to cp -r -n or rsync a GooglePlay developer reports bucket to local disk. (This is a bucket created and maintained by GooglePlay, so there is no option to change content storage/encoding.)

Any compressed file is re-copied on every run, despite using cp -n or rsync.
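
For example (the bucket ID below is a placeholder; the real GooglePlay reports bucket has a pubsite_prod_rev_* style name):

  gsutil cp -r -n gs://pubsite_prod_rev_0123456789/ ./reports/
  # Run the same command again: objects stored uncompressed are skipped as
  # expected, but every gzip-encoded object is downloaded again.
  gsutil cp -r -n gs://pubsite_prod_rev_0123456789/ ./reports/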

In the cp -n case, I am guessing it is due to this check in copy_helper.py:

  # The local file may be a partial. Check the file sizes.
  if src_obj_size == os.path.getsize(dst_url.object_name):
    raise ItemExistsError()

which notices the (uncompressed) destination is not the same size as the (compressed) source, and so assumes a partial prior download.

Possible solution: provide a flag which automatically adds a .gz extension to the target name and does not decompress the target file locally. This might cleanly extend the existing no-clobber logic to compressed files.

[Note also that the current logic technically has a (rare) bug: if a partial download happens to decompress to the same size as the full compressed file, it will not be retried, due to a false no-clobber match.]

brandyn commented 8 years ago

Example patch (two files) which (always!) keeps compressed files compressed, with a .gz extension:

*** copy_helper.py.orig 2016-08-13 15:30:16.946067339 -0700
--- copy_helper.py      2016-08-13 15:30:28.642134159 -0700
***************
*** 2521,2527 ****
          os.unlink(file_name)
        raise

!   if need_to_unzip or server_gzip:
      # Log that we're uncompressing if the file is big enough that
      # decompressing would make it look like the transfer "stalled" at the end.
      if bytes_transferred > TEN_MIB:
--- 2521,2527 ----
          os.unlink(file_name)
        raise

!   if (need_to_unzip or server_gzip) and not final_file_name.endswith(".gz"):    # -bjw
      # Log that we're uncompressing if the file is big enough that
      # decompressing would make it look like the transfer "stalled" at the end.
      if bytes_transferred > TEN_MIB:

*** cp.py.orig  2016-08-13 15:24:30.443979379 -0700
--- cp.py       2016-08-13 15:26:54.568877017 -0700
***************
*** 48,53 ****
--- 48,55 ----
  from gslib.util import NO_MAX
  from gslib.util import RemoveCRLFFromString
  from gslib.util import StdinIterator
+ from gslib.util import ObjectIsGzipEncoded  # bjw
+ from gslib.storage_url import StorageUrlFromString # bjw

  _SYNOPSIS = """
    gsutil cp [OPTION]... src_url dst_url
***************
*** 856,861 ****
--- 858,868 ----
      else:
        src_obj_metadata = None

+     # bjw:
+     if src_obj_metadata and self.exp_dst_url.IsFileUrl() and ObjectIsGzipEncoded(src_obj_metadata):
+         if not dst_url.url_string.endswith(".gz"):
+             dst_url = StorageUrlFromString(dst_url.url_string + ".gz")
+ 
      elapsed_time = bytes_transferred = 0
      try:
        if copy_helper_opts.use_manifest:

Not saying this is the way or place to do this, but the two fairly small changes above do "fix" the problem for me (I am now able to cp -r -n my entire dev bucket and it skips all the files I already have, including compressed ones, which are stored locally with a .gz extension). Obviously it should at least be behind a flag (I almost wonder whether -Z couldn't be used to mean this when downloading, since that flag is currently only considered during uploads).

Note that the above fix will do the wrong thing if a .gz file is double-compressed, so it's not exactly an elegant solution.

thobrla commented 8 years ago

What are the content-type and content-encoding of objects in this bucket?
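
If it helps, one way to check (the object path below is just a placeholder) is gsutil stat, which prints the stored metadata for an object:

  gsutil stat gs://your-reports-bucket/some-report.csv
  # The output should include lines like:
  #   Content-Encoding:       gzip
  #   Content-Type:           text/csv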

We can probably get rid of the size check as a heuristic since we always use temporary files now (so the final file will never be in a partial state).

thobrla commented 8 years ago

As for rsync, it doesn't work well with content-encoding: gzip files. I understand that you don't have control over GooglePlay developer reports encodings, but prior discussions around rsync and transcoding have identified a lot of sharp edges. So I think cp -n is the right solution for your use case.

vincentwoo commented 2 years ago

This problem still exists, even when copying with cp -n. It makes using the gsutil suite very difficult for any files that have been uploaded with gsutil cp -z.
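
A minimal way to reproduce what I mean (bucket and file names are placeholders):

  gsutil cp -z txt notes.txt gs://my-bucket/notes.txt   # uploaded with Content-Encoding: gzip
  gsutil cp -n gs://my-bucket/notes.txt .               # first run: downloads and decompresses
  gsutil cp -n gs://my-bucket/notes.txt .               # second run: -n does not skip it, because
                                                        # the decompressed local size never matches
                                                        # the compressed object size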