Azure / blobxfer

Azure Storage transfer tool and data movement library
MIT License
151 stars, 38 forks

Download/synccopy rename option doesn't work if remote source path is a prefix of another existing path #119

Closed: HansBrende closed 4 years ago

HansBrende commented 4 years ago

Problem Description

In Azure Blob Storage I have 2 files: mycontainer/mydir/file and mycontainer/mydir/file.txt

Downloading

When I download the first file, i.e., with the remote source path set to mycontainer/mydir/file and the rename option set to True, the downloaded file actually contains the contents of mydir/file.txt, not mydir/file. This is because you are doing a list files operation in which every file whose name starts with the prefix mydir/file is listed as a source file to download. Since rename is set to True, they all download to the same local destination path, and the last one downloaded (in this case mydir/file.txt) overwrites all the others.
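To make the failure mode concrete, here is a minimal, self-contained sketch in plain Python. It does not call blobxfer: `list_by_prefix` and `download_with_rename` are hypothetical helpers that only imitate prefix listing followed by a rename to one fixed local path, and the sorted processing order is an assumption made for the illustration.

```python
def list_by_prefix(blob_names, prefix):
    """Imitate a prefix listing: every name starting with `prefix` matches."""
    return [name for name in sorted(blob_names) if name.startswith(prefix)]


def download_with_rename(blobs, prefix, local_path):
    """Write every matching blob to the SAME local path (rename semantics).

    `blobs` maps blob name -> content. The returned dict stands in for the
    local filesystem; later writes overwrite earlier ones.
    """
    local_files = {}
    for name in list_by_prefix(blobs, prefix):
        local_files[local_path] = blobs[name]
    return local_files


blobs = {
    "mydir/file": b"contents of file",
    "mydir/file.txt": b"contents of file.txt",
}
matched = list_by_prefix(blobs, "mydir/file")
result = download_with_rename(blobs, "mydir/file", "/tmp/out")
```

Both blobs match the prefix, so the single local path ends up holding whichever blob was written last, which in this sketch is mydir/file.txt.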

Synccopy

When I synccopy the first file to a different location, the exact same problem occurs, except in this case your error handling is smart enough at least to realize that there are multiple remote sources with the same derived remote destination path, so it raises a RuntimeError: duplicate destination entity detected: mystorageaccount.blob.core.windows.net/mycontainer/mydir2/file.

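WORKAROUND: I can successfully work around this problem in both cases by adding the following include parameter:

# Import path assumed from the blobxfer Python API; adjust if your version differs.
from blobxfer.api import AzureSourcePath

def create_source_path(remote_src_path, my_storage_account):
    asp = AzureSourcePath()
    asp.add_path_with_storage_account(remote_src_path, my_storage_account)
    # Include only the exact blob path (remote path minus the container name).
    asp.add_includes([remote_src_path.lstrip('/').split('/', 1)[1]])
    return asp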
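For readers unsure what the include expression evaluates to, the sketch below works through it with the paths from this issue. It assumes include filters behave like fnmatch-style patterns applied to the blob path inside the container; `include_pattern` and `apply_include` are hypothetical helpers for illustration, not blobxfer functions.

```python
import fnmatch


def include_pattern(remote_src_path):
    """Drop the leading slash and the container name, keeping the blob path."""
    return remote_src_path.lstrip('/').split('/', 1)[1]


def apply_include(blob_names, pattern):
    """Keep only names matching the pattern (case-sensitive, fnmatch-style)."""
    return [name for name in blob_names if fnmatch.fnmatchcase(name, pattern)]


pattern = include_pattern("mycontainer/mydir/file")
kept = apply_include(["mydir/file", "mydir/file.txt"], pattern)
```

Because the pattern contains no wildcard, it matches only the exact blob name, which filters out mydir/file.txt even though it shares the prefix.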

Azure blobxfer parameters output

N/A: I am using the python API directly

Steps to Reproduce

  1. Create 2 files in Azure storage, where the path of the first is a prefix of the path of the second
  2. Do a download or synccopy operation on the first file, setting the rename option to true

Expected Results

Expected results are that the FIRST file (the one specified as the remote source file) will be downloaded or synccopied successfully. That is, download and synccopy should behave exactly like the upload command, which works correctly under the same scenario.

Actual Results

In the case of download, the second file is downloaded to the specified path instead of the first. In the case of synccopy, a RuntimeError is raised.

alfpark commented 4 years ago

Thanks for filing the issue. This limitation is documented here: https://github.com/Azure/blobxfer/blob/master/docs/99-current-limitations.md

HansBrende commented 4 years ago

@alfpark Thanks for the link. Even though the prefix-matching behavior is "by design", I would still consider the download behavior a bug. With rename = True, the second file downloaded will overwrite the first file downloaded. That is very surprising behavior, and it is not documented in the current-limitations link.

Correct behavior should be at the very least to throw an error (as with the synccopy behavior) since the destination path will be the same for all source paths when rename = True, IMO.
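A rough sketch of the kind of check being suggested, written in plain Python rather than against blobxfer internals: `check_unique_destinations` is a hypothetical helper, and the error message merely mirrors the wording of the synccopy error quoted earlier in this issue.

```python
def check_unique_destinations(transfers):
    """Raise if two sources map to the same destination path.

    `transfers` is an iterable of (source, destination) pairs. Returns a
    dict of destination -> source when every destination is unique.
    """
    seen = {}
    for source, destination in transfers:
        if destination in seen:
            raise RuntimeError(
                "duplicate destination entity detected: {} "
                "(sources: {} and {})".format(
                    destination, seen[destination], source
                )
            )
        seen[destination] = source
    return seen


# With rename = True, both prefix matches map to one local path.
transfers = [
    ("mydir/file", "/tmp/out"),
    ("mydir/file.txt", "/tmp/out"),
]
try:
    check_unique_destinations(transfers)
    error = None
except RuntimeError as exc:
    error = str(exc)
```

Running a check like this before any bytes are written would turn the silent overwrite into an explicit failure, matching what synccopy already does.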