Azure / azure-storage-azcopy

The new Azure Storage data transfer utility - AzCopy v10
MIT License

Sync option to mirror source -> target #1282

Closed: phrak closed this issue 3 years ago

phrak commented 3 years ago

Which version of the AzCopy was used?

10.7.0

Which platform are you using? (ex: Windows, Mac, Linux)

Windows

What command did you run?

azcopy sync [https://blob URL] [C:\LocalPath] --delete-destination

What problem was encountered?

Unable to truly sync a source -> target. Files modified in the destination are ignored without any notification: they are not overwritten, copied, updated, or logged as being different. The business impact is that we cannot ensure the target matches the source, nor can we produce a list of the files that differ.

How can we reproduce the problem in the simplest way?

  1. Upload any number/type of files to an empty Blob container.
  2. Create an empty target folder on a local file system.
  3. azcopy sync the blob to the local folder.
  4. Confirm that all blob files have been copied to the local folder.
  5. Modify a local file in any way. e.g. add a new line to a TXT file.
  6. Re-run the same azcopy sync command.
  7. Observe that no new/updated files have been copied.
  8. Observe that the modified TXT file has NOT been over-written.
  9. Open the azcopy log file and observe that it does not show any skipped/ignored files.
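The outcome of these steps can be modelled with a small sketch. It assumes that sync decides per file by comparing last-modified times and transfers only when the source is newer; that rule is inferred from the behaviour reported here, not taken from AzCopy's code, and `sync_would_transfer` is a hypothetical name.

```python
from datetime import datetime, timezone


def sync_would_transfer(source_lmt, dest_lmt):
    """Simplified, assumed model of the per-file sync decision.

    Transfer when the destination file is missing (dest_lmt is None)
    or when the source was modified more recently than the destination.
    """
    if dest_lmt is None:
        return True
    return source_lmt > dest_lmt


# Illustrative timestamps only: the blob was uploaded first, and the
# local copy was edited a day later (step 5).
uploaded = datetime(2020, 12, 1, 9, 0, tzinfo=timezone.utc)
edited_locally = datetime(2020, 12, 2, 9, 0, tzinfo=timezone.utc)

first_run = sync_would_transfer(uploaded, None)
second_run = sync_would_transfer(uploaded, edited_locally)
print("first run transfers:", first_run)
print("re-run after local edit transfers:", second_run)
```

Under this model the locally edited file always has the newer timestamp, so it is never selected for transfer, which matches steps 7 and 8.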

Have you found a mitigation/solution?

The only semi-viable workaround is to run an azcopy copy [source] [destination] --overwrite=true command to re-copy the entire source directory to the target. However, this overwrites all files, not just the ones that differ, which causes issues with very large file sets. The azcopy copy [source] [destination] --overwrite=ifSourceNewer switch is not viable because the destination is newer than the source.

Another potential workaround is to run azcopy sync to a safe staging location, then use robocopy to truly mirror the staging source -> final destination.
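For illustration, the second stage of that workaround (staging -> final mirror) could look roughly like the sketch below. It is a stand-in for what a mirroring tool does, not robocopy itself: `mirror_tree` is a hypothetical helper, files are compared by content, and edge cases such as a file and a folder sharing a name are not handled.

```python
import filecmp
import os
import shutil
import tempfile


def mirror_tree(source, destination):
    """Make destination match source: copy new/changed files, delete extras."""
    os.makedirs(destination, exist_ok=True)
    copied, deleted = [], []
    # Pass 1: copy files that are missing or whose content differs.
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        target_dir = os.path.normpath(os.path.join(destination, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            # shallow=False compares file contents, not just size/mtime.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                shutil.copy2(src, dst)
                copied.append(os.path.normpath(os.path.join(rel, name)))
    # Pass 2: delete anything in the destination that the source lacks.
    for root, dirs, files in os.walk(destination, topdown=False):
        rel = os.path.relpath(root, destination)
        src_dir = os.path.join(source, rel)
        for name in files:
            if not os.path.isfile(os.path.join(src_dir, name)):
                os.remove(os.path.join(root, name))
                deleted.append(os.path.normpath(os.path.join(rel, name)))
        for name in dirs:
            if not os.path.isdir(os.path.join(src_dir, name)):
                shutil.rmtree(os.path.join(root, name))
    return sorted(copied), sorted(deleted)


def _write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


# Demo on throwaway folders: one locally edited file, one stray file.
with tempfile.TemporaryDirectory() as tmp:
    staging = os.path.join(tmp, "staging")
    final = os.path.join(tmp, "final")
    os.makedirs(staging)
    os.makedirs(final)
    _write(os.path.join(staging, "report.txt"), "v1\n")
    _write(os.path.join(staging, "same.txt"), "same\n")
    _write(os.path.join(final, "report.txt"), "v1\nlocal edit\n")
    _write(os.path.join(final, "same.txt"), "same\n")
    _write(os.path.join(final, "stray.txt"), "x\n")
    copied, deleted = mirror_tree(staging, final)
    print("copied:", copied)
    print("deleted:", deleted)
```

Because the staging folder is only ever written by azcopy, its timestamps stay consistent with the blobs, and the content comparison happens entirely on local disk.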

The AzCopy sync wiki page describes this behaviour as being by design. Unfortunately, this leaves us with very few options to properly sync a folder pair.

zezha-msft commented 3 years ago

Thanks @phrak for reaching out! We are aware of this scenario and it is being tracked here. It does sound like a valid scenario and we hope to support it soon.

zezha-msft commented 3 years ago

To clarify, the blob service does not support setting an exact last modified time (LMT); the time when the blob is created or modified automatically becomes its LMT. To avoid re-transferring the same files in a download scenario, we could first preserve the source LMT locally in a previous run (a capability that still needs to be added), then compare the LMTs during sync and transfer a source file if the corresponding local LMT is different. However, in a copy scenario (blob -> blob), this does not work, since we cannot preserve the LMTs of blobs and so cannot rely on them as an accurate indicator of a file's "version".
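A minimal sketch of the download-side idea, under stated assumptions: the source LMT is preserved by stamping it onto the local file's modification time, and a later run transfers the file whenever the two values no longer match. The helper names and timestamps are hypothetical, and this is not how AzCopy is implemented.

```python
import os
import tempfile


def stamp_local_file(local_path, source_lmt):
    """After a download, set the local mtime to the blob's LMT (epoch seconds)."""
    os.utime(local_path, (source_lmt, source_lmt))


def needs_download(local_path, source_lmt):
    """Transfer when the local file is missing or its mtime differs from the source LMT."""
    if not os.path.exists(local_path):
        return True
    # Compare whole seconds to sidestep filesystem timestamp precision.
    return int(os.path.getmtime(local_path)) != int(source_lmt)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    blob_lmt = 1_600_000_000  # illustrative epoch seconds

    # Previous run: download the file and preserve the source LMT locally.
    with open(path, "w") as handle:
        handle.write("original\n")
    stamp_local_file(path, blob_lmt)
    unchanged = needs_download(path, blob_lmt)

    # The blob is modified on the service, so its LMT moves forward.
    source_changed = needs_download(path, blob_lmt + 500)

    # The local copy is edited, so its mtime becomes "now".
    with open(path, "a") as handle:
        handle.write("local edit\n")
    locally_edited = needs_download(path, blob_lmt)

    print(unchanged, source_changed, locally_edited)
```

Testing for inequality rather than "source is newer" is what lets a local edit trigger a re-download. It has no blob -> blob equivalent because the destination blob's LMT cannot be set.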

Thus, the best approach (that we can think of right now) for the mirroring scenario is to simply copy over all the source files, and delete all extra files at the destination.

Please let us know if you have any other thoughts about this.

phrak commented 3 years ago

Hi @zezha-msft, thanks for replying. I hope a solution can be found soon too.

An option to force-overwrite the destination only if it's different from the source would be very useful to avoid re-copying identical files again.

Another thought - Would it be possible to use the MD5 hash property to compare the objects and overwrite if different?

In the meantime, it sounds like the best workaround is to use azcopy sync to sync the blob to a safe, dedicated staging location, then use robocopy to properly mirror the staging directory source -> final destination.

zezha-msft commented 3 years ago

Hi @phrak, the MD5 value can be stored on the service side, but it is not validated against the data. That makes it hard to use MD5 values as an indication of file content, since they can easily become out of date if the user changes the content without updating the MD5 value. More commonly, files may have been uploaded without a stored MD5 at all.
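To make the caveat concrete, here is a hedged sketch of such a comparison. It assumes the stored value is the base64-encoded MD5 digest, as in the Content-MD5 header convention, and that it has already been fetched into `stored_md5_b64`. The function names are hypothetical. A missing value is reported as "unknown" rather than treated as a match.

```python
import base64
import hashlib
import os
import tempfile


def local_md5_b64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def compare_with_stored_md5(path, stored_md5_b64):
    """Return 'unknown', 'same', or 'different'.

    'unknown' covers blobs uploaded without a stored MD5. A 'same' or
    'different' result is only as trustworthy as the stored value, which
    the service does not validate against the blob's data.
    """
    if not stored_md5_b64:
        return "unknown"
    return "same" if local_md5_b64(path) == stored_md5_b64 else "different"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    stored = local_md5_b64(path)  # pretend this came from the blob's properties

    matches = compare_with_stored_md5(path, stored)
    no_md5 = compare_with_stored_md5(path, None)

    with open(path, "ab") as handle:
        handle.write(b" world")  # local edit
    after_edit = compare_with_stored_md5(path, stored)

    print(matches, no_md5, after_edit)
```

A sync built on this check would still need a fallback for the "unknown" case, such as always transferring or falling back to timestamps.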

Sorry for the inconvenience. We are aware of this problem, and will look into solving it with a good UX.