Azure / azure-storage-azcopy

The new Azure Storage data transfer utility - AzCopy v10

azcopy copy 2+ million folders between regions bogs down #2218

Open WaitingForGuacamole opened 1 year ago

WaitingForGuacamole commented 1 year ago

Which version of the AzCopy was used?

10.18.1

Which platform are you using? (ex: Windows, Mac, Linux)

Windows 10 22H2

What command did you run?

$AZCOPY_SOURCE="mysourcestorageaccountURI and SAS"
$AZCOPY_DESTINATION="mydestinationstorageaccountURI and SAS"
./azcopy copy $AZCOPY_SOURCE $AZCOPY_DESTINATION --recursive --log-level=ERROR --check-length=false

What problem was encountered?

I'm copying a file share with 2.2 million TIFF images, each in its own folder. azcopy copy bogs down no matter what options I choose - whether I leave environment variables like AZCOPY_BUFFER_GB and AZCOPY_CONCURRENCY_VALUE at their defaults or change them, it does not seem to matter.

The scan finds all of the files, and it even copies tens of thousands of them successfully, but after a couple of hours the rate of progress slows to virtually nothing: a few hundred files per two-minute update. At that rate it will take a very long time to complete.

My scanning log does have a number of errors stating:

connectex: Only one usage of each socket address (protocol/network address/port) is normally permitted.

Despite these errors, it does still seem to find the files.

I'm copying from a Premium File Share in East US to a Premium File Share in West US. I'd use account replicas, but that's not supported for this kind of storage account. Each has a private endpoint, and the logs suggest they are being used from the IPs that are being resolved.
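
To sanity-check the private endpoints, something like this from the test VM should return the private IPs (the hostnames below are the placeholders from the commands further down; this is just how I'd verify the resolution myself):
Resolve-DnsName mysourceaccount.file.core.windows.net
Resolve-DnsName mydestinationaccount.file.core.windows.net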

Here's some of the console output (I hit return a few times so each update lands on its own line):

PS C:\Users\me\azcopy> $AZCOPY_SOURCE="https://mysourceaccount.file.core.windows.net/sharename/folderwith2millionsubfolders/?sv=2022-11-02&ss=f&srt=sco&sp=rwdlc&se=2023-05-19T19:39:01Z&st=2023-05-17T11:39:01Z&spr=https&sig=REDACTED"
PS C:\Users\me\azcopy> $AZCOPY_DESTINATION="https://mydestinationaccount.file.core.windows.net/sharename/folderwith2millionsubfolders/?sv=2022-11-02&ss=f&srt=sco&sp=rwdlc&se=2023-05-18T19:40:14Z&st=2023-05-17T11:40:14Z&spr=https&sig=REDACTED"
PS C:\Users\me\azcopy> ./azcopy copy $AZCOPY_SOURCE $AZCOPY_DESTINATION --recursive --log-level=ERROR --check-length=false
INFO: Please note: the preserve-permissions flag is set to false, thus AzCopy will not copy SMB ACLs between the source and destination. To learn more: https://aka.ms/AzCopyandAzureFiles.
INFO: Scanning...
INFO: Any empty folders will be processed, because source and destination both support folders. For the same reason, properties defined on folders will be processed

Job 4d107bbf-4ce5-2443-49cc-bb48be2dcd51 has started
Log file is located at: C:\Users\me\.azcopy\4d107bbf-4ce5-2443-49cc-bb48be2dcd51.log

INFO: Trying 4 concurrent connections (initial starting point)
INFO: Trying 16 concurrent connections (seeking optimum)
INFO: Reducing progress output frequency to 2m0s, because there are over 1000000 files
INFO: Trying 4 concurrent connections (backing off)
INFO: Trying 8 concurrent connections (seeking optimum)
INFO: Trying 4 concurrent connections (backing off)
INFO: Trying 5 concurrent connections (seeking optimum)
INFO: Trying 4 concurrent connections (at optimum)
INFO:
INFO: Automatic concurrency tuning completed.
INFO:
0.0 %, 55445 Done, 0 Failed, 2220036 Pending, 0 Skipped, 2275481 Total,
0.0 %, 55850 Done, 0 Failed, 2219631 Pending, 0 Skipped, 2275481 Total,
0.0 %, 56700 Done, 0 Failed, 2218781 Pending, 0 Skipped, 2275481 Total,
0.0 %, 57124 Done, 0 Failed, 2218357 Pending, 0 Skipped, 2275481 Total,
0.0 %, 57544 Done, 0 Failed, 2217937 Pending, 0 Skipped, 2275481 Total,

How can we reproduce the problem in the simplest way?

  1. Create a Premium File Share in a storage account in East US
  2. Create a Premium File Share in a storage account in West US
  3. (optional, maybe it's not relevant) Create a Private Endpoint for each account, in the same VNet/subnet. You'll need to stand up a private DNS zone and entry for them as well
  4. (see 3) Create a test Windows VM in a VNet that is peered to the Private Endpoint VNet (all the private endpoints for our applications are in their own environment-specific private endpoint VNet)
  5. (optional) Try increasing the provisioned storage in each share to 40000, to reduce the likelihood of throttling
  6. Create 2.25 million folders in the share in East US
  7. Put a 160KB TIFF image into each folder (a PowerShell sketch for steps 6 and 7 follows this list)
  8. Create a SAS token for each storage account, and get the SAS URI for each
  9. Run the above command, replacing the variables above with your SAS URI
  10. Watch it slowly fail.
  11. Watch your billing rapidly rise.
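
A minimal PowerShell sketch for the data setup in steps 6 and 7; it assumes the East US share is mounted as Z:\ and that seed.tif is any ~160 KB TIFF on local disk (both the drive letter and the file path are assumptions, not part of the original setup):

$seed = "C:\temp\seed.tif"                 # any ~160 KB TIFF (assumed path)
1..2250000 | ForEach-Object {
  $dir = "Z:\folder$_"                     # Z:\ = the mounted East US share (assumed)
  New-Item -ItemType Directory -Path $dir -Force | Out-Null
  Copy-Item $seed -Destination (Join-Path $dir "image.tif")
}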

Have you found a mitigation/solution?

No. I'm wondering if I should just mount the shares in Linux and rsync them. Then again, it would take a year and a day to enumerate all of those folders, so maybe there would be no difference.

I'm willing to try changing environment variables, but they're not particularly well documented. There is documentation, but some variables have defaults listed, some have a rationale for how they are calculated, and others just say they exist and to increase them (to what, I don't know) if necessary.
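
For reference, this is roughly how I've been setting them in PowerShell before a run (the values shown are examples only, not recommendations):

$env:AZCOPY_CONCURRENCY_VALUE = "AUTO"   # dynamic tuning, which the tool appears to do by default anyway; a fixed number also works
$env:AZCOPY_BUFFER_GB = "4"              # example value only
./azcopy copy $AZCOPY_SOURCE $AZCOPY_DESTINATION --recursive --log-level=ERROR --check-length=false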

adreed-msft commented 1 year ago

Hi there, @WaitingForGuacamole!

Thank you for reaching out.

Re: the concurrency value; we often recommend that users specify AUTO rather than any specific value when performance is a concern. It looks like that was already specified by default.

I would be curious to see the actual job log (not the scanning log) here. It looks like enumeration is completing, but the actual transfers slow down dramatically. This can happen for a wide variety of reasons, including exponential backoff, chunks getting stuck in some kind of waiting state, etc.
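
For reference, the job log lives next to the plan files (the path AzCopy printed when the job started), and per-job failure details can be pulled with the jobs commands, roughly like this (job ID taken from the output above):

./azcopy jobs list
./azcopy jobs show 4d107bbf-4ce5-2443-49cc-bb48be2dcd51 --with-status=Failed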

Regards, Adele

WaitingForGuacamole commented 1 year ago

Adele,

Thanks for responding! I’m going to run this again, with INFO level logging, and get you logs after it runs overnight.
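
For reference, the re-run will be the same command with the log level raised and everything else unchanged:

./azcopy copy $AZCOPY_SOURCE $AZCOPY_DESTINATION --recursive --log-level=INFO --check-length=false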

Is there a way I can get these to you without posting archives on GitHub?

Cheers, Steve

gapra-msft commented 1 year ago

Hi @WaitingForGuacamole, if you are still experiencing this issue, please reach out with the logs to azcopydev AT microsoft.com

adrianmarcu18 commented 1 year ago

@gapra-msft @WaitingForGuacamole I have this issue as well. I think the problem comes down to how AzCopy creates the destination folders on demand (see the linked issue at the end of this comment), and what is needed is this:

We should have an option to first sync the directory structure and only then the files, or something like a flag to always create the parent directories in advance.

What I have done as a workaround, which is pretty annoying, is to use robocopy to sync the directory structure first and, after that is done, run azcopy again to sync the files as well.

But this means I have to mount the storage accounts in Windows, which I prefer not to do if possible.

Also, if you were to run robocopy to sync everything, including the files, that would also take ages.

Here is the command I use for robocopy (/MT:128 runs 128 threads, /e copies subdirectories including empty ones, and /xf * excludes all files, so only the directory tree is created):
robocopy \\<source>.file.core.windows.net\<source_path> \\<destination>.file.core.windows.net\<destination_path> /MT:128 /e /xf *
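
After that, the second pass is just a normal azcopy copy between the two SAS URIs (same shape as the command in the original report; substitute your own URIs):

./azcopy copy "<source share SAS URI>" "<destination share SAS URI>" --recursive --log-level=ERROR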

After the folder structure has been created, azcopy then runs really fast to the end.

Here is a link to a previous issue containing a statement that azcopy relies on request failures to create the folders: https://github.com/Azure/azure-storage-azcopy/issues/2179#issuecomment-1528690473