Azure / azure-storage-net-data-movement

Azure Storage Data Movement Library for .Net
MIT License

Blob download hangs #202

Open zvrba opened 4 years ago

zvrba commented 4 years ago

Which service (blob, file) does this issue concern?

BLOB storage (block blobs)

Which version of the SDK was used?

1.2.0 (latest) from NuGet

Which platform were you using? (.NET Framework version or .NET Core version, and OS version)

.NET Core 2.2, Windows 10

How can the problem be reproduced? It would help if the code that causes the problem can be shared.

This code sometimes just hangs with no visible download progress (judging by network traffic in Task Manager).

Microsoft.Azure.Storage.DataMovement.TransferManager.DownloadAsync(
    (Microsoft.Azure.Storage.Blob.CloudBlob)blobRef, destinationFile,
    downloadOptions, transferContext, CancellationToken).GetAwaiter().GetResult();

This is a command-line application, so there's no SynchronizationContext that could cause a deadlock. After a long period of time, the exception shown in the stack trace screenshot is thrown. There are no network connectivity problems.

What problem was encountered?

Download sometimes seems to get deadlocked or some event is missed. The files in question aren't even large (5-20MB), but I'm downloading thousands of them, one after another (i.e., there are no concurrent downloads -- next download starts after the previous one is finished). See the stack traces and thrown exceptions below.

Have you found a mitigation/solution?

No.

[Three screenshots attached: MicrosoftTeams-image, MicrosoftTeams-image (1), MicrosoftTeams-image (2)]

EmmaZhu commented 4 years ago

Hi @zvrba, regarding the error "The client could not finish the operation within specified timeout": this error is usually reported when DMLib encounters a temporary network issue and the request cannot be completed within 15 minutes.

DMLib supports resuming a transfer job from the last checkpoint, so it should be able to complete the remaining part of the transfer after a resume.

A TransferContext instance is used to save and pass the checkpoint. You can find sample code on how to use TransferContext to resume a transfer job here: https://github.com/Azure/azure-storage-net-data-movement/blob/master/samples/DataMovementSamples/DataMovementSamples/Samples.cs#L151

The sample code shows how to cancel and resume a transfer job. Resuming also works after other exceptions. If an error occurs, you can try resuming as in the sample here: https://github.com/Azure/azure-storage-net-data-movement/blob/master/samples/DataMovementSamples/DataMovementSamples/Samples.cs#L182
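
The resume pattern described above can be sketched roughly as follows. This is a minimal, library-agnostic sketch and not the code from the linked sample: `ResumeSketch` and `RunWithResumeAsync` are hypothetical names, and the three delegates stand in for the DMLib pieces (the comments show the calls they are assumed to map to, based on the linked sample).

```csharp
using System;
using System.Threading.Tasks;

public static class ResumeSketch
{
    // Runs a transfer until it completes without throwing, at most maxAttempts times.
    // Each new attempt is created from the checkpoint read after the previous failure
    // (null on the first attempt), so already completed work is not repeated.
    // Returns the number of attempts used; the last failure is rethrown.
    public static async Task<int> RunWithResumeAsync<TContext, TCheckpoint>(
        Func<TCheckpoint, TContext> createContext,  // assumed DMLib mapping: cp => new SingleTransferContext(cp)
        Func<TContext, Task> runTransfer,           // assumed DMLib mapping: ctx => TransferManager.DownloadAsync(blob, path, options, ctx, token)
        Func<TContext, TCheckpoint> readCheckpoint, // assumed DMLib mapping: ctx => ctx.LastCheckpoint
        int maxAttempts)
        where TCheckpoint : class
    {
        TCheckpoint checkpoint = null;
        for (int attempt = 1; ; attempt++)
        {
            TContext context = createContext(checkpoint);
            try
            {
                await runTransfer(context);
                return attempt;
            }
            catch (Exception ex) when (attempt < maxAttempts && !(ex is OperationCanceledException))
            {
                // Remember how far this attempt got and resume from there.
                checkpoint = readCheckpoint(context);
            }
        }
    }
}
```

Deliberate cancellation is not retried here; whether other exception types should be excluded from the retry is a design choice left to the caller.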

Thanks, Emma

zvrba commented 4 years ago

"this error is usually reported when DMLib encounters a temporary network issue and the request cannot be completed within 15 minutes."

What do you mean by "the" request? Related to that:

  1. The 15-minute timeout: what about downloading huge files that cannot be downloaded in 15 minutes? Will the operation fail and have to be restarted from a checkpoint?
  2. Even if there WERE a network glitch on my side, it definitely did NOT last for 15 minutes (as I still could browse the net). So there seems to be something amiss with handling of temporary network errors in this library.

EmmaZhu commented 4 years ago

@zvrba

  1. For a large file, DMLib splits it into chunks and uploads/downloads the chunks in parallel. The 15-minute timeout applies to a single request that uploads/downloads one chunk. If DMLib cannot complete one of these chunks within 15 minutes, a resume restarts the transfer for the remaining chunks only; it won't re-upload or re-download the completed chunks.
  2. It could be that a request or response is lost on the network and DMLib waits for the response until the timeout. This is something we should consider improving, for example by adding more reliable retries when this kind of error occurs.
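
To put the per-chunk timeout in perspective, here is a small arithmetic sketch. The 4 MiB chunk size is an assumption made purely for illustration (the thread does not state the chunk size), and `ChunkMath` is a hypothetical helper, not part of DMLib.

```csharp
using System;

public static class ChunkMath
{
    // Number of chunks needed to cover fileBytes when each chunk holds at most chunkBytes.
    public static long ChunkCount(long fileBytes, long chunkBytes)
    {
        if (fileBytes < 0) throw new ArgumentOutOfRangeException(nameof(fileBytes));
        if (chunkBytes <= 0) throw new ArgumentOutOfRangeException(nameof(chunkBytes));
        return (fileBytes + chunkBytes - 1) / chunkBytes; // ceiling division
    }

    // Lowest sustained throughput (bytes per second) at which a single chunk
    // still finishes within the per-request timeout.
    public static double MinBytesPerSecond(long chunkBytes, TimeSpan perRequestTimeout)
    {
        return chunkBytes / perRequestTimeout.TotalSeconds;
    }
}
```

With those assumed numbers, a 20 MiB file is 5 chunks, and a chunk would exceed the 15-minute limit only if throughput stayed below roughly 4,660 bytes per second for the whole request. A timeout on files of this size therefore points at a stalled request rather than a slow one.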

Thanks, Emma

zvrba commented 4 years ago

Hi,

I have downloaded the latest source and built it so that I can observe what is happening. During a transfer, the following happens:

Exception thrown: 'System.IO.IOException' in System.Net.Security.dll
Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

After that, the download simply hangs. See the attached screenshot.

This is likely a bug in DMLib's error handling. At the very least, the above exception should be treated as a fatal error and the transfer should be aborted immediately instead of waiting for the timeout to elapse. Alternatively, it would be nice if the library retried the request a couple of times, as this seems to be a Blob service problem rather than a network connectivity problem.

[Screenshot attached: picturemessage_ae2o34hy amh]
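
Until the library aborts on its own, a caller-side stall watchdog is one possible way to avoid waiting out the full timeout. The sketch below is an assumption-laden illustration, not a confirmed fix: `StallWatchdog` is a hypothetical helper, and it is not known from this thread whether DMLib reacts to the cancellation token while it is hung, which is why the helper also stops waiting on its own. With DMLib, the progress callback could presumably be fed from the transfer context's `ProgressHandler`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StallWatchdog
{
    // IProgress<long> that runs its callback synchronously on the reporting thread.
    private sealed class InlineProgress : IProgress<long>
    {
        private readonly Action _onReport;
        public InlineProgress(Action onReport) { _onReport = onReport; }
        public void Report(long value) { _onReport(); }
    }

    // Runs `transfer` and gives up if no progress is reported for `stallTimeout`.
    // On a stall the token passed to `transfer` is cancelled and TimeoutException is
    // thrown, unless the transfer itself finishes (or observes the cancellation) first,
    // in which case its own result or exception is propagated.
    public static async Task RunAsync(
        Func<IProgress<long>, CancellationToken, Task> transfer,
        TimeSpan stallTimeout,
        CancellationToken outer = default(CancellationToken))
    {
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(outer))
        {
            var stalled = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
            using (cts.Token.Register(() => stalled.TrySetResult(true)))
            {
                cts.CancelAfter(stallTimeout);
                // Every progress report pushes the stall deadline forward.
                var progress = new InlineProgress(() =>
                {
                    try { cts.CancelAfter(stallTimeout); }
                    catch (ObjectDisposedException) { /* late report after the watchdog finished */ }
                });

                Task work = transfer(progress, cts.Token);
                Task first = await Task.WhenAny(work, stalled.Task);
                if (first == work)
                {
                    await work; // propagate the transfer's own result or exception
                    return;
                }

                // The transfer is abandoned here; it may still be running in the background.
                outer.ThrowIfCancellationRequested();
                throw new TimeoutException("No transfer progress within " + stallTimeout + ".");
            }
        }
    }
}
```

A caller could combine this with a checkpoint-based resume, treating the `TimeoutException` like any other transfer failure.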

EmmaZhu commented 4 years ago

Hi @zvrba,

Thanks a lot for the detailed investigation.

  1. About waiting until the timeout after the exception 'System.IO.IOException' in System.Net.Security.dll: this is an error handling issue. The fix may require changes in DMLib's dependency, so we'll need to figure out a valid fix for it.

  2. About the spinning in TransferScheduler: when all transfer jobs are completed, TransferScheduler stops spinning. Code is here. It then waits on a blocking collection, which does not use CPU when empty. After issue 1 is fixed, the spinning in TransferScheduler would also be mitigated.

Thanks, Emma