Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Dataset.download() hangs for a long time and, when done, some files are missing in the destination folder #21878

Closed janluke closed 2 years ago

janluke commented 2 years ago

Describe the bug I tried to use the Dataset.download() method to download a registered dataset (made of multiple files) on my personal computer (Windows 10, ~50 Mbps connection). For small test datasets (a few MBs) it works as expected. For bigger datasets (~3 GB) the download hangs, or terminates after a long time with no exception and no error logged. Furthermore, some of the files are missing from the target folder. The same happens on my colleague's macOS laptop.

Everything works properly in my Azure ML virtual machine (running Linux).

To Reproduce

  1. Register a dataset from a datastore.
  2. Try to download it with

    from azureml.core import Workspace, Dataset
    from azureml.core.authentication import InteractiveLoginAuthentication
    
    # Fill in workspace arguments
    workspace = Workspace.get(
      name="",
      subscription_id="",
      resource_group="",
      auth=InteractiveLoginAuthentication(tenant_id="")
    )
    
    dataset = Dataset.get_by_name(workspace, "<dataset-name>")
    dataset.download("your/target/path")

Expected behavior The dataset is downloaded to "your/target/path".

xiangyan99 commented 2 years ago

Thanks for the feedback, we’ll investigate asap.

PramodValavala-MSFT commented 2 years ago

@janluke I believe in most cases this would happen due to network stability issues. It looks like the method doesn't overwrite files by default, so running it a second time should pick up the missing files, though a proper resume option would be better.
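The "run it again as a poor man's resume" idea above can be sketched as a small retry loop. This is a generic illustration only: `download_fn` and `expected_files` are hypothetical stand-ins for `Dataset.download` and the dataset's file listing, not azureml API names.

```python
import os

def download_until_complete(download_fn, expected_files, target_dir, max_attempts=3):
    """Re-run a download callable until every expected relative path exists.

    download_fn and expected_files are hypothetical stand-ins for
    Dataset.download and the dataset's file listing; this only sketches
    re-running a download that skips already-present files.
    """
    for _ in range(max_attempts):
        missing = [f for f in expected_files
                   if not os.path.exists(os.path.join(target_dir, f))]
        if not missing:
            return True
        download_fn(target_dir)  # assumed to skip files that already exist
    return all(os.path.exists(os.path.join(target_dir, f)) for f in expected_files)
```

This only papers over the underlying hang; it does not help if a single large file is what stalls.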

We are routing this to the team concerned for more insights.

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details
- **Package Name**: azureml-core | azureml-dataset-runtime
- **Package Version**: 1.36.0.post2 | 1.36.0
- **Operating System**: Windows 10, macOS
- **Python Version**: 3.8
Author: janluke
Assignees: PramodValavala-MSFT
Labels: `bug`, `Machine Learning`, `Service Attention`, `Client`, `customer-reported`, `needs-team-attention`, `CXP Attention`
Milestone: -
janluke commented 2 years ago

@PramodValavala-MSFT The issue was consistently reproduced on three different machines, OSes, and network connections. None of us has been able to use this method to download a dataset of a few GB locally even a single time. If network stability is the issue, I'm afraid this method's network stability requirement is too high for most people :)

SaurabhSharma-MSFT commented 2 years ago

@azureml-github Can you please help here?

diondrapeck commented 2 years ago

@janluke - Do you know if this is a file dataset or a tabular dataset that you're trying to download?

janluke commented 2 years ago

File dataset. A folder.

anliakho2 commented 2 years ago

@janluke Sorry for the delay in getting back to you. This is definitely neither expected nor something that we have seen before. Given that you can repro this across machines and users, I think the issue is in your specific set of files and setup. To help investigate this better could you please share some additional info with me:

  1. What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?
  2. How is the file dataset defined? Ideally, you could post the output of dataset._dataflow._steps in your environment.
  3. what is the structure of the source datastore in terms of number of files and folders?
  4. Have you configured any VPNs, custom proxies, or credential-less datastores?
  5. And last but not least, please share your telemetry session id by running the following before a new download attempt (it would let us check telemetry on our side):
    from azureml._base_sdk_common import _ClientSessionId
    print(_ClientSessionId)

Thanks !

janluke commented 2 years ago
  1. What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?

Azure Blob Storage.

  2. How is the file dataset defined? Ideally, you could post the output of dataset._dataflow._steps in your environment.

Can't run that now and I'm not sure I understood the question. The dataset was created by using the Azure ML Studio web interface ("create dataset from datastore").

  3. What is the structure of the source datastore in terms of number of files and folders?

The number of files seems irrelevant. The real dataset looks like this: [screenshot] there's a folder containing 50 HTML files with analyses (1.5 MB total). But I tried downloading a dataset containing only the 1.3 GB file and had the same problem.

  4. Have you configured any VPNs, custom proxies, or credential-less datastores?

No

  5. And last but not least, please share your telemetry session id by running the following before a new download attempt (it would let us check telemetry on our side):

ef778d69-6b7d-4412-8031-29936b4ee656. I saved the debug-level log. If it doesn't contain any sensitive information, I'll share it in a gist (I'll replace workspace information with placeholders anyway).

mickare commented 2 years ago

Hey, I'm currently debugging a related issue that occurs randomly when downloading large files (>10 GB) in concurrent mode (4 threads): the download hangs on the last chunk.

While debugging a download of a large file, I had some small VPN hiccups at 37%. The download progress stopped and never resumed. When I pause and connect the debugger, all threads in the pool of StorageStreamDownloader.readinto(...) are in a non-suspended state: Frames not available in non-suspended state.

I suspect that at some point the SDK is waiting to receive data, but because of the short connection loss it never will.

Further, I suspect that the custom HTTP client of the Azure SDK is not able to recover from such a failure state and keeps waiting for new data on the broken socket. Can someone confirm?
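The failure mode described here, a blocking read on a socket whose peer has silently gone away, can be illustrated locally without any Azure dependency. This is a minimal sketch of the general mechanism, not SDK code: without a finite timeout, `recv()` on a silent but still-open peer blocks indefinitely; with one, the hang becomes a catchable `socket.timeout`.

```python
import socket

def read_with_timeout(sock, timeout_s):
    """Return received data, or None if the read timed out.

    With no timeout configured, recv() on a peer that sends nothing
    (but never closes) blocks forever -- the hang described above.
    A finite timeout converts that hang into a socket.timeout a caller
    could log, retry, or surface as an error.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None

left, right = socket.socketpair()
# the peer (right) sends nothing: without a timeout this recv would hang
result = read_with_timeout(left, 0.1)
left.close()
right.close()
```

The key point is that the OS does not necessarily notify a blocked reader when the remote side silently disappears, so the application-level read timeout is the only safety net.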

mickare commented 2 years ago

Uff...

Is it maybe a thread-safety issue? That could explain why it sometimes happens randomly...

https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/azure/core/pipeline/transport/_requests_basic.py#L208

class RequestsTransport(HttpTransport):
    """Implements a basic requests HTTP sender.

    Since requests team recommends to use one session per requests, you should
    not consider this class as thread-safe, since it will use one Session
    per instance.
    """

That is called in multiple threads in: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/azure/storage/blob/_download.py#L612

        if parallel:
            import concurrent.futures
            with concurrent.futures.ThreadPoolExecutor(self._max_concurrency) as executor:
                list(executor.map(
                        with_current_context(downloader.process_chunk),
                        downloader.get_chunk_offsets()
                    ))
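For context, the usual mitigation when a session type is not thread-safe is one session per worker thread, e.g. via `threading.local`. This is a generic sketch of that pattern (a plain `object()` stands in for `requests.Session` to keep it dependency-free), not a claim about how the SDK is or should be implemented:

```python
import threading

class PerThreadSession:
    """Hand each worker thread its own session object instead of sharing one.

    The RequestsTransport docstring quoted above warns it should not be
    considered thread-safe because it uses one Session per instance, so the
    classic mitigation is one session per thread. A plain object() stands in
    for requests.Session here.
    """
    def __init__(self, factory=object):
        self._local = threading.local()
        self._factory = factory

    def get(self):
        # threading.local gives every thread an independent attribute
        # namespace, so the factory runs at most once per thread
        if not hasattr(self._local, "session"):
            self._local.session = self._factory()
        return self._local.session

holder = PerThreadSession()
seen = {}

def worker(i):
    seen[i] = holder.get()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# every thread observed a distinct session instance
distinct_sessions = len({id(s) for s in seen.values()})
```

Whether a shared session is actually what corrupts or stalls the chunk downloads here is unconfirmed; this only shows the standard isolation pattern.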
mickare commented 2 years ago

Also, why is the read_timeout 22 hours?

https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/constants.py#L23

READ_TIMEOUT = 20

# for python 3.5+, there was a change to the definition of the socket timeout (as far as socket.sendall is concerned)
# The socket timeout is now the maximum total duration to send all data.
if sys.version_info >= (3, 5):
    # the timeout to connect is 20 seconds, and the read timeout is 80000 seconds
    # the 80000 seconds was calculated with:
    # 4000MB (max block size)/ 50KB/s (an arbitrarily chosen minimum upload speed)
    READ_TIMEOUT = 80000

The requests timeout parameter is set to timeout=(20, 80000).

From https://docs.python-requests.org/en/master/user/advanced/#timeouts :

Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte).
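For what it's worth, the 80000-second figure does follow from the arithmetic in the code comment, but only under a total-transfer-time reading; per the requests documentation quoted above, the read timeout actually applies between bytes received, so the constant is vastly over-provisioned for detecting a dead connection. A quick check of the arithmetic (assumptions taken from the constants.py comment itself):

```python
# Reconstructing the arithmetic from the constants.py comment:
# time to transfer the max block size at the assumed minimum speed.
max_block_bytes = 4000 * 1024 * 1024      # 4000 MB max block size
min_speed_bytes_per_s = 50 * 1024         # 50 KB/s "arbitrarily chosen" minimum
worst_case_s = max_block_bytes / min_speed_bytes_per_s  # total-transfer reading
worst_case_hours = worst_case_s / 3600    # close to the 80000 s actually used
```

Since requests resets the read timeout on every byte received, a slow-but-alive 50 KB/s transfer would never trip even a modest read timeout; the huge value only delays detection of a fully dead socket.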

mickare commented 2 years ago

I confirmed it by setting the read_timeout to 300 seconds. The read operation on the last chunk hangs and blocks the whole download. There is a deadlock!

Downloading 36b90544 (1/1): 100%|█████████████████▋| 9.83G/9.84G [22:25<00:01, 7.85MB/s]
Traceback (most recent call last):
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 463, in read
    n = self.readinto(b)
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 507, in readinto
    n = self.fp.readinto(b)
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/requests/models.py", line 758, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='xxx.blob.core.windows.net', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "... xxx ..."
    blob_client.download_blob(read_timeout=300, **downloader_kwargs).readinto(dst_fp)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 613, in readinto
    list(executor.map(
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 129, in process_chunk
    chunk_data = self._download_chunk(chunk_start, chunk_end - 1)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 212, in _download_chunk
    chunk_data = process_content(response, offset[0], offset[1], self.encryption_options)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 52, in process_content
    content = b"".join(list(data))
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 168, in __next__
    chunk = next(self.iter_content_func)
  File "/Users/me/xxx/venv/lib/python3.9/site-packages/requests/models.py", line 765, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='xxx.blob.core.windows.net', port=443): Read timed out.

Process finished with exit code 1
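The call site in this traceback (`download_blob(read_timeout=300, ...)`) pairs a short read timeout with a hard failure. A natural next step, sketched generically here with no Azure dependency, is to wrap such a call in a bounded retry so a dead connection fails fast and is retried instead of hanging for hours; `retry_on_timeout` is a hypothetical helper, not an SDK function.

```python
def retry_on_timeout(operation, attempts=3, timeout_exceptions=(TimeoutError,)):
    """Run operation(); on a timeout-style exception, retry up to `attempts` times.

    A short read timeout turns a silent hang into one of timeout_exceptions,
    which this wrapper converts into a bounded number of fresh attempts
    before re-raising the last failure.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return operation()
        except timeout_exceptions as exc:
            last_exc = exc
    raise last_exc
```

Note this restarts the whole operation; true resume-from-offset support would still be needed to avoid re-downloading completed chunks.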
janluke commented 2 years ago

Any update on this?

maulberto3 commented 2 years ago

I am also having issues downloading a Dataset from my blob into my Azure ML notebook. It just won't complete for a ~10 KB sample dataframe. Any help on this? It was working this morning, but not anymore this afternoon. It's as if azureml won't connect to the blob, and/or vice versa.

janluke commented 2 years ago

Sorry to be that guy, but how come this hasn't been fixed yet? Being able to download a dataset is the most basic feature of a data registry. Some of my colleagues experience the same problem with the model registry, so we're not always able to download models to try them locally.

anliakho2 commented 2 years ago

Hi @janluke, sorry this is still happening to you. This is definitely one of the core scenarios and should work very reliably. What version of azureml-dataprep do you have installed? Also, I know I have asked you this before, but could you please also provide the session_id for the most recent failure?

ghost commented 2 years ago

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!