Closed: janluke closed this issue 2 years ago.
Thanks for the feedback, we’ll investigate asap.
@janluke I believe in most cases this would happen due to network stability issues. It looks like the method doesn't overwrite files by default, so running it a second time should do, I suppose, though a resume option would be better.
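A rough sketch of that suggestion, assuming the azureml SDK v1 Dataset API and a hypothetical registered dataset named "my-dataset" (note that later comments in this thread report hangs rather than exceptions, so a retry loop like this may not actually help):

import time
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="my-dataset")  # hypothetical name

# overwrite=False is the documented default; the idea is simply to re-run
# the download after a failure. Whether a second run skips or rejects
# already-downloaded files is part of what's being discussed here.
for attempt in range(3):
    try:
        dataset.download(target_path="data/", overwrite=False)
        break
    except Exception:
        time.sleep(5 * (attempt + 1))  # naive backoff before retrying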
We are routing this to the team concerned for more insights.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.
| Author | janluke |
| --- | --- |
| Assignees | PramodValavala-MSFT |
| Labels | `bug`, `Machine Learning`, `Service Attention`, `Client`, `customer-reported`, `needs-team-attention`, `CXP Attention` |
| Milestone | - |
@PramodValavala-MSFT The issue was consistently reproduced on three different machines, OSes, and network connections. None of us has been able to use this method to download a dataset of a few GB locally even a single time. If network stability is the issue, I'm afraid the network stability requirement of this method is too high for most people :)
@azureml-github Can you please help here?
@janluke - Do you know if this is a file dataset or a tabular dataset that you're trying to download?
File dataset. A folder.
@janluke Sorry for the delay in getting back to you. This is definitely neither expected nor something that we have seen before. Given that you can repro this across machines and users, I think the issue is in your specific set of files and setup. To help investigate this better could you please share some additional info with me:
- What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?
- How is the file dataset defined? Ideally, you could post the output of dataset._dataflow._steps in your environment.
- What is the structure of the source datastore in terms of number of files and folders?
- Have you configured any VPNs, custom proxies or credential-less datastores?
- And last but not least, please share your telemetry session id (it would let us check telemetry on our side) by running the following before a new download attempt:

from azureml._base_sdk_common import _ClientSessionId
print(_ClientSessionId)

Thanks!
- What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?
Azure Blob Storage.
- How is the file dataset defined? Ideally, you could post the output of dataset._dataflow._steps in your environment.
Can't run that now and I'm not sure I understood the question. The dataset was created by using the Azure ML Studio web interface ("create dataset from datastore").
- what is the structure of the source datastore in terms of number of files and folders?
Number of files seems irrelevant. The real dataset looks like this: there's a folder containing 50 html files with analysis (1.5 MB total). But I tried downloading a dataset containing only the 1.3 GB file and had the same problem.
- Have you configured any VPNs, custom proxies or credential-less datastores?
No
- And last but not least, please share your telemetry session id by running the snippet above before a new download attempt (it would let us check telemetry on our side).
ef778d69-6b7d-4412-8031-29936b4ee656
I saved the debug-level log. If it doesn't contain any sensitive information I'll share it in a gist (I'll replace workspace information with placeholder anyway).
Hey, I'm currently debugging a related issue that happens randomly for me when downloading large files (>10 GB) in concurrent mode (4 threads): the download hangs on the last chunk.
While debugging a download of a large file, I had some small VPN hiccups at 37%. The download progress stopped and never resumed.
When I pause the process and attach the debugger, all threads in the pool of StorageStreamDownloader.readinto(...) are in a non-suspended state ("Frames not available in non-suspended state").
I suspect that at some point the SDK is waiting to receive data, but because of the short connection loss it will never receive any.
Further, I suspect that the custom HTTP client of the Azure SDK is not able to recover from such a failure state and keeps waiting for new data from the broken socket. Can someone confirm?
Uff,....
Is it maybe a thread-safety issue? That could explain why it sometimes happens randomly...
class RequestsTransport(HttpTransport):
"""Implements a basic requests HTTP sender.
Since requests team recommends to use one session per requests, you should
not consider this class as thread-safe, since it will use one Session
per instance.
"""
That is called in multiple threads in: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/azure/storage/blob/_download.py#L612
if parallel:
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(self._max_concurrency) as executor:
list(executor.map(
with_current_context(downloader.process_chunk),
downloader.get_chunk_offsets()
))
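To make the thread-safety suspicion concrete, here is a minimal, hypothetical sketch of the pattern in question: one shared requests.Session issuing ranged GETs from several worker threads at once. The URL, chunk size, and offsets are invented for illustration; this is not the SDK's actual code path.

import concurrent.futures
import requests

CHUNK = 4 * 1024 * 1024  # arbitrary 4 MiB chunk size for the sketch
session = requests.Session()  # one Session per transport instance, shared by all workers

def fetch_range(offset):
    # Each worker issues a ranged GET through the same shared session.
    headers = {"Range": f"bytes={offset}-{offset + CHUNK - 1}"}
    resp = session.get(
        "https://account.blob.core.windows.net/container/blob",  # placeholder URL
        headers=headers,
        timeout=(20, 80000),
    )
    resp.raise_for_status()
    return resp.content

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    chunks = list(executor.map(fetch_range, range(0, 16 * CHUNK, CHUNK)))

If requests.Session really isn't safe to share like this, intermittent corruption of connection state would fit the "sometimes happens randomly" behaviour described above.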
Also, why is the read_timeout 22 hours?
READ_TIMEOUT = 20
# for python 3.5+, there was a change to the definition of the socket timeout (as far as socket.sendall is concerned)
# The socket timeout is now the maximum total duration to send all data.
if sys.version_info >= (3, 5):
# the timeout to connect is 20 seconds, and the read timeout is 80000 seconds
# the 80000 seconds was calculated with:
# 4000MB (max block size)/ 50KB/s (an arbitrarily chosen minimum upload speed)
READ_TIMEOUT = 80000
The requests timeout parameter is set to timeout=(20, 80000).
From https://docs.python-requests.org/en/master/user/advanced/#timeouts :
Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte).
I confirmed it by setting the read_timeout to 300 seconds: the read operation on the last chunk hangs and blocks the whole download. There is a deadlock!
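For reference, a minimal sketch of the kind of call used for this experiment, mirroring the blob_client.download_blob(read_timeout=300, ...) line visible in the traceback below; the connection string, container, and blob names are placeholders:

from azure.storage.blob import BlobClient

# Placeholder connection details.
blob_client = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="<container>",
    blob_name="<blob>",
)

with open("big.bin", "wb") as dst_fp:
    # Lowering read_timeout from the 80000-second default makes a stalled
    # socket raise instead of blocking for hours.
    blob_client.download_blob(read_timeout=300, max_concurrency=4).readinto(dst_fp)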
Downloading 36b90544 (1/1): 100%|████████████████████▋| 9.83G/9.84G [22:25<00:01, 7.85MB/s]
Traceback (most recent call last):
File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 519, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 463, in read
n = self.readinto(b)
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 507, in readinto
n = self.fp.readinto(b)
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/xxx/venv/lib/python3.9/site-packages/requests/models.py", line 758, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 576, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 541, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='xxx.blob.core.windows.net', port=443): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "... xxx ..."
blob_client.download_blob(read_timeout=300, **downloader_kwargs).readinto(dst_fp)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 613, in readinto
list(executor.map(
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/usr/local/Cellar/python@3.9/3.9.8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 129, in process_chunk
chunk_data = self._download_chunk(chunk_start, chunk_end - 1)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 212, in _download_chunk
chunk_data = process_content(response, offset[0], offset[1], self.encryption_options)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/storage/blob/_download.py", line 52, in process_content
content = b"".join(list(data))
File "/Users/me/xxx/venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 168, in __next__
chunk = next(self.iter_content_func)
File "/Users/me/xxx/venv/lib/python3.9/site-packages/requests/models.py", line 765, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='xxx.blob.core.windows.net', port=443): Read timed out.
Process finished with exit code 1
Any update on this?
I am also having issues downloading a Dataset from my blob into my Azure ML notebook. It just won't complete for a ~10 KB sample dataframe. Any help on this? It was working this morning, but not anymore this afternoon. It's as if azureml won't connect to the blob, and/or vice versa.
Sorry to be that guy, but how come this has not been fixed yet? Being able to download a dataset is the most basic feature of a data registry. Some of my colleagues experience the same problem with the model registry, so we're not always able to download models to try them locally.
Hi @janluke, sorry this still happens to you. This is definitely one of the core scenarios and should work very reliably. What version of azureml-dataprep do you have installed? Also, I know I have asked you this before, but could you please also provide the session_id for the recent failure?
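One way to check the installed version, using pip or the standard library's metadata lookup:

# From a shell:
#   pip show azureml-dataprep
# Or from Python (3.8+):
from importlib.metadata import version
print(version("azureml-dataprep"))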
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
Describe the bug
I tried to use the Dataset.download() method to download a registered dataset (made of multiple files) on my personal computer (Windows 10, ~50 Mbps connection). For small test datasets (a few MBs), it works as expected. For bigger datasets (~3 GB), the download hangs or terminates after a long time with no exception or logged errors. Furthermore, some of the files are missing in the target folder. The same happens on the macOS laptop of my colleague. Everything works properly in my Azure ML virtual machine (running Linux).
To Reproduce
Try to download it with:
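(A minimal sketch of the described call, assuming the SDK v1 API and a hypothetical registered dataset named "my-dataset":)

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="my-dataset")  # hypothetical name
dataset.download(target_path="your/target/path", overwrite=False)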
Expected behavior
The dataset is downloaded to "your/target/path".