Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

TensorFlow Alluxio Python SDK fails to load big files #13900

Open · LuQQiu opened this issue 3 years ago

LuQQiu commented 3 years ago

Alluxio Version:

Describe the bug
When reading a big file using the Alluxio Python SDK, the Python streaming function (a TensorFlow application) reads some bytes, rests for a while, then reads some more bytes, rests again, and so on. If the application rests for about 40 seconds, the next read fails. It looks like the fuse read connection is broken. No issue occurs when reading small files.

import json
import sys
import alluxio
from alluxio import option
client = alluxio.Client('<alluxio_host>', <alluxio_port>)

with client.open('/path/to/1GB/file', 'r') as f:
    a = f.read(100)  # the first small read succeeds
    import time
    for i in range(40):  # stay idle for ~40 seconds
        time.sleep(1)
        print(f'{i}  ', end='\r')
    a = f.read(1024 * 1024 * 1024 * 2)  # this read fails after the idle period
    print(f"finish, {len(a)}")

LuQQiu commented 3 years ago

Did a test on a local Mac; this issue can be reproduced with:

pip3 install alluxio
pip3 install -r /alluxio-py/requirements.txt
python3 test.py

import json
import sys
import alluxio
from alluxio import option
client = alluxio.Client('localhost', 39999)

with client.open('/alluxio-2.7.0-SNAPSHOT-client.jar', 'r') as f:
    a = f.read(100)
    import time
    for i in range(40):
        time.sleep(1)
        print(f'{i}  ', end='\r')
    a = f.read(1024 * 1024 * 1024 * 2)
    print(f"finish, {len(a)}")

The file is only 27 MB.

When the counter reaches 39, the following error occurs:

Traceback (most recent call last):
  File "/Users/alluxio/alluxioFolder/alluxio-py/test.py", line 13, in <module>
    a = f.read(1024 * 1024 * 1024 * 2)
  File "/Users/alluxio/alluxioFolder/alluxio-py/alluxio/client.py", line 620, in read
    return self.response.raw.read(num)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 461, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/Cellar/python@3.9/3.9.0_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 380, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(524188 bytes read)', IncompleteRead(524188 bytes read))

I added logs to StreamsRestServiceHandler.read() and close(). In our code, I see that open returns an alluxio.client.file.FileInStream. After the error appears, the log shows that close() is called to invalidate the FileInStream in the stream cache.

I suspect the timeout comes from code in alluxio-py rather than from the proxy; there is no timeout logic in the proxy code.
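
Until the root cause is fixed, a blunt client-side workaround seems possible using only the calls already shown above (alluxio.Client, client.open, f.read). The sketch below is just an illustration, not alluxio-py API: the read_after_idle helper and its retry logic are made up, no seek support is assumed, and it assumes that closing the broken reader does not itself raise. It catches the ProtocolError, reopens the file, and re-reads and discards the prefix that was already consumed before continuing.

import time

import alluxio
import urllib3


def read_after_idle(client, path, first_bytes, rest_bytes, idle_seconds):
    """Read first_bytes, idle, then read rest_bytes, reopening if the stream broke."""
    with client.open(path, 'r') as f:
        head = f.read(first_bytes)
        time.sleep(idle_seconds)
        try:
            return head + f.read(rest_bytes)
        except urllib3.exceptions.ProtocolError:
            pass  # the connection was dropped while we were idle; reopen below

    # Reopen and skip the prefix we already have. No seek is assumed, so the
    # prefix is simply read again and thrown away before continuing.
    with client.open(path, 'r') as f:
        f.read(first_bytes)
        return head + f.read(rest_bytes)


client = alluxio.Client('localhost', 39999)
data = read_after_idle(client, '/alluxio-2.7.0-SNAPSHOT-client.jar',
                       first_bytes=100, rest_bytes=1024 * 1024 * 1024 * 2,
                       idle_seconds=40)
print(f'finish, {len(data)}')

The obvious cost is that the already-read prefix is transferred twice, so this only helps when the idle period, not bandwidth, is the bottleneck.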

LuQQiu commented 3 years ago

See https://github.com/tweepy/tweepy/issues/908: alluxio-py has some dependencies that may cause this issue. We do not time out on the Alluxio proxy side or on the alluxio-py side.
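
For reference, the dependency layer alone can produce this exact error without any Alluxio component involved. The sketch below is only an illustration and assumes nothing Alluxio-specific: a throwaway local HTTP server streams one chunk of a chunked response and then drops the connection, the way a server-side idle/keep-alive timeout would, and reading it the same way alluxio/client.py does (response.raw.read) surfaces the same ProtocolError wrapping an IncompleteRead.

import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests
import urllib3


class DroppedStreamHandler(BaseHTTPRequestHandler):
    """Streams one chunk of a chunked response, then drops the connection."""
    protocol_version = 'HTTP/1.1'

    def do_GET(self):
        chunk = b'x' * (64 * 1024)
        self.send_response(200)
        self.send_header('Transfer-Encoding', 'chunked')
        self.end_headers()
        # One well-formed chunk ...
        self.wfile.write(b'%x\r\n' % len(chunk) + chunk + b'\r\n')
        # ... then pretend an idle timeout fired: never send the terminating
        # 0-length chunk, just let the connection be closed.
        time.sleep(2)
        self.close_connection = True

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('localhost', 0), DroppedStreamHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = requests.get(f'http://localhost:{server.server_port}/big-file', stream=True)
first = resp.raw.read(100)   # the first small read succeeds, as reported above
time.sleep(5)                # client goes idle; the server drops the connection
try:
    rest = resp.raw.read(1024 * 1024)
except urllib3.exceptions.ProtocolError as e:
    print('Reproduced:', e)  # Connection broken: IncompleteRead(... bytes read)
finally:
    server.shutdown()

So anything that closes the connection during the ~40-second idle window (an HTTP keep-alive or idle timeout somewhere between the client and the proxy) would be enough to explain the traceback above.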

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.