Tinche / aiofiles

File support for asyncio
Apache License 2.0

Iterating over a very large file (or files) using `readline` causes `queue.SimpleQueue.get(block=True)` to raise `queue.Empty`. #183

Closed (kaorihinata closed this issue 3 months ago)

kaorihinata commented 3 months ago

I'm running into some very strange behavior when iterating over very large files using readline to pull out rows in chunks (this is part of some code I've inherited from somewhere else, and would be more than happy to rewrite the rest of it if we can find out what's going on.)

Essentially, I've extended the parent class with the following:

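from typing import AsyncGenerator


# AsyncTextReader is the inherited parent class (not shown here).
class ChunkedAsyncTextReader(AsyncTextReader):

    async def _take_at_most(self, count):
        # Read up to `count` lines; stop early at EOF (readline returns b"").
        result = []
        for _ in range(count):
            line = await self.stream.readline()
            if not line:
                break
            result.append(self.parser.parse_line(line.rstrip(b"\n")))
        return result

    async def read_records_chunked(
        self, count: int = 1000
    ) -> AsyncGenerator[list, None]:
        # Yield lists of parsed records until the stream is exhausted.
        result = await self._take_at_most(count)
        while result:
            yield result
            result = await self._take_at_most(count)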

self.stream above is the result of aiofiles.open(filepath, "rb"), an aiofiles.threadpool.binary.AsyncBufferedReader. After reading thousands of lines, there's about a 70% chance the run will end as follows:

...
entering self.stream.readline
CRITICAL:concurrent.futures:Exception in worker
Traceback (most recent call last):
  File "/Users/nn/.sandbox/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
_queue.Empty

At a glance, this seems impossible. The documentation states that timeout should be None by default, and a value of True for block should mean that get then waits indefinitely for a value to become available, but here it's clearly raising Empty, and I cannot figure out why. The pure Python implementation of SimpleQueue in queue.py at least seems to conform to this expectation, as does the C implementation which is being used above, but the issue always occurs in aiofiles.threadpool.binary.AsyncBufferedReader's readline coroutine, or at least, the code never reaches the line after it.
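For reference, the documented behavior of get can be checked in isolation with a minimal sketch like the one below. It only exercises queue.SimpleQueue from the standard library; the variable names are made up for illustration, and it says nothing about what happens inside the thread pool worker from the traceback.

```python
import queue
import threading

q = queue.SimpleQueue()

# A non-blocking get on an empty queue raises queue.Empty immediately.
try:
    q.get(block=False)
    nonblocking_raised = False
except queue.Empty:
    nonblocking_raised = True

# A blocking get with a finite timeout raises queue.Empty once the timeout expires.
try:
    q.get(block=True, timeout=0.05)
    timeout_raised = False
except queue.Empty:
    timeout_raised = True

# A blocking get with the default timeout (None) waits until a producer puts an item.
producer = threading.Timer(0.05, q.put, args=("item",))
producer.start()
blocking_result = q.get(block=True)
producer.join()

print(nonblocking_raised, timeout_raised, blocking_result)
```

Under the documented semantics, only the first two calls should ever raise Empty, which is why an Empty from get(block=True) with no timeout looks wrong.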

Any help you can provide to figure out why this is happening would be appreciated. I apologize if I've made some incorrect assumptions above. I've spent most of the day just trying to determine who I should be filing an issue with, and I'm honestly still not 100% clear I'm in the right place as technically this is stopping in CPython code.
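One way to narrow down where the issue belongs might be to take aiofiles out of the picture. The sketch below assumes that the relevant pattern is one executor job per readline call, and reproduces that with loop.run_in_executor and a plain file object. The function names and the temporary test file are hypothetical, and this is not aiofiles' actual implementation.

```python
import asyncio
import os
import tempfile


async def count_lines_via_executor(path: str) -> int:
    """Read `path` line by line, submitting one executor job per readline call."""
    loop = asyncio.get_running_loop()
    count = 0
    with open(path, "rb") as stream:
        while True:
            # Each call puts one work item on the default ThreadPoolExecutor's queue.
            line = await loop.run_in_executor(None, stream.readline)
            if not line:
                break
            count += 1
    return count


def make_test_file(line_count: int) -> str:
    """Write a throwaway file with `line_count` short lines and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as handle:
        for i in range(line_count):
            handle.write(b"row %d\n" % i)
    return path


if __name__ == "__main__":
    path = make_test_file(200_000)
    try:
        print(asyncio.run(count_lines_via_executor(path)))
    finally:
        os.remove(path)
```

If a script along these lines fails the same way, that would point away from aiofiles and toward the interpreter or its thread pool; if it never fails, the aiofiles wrapper would be worth a closer look.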

Operating system is macOS 15.0. Python version is 3.11.9. aiofiles version is 23.2.1.
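For anyone trying to reproduce this, a small helper like the following could collect the same details, plus the machine architecture, in one go. It is only a sketch: describe_environment is a hypothetical name, and it relies solely on the standard platform and importlib.metadata modules.

```python
import platform
from importlib import metadata


def describe_environment(package: str = "aiofiles") -> dict:
    """Collect OS, architecture, interpreter, and package version details."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        package_version = "not installed"
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        package: package_version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```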

kaorihinata commented 3 months ago

Closing this as (probably) invalid. After more testing, this seems like it may just be a poorly tested skeleton in the closet of the arm64 build of CPython, and probably not aiofiles' problem. 3.9.x was fine, then 3.10.x and 3.11.x were unstable, then 3.12.x went back to being fine (API changes in 3.12.4 notwithstanding).