MKuranowski / aiocsv

Python: Asynchronous CSV reading/writing
https://pypi.org/project/aiocsv/
MIT License

ValueError when reader reads more than requested #24

Closed · dnikolayev closed this 6 months ago

dnikolayev commented 6 months ago

Hello,

First of all, thanks for your work and your time!

Python 3.12. The code below works perfectly with aiofile (not aiofiles) and libaio-1 (Linux async support) on aiocsv 1.2.5; however, 1.3.0 raises a ValueError.

Code example:

    async with aiofile.async_open('support/geo_city_public.csv', 'r', encoding='utf-8-sig') as afp:
        async for row in AsyncDictReader(afp, delimiter=";"):
            ...  # process each row here

Error on Linux (Debian LTS):

        async for row in AsyncDictReader(afp, delimiter=";"):
      File "......./venv312/lib/python3.12/site-packages/aiocsv/readers.py", line 90, in __anext__
        self.fieldnames = await self.reader.__anext__()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "......./venv312/lib/python3.12/site-packages/aiocsv/readers.py", line 43, in __anext__
        return await self._parser.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: reader has read 16381 bytes, which is more than the requested 8192 bytes

On macOS, 1.3.0 works, but I was able to reproduce the error by removing encoding='utf-8-sig' from the async_open call. In both cases the version of aiofile is the same; however, on macOS the reported count was one byte higher (ValueError: reader has read 16382 bytes, which is more than the requested 8192 bytes). Not sure if this reproduction of the ValueError on macOS relates to the original problem.

For now, I have reverted to aiocsv 1.2.5; the code works perfectly on huge CSV files (2-3 GB). Please let me know if you need anything else (to test, for instance).

Thanks!

MKuranowski commented 6 months ago

This looks like an issue with aiofile; the problem is exactly what the raised ValueError indicates.

The documentation for read in Python clearly states:

> Read and return at most size characters

aiofile has returned more than the requested size characters.


aiocsv essentially invokes await reader.read(io.DEFAULT_BUFFER_SIZE) and then copies the result into a C array Py_UCS4 buffer[io.DEFAULT_BUFFER_SIZE], because Python strings don't have a uniform representation in C. If the reader returns more characters than io.DEFAULT_BUFFER_SIZE, the copy would either overflow the buffer or have to be truncated.
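
For illustration, here is a rough Python sketch of the contract being checked; fill_buffer and its shape are illustrative stand-ins, not aiocsv's actual internals:

    import io

    async def fill_buffer(reader) -> str:
        # Per the io documentation, read(size) must return AT MOST size
        # characters; the parser sizes its fixed buffer around that guarantee.
        chunk = await reader.read(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) > io.DEFAULT_BUFFER_SIZE:
            # A longer chunk cannot fit into the fixed-size array, hence the
            # ValueError seen in the traceback above.
            raise ValueError(
                f"reader has read {len(chunk)} bytes, which is more than "
                f"the requested {io.DEFAULT_BUFFER_SIZE} bytes"
            )
        return chunk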

Fortunately, buffer is dynamically allocated, so I can work around the issue by resizing it.
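
A minimal Python analogue of that workaround might look like the following; the real fix operates on the C-level Py_UCS4 array, and the names here are again only illustrative:

    import array
    import io

    # Illustrative stand-in for the parser's Py_UCS4 buffer: one unsigned
    # int slot per code point (Py_UCS4 is a 32-bit code point in CPython).
    buffer = array.array("I", [0]) * io.DEFAULT_BUFFER_SIZE

    async def fill_buffer(reader) -> int:
        global buffer
        chunk = await reader.read(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) > len(buffer):
            # Instead of raising ValueError, reallocate the buffer so an
            # over-long chunk from a misbehaving reader still fits.
            buffer = array.array("I", [0]) * len(chunk)
        for i, character in enumerate(chunk):
            buffer[i] = ord(character)  # copy each code point, Py_UCS4-style
        return len(chunk)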

dnikolayev commented 6 months ago

Thanks for your fast response.

I can mention this issue to the aiofile project too.

aiocsv 1.2.5 didn't have such a check (raising the ValueError), so it worked well? (Did the problem with aiofile exist before?)

MKuranowski commented 6 months ago

> aiocsv 1.2.5 didn't have such a check

Yes.

> so it worked well?

Both yes and no. With 1.3.0 I rewrote the parser in C. This fixed #19, and should make issues like #17 and #10 less likely.

> (Did the problem with aiofile exist before?)

Most likely, yes.

dnikolayev commented 6 months ago

The C rewrite sounds promising - it could improve speed on big files. Please let me know if the issue can be resolved on your side, or whether I should ping the aiofile author to help :)

MKuranowski commented 6 months ago

Just published 1.3.1 with a fix

dnikolayev commented 6 months ago

Hello,

Thanks a lot! Just tested, and I can confirm that 1.3.1 works with aiofile on Linux with caio/libaio for native Linux async file access :) Again, thanks a lot.

I will also do a more extensive run on substantial data files. Do you think there will be a speed improvement from the C part in aiocsv's parse loop?

MKuranowski commented 6 months ago

aiocsv has always used something compiled for parsing CSV. Until 1.2.0 it deferred to the builtin csv module, but that was plagued by #5. 1.2.0 introduced a parser written in Cython, and in 1.3.0 I reworked the parser in pure C.

I did have a quick-and-dirty benchmark (for _ in csv.reader(big_file): pass) and I would say that:
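For reference, a runnable version of that kind of quick-and-dirty benchmark could look like the sketch below; the file path and the aiocsv comparison are assumptions added for illustration, not the author's actual setup:

    import asyncio
    import csv
    import time

    import aiofile
    from aiocsv import AsyncReader

    PATH = "big_file.csv"  # placeholder: any sufficiently large CSV

    def bench_sync() -> float:
        # Time the builtin csv module over the whole file.
        start = time.perf_counter()
        with open(PATH, newline="") as big_file:
            for _ in csv.reader(big_file):
                pass
        return time.perf_counter() - start

    async def bench_async() -> float:
        # Time aiocsv's AsyncReader over the same file, read via aiofile.
        start = time.perf_counter()
        async with aiofile.async_open(PATH, "r") as afp:
            async for _ in AsyncReader(afp):
                pass
        return time.perf_counter() - start

    print(f"builtin csv.reader: {bench_sync():.2f}s")
    print(f"aiocsv AsyncReader: {asyncio.run(bench_async()):.2f}s")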

The codebase also has a Python implementation of the parser, but it is so bad that it should never be used. In case it somehow is, a warning is emitted.

dnikolayev commented 6 months ago

Thanks, Mikołaj! Cool. I appreciate your effort on this module. For some tasks, the async approach is much more helpful, even with slower processing, as it "wins" CPU time for other code.