MKuranowski / aiocsv

Python: Asynchronous CSV reading/writing
https://pypi.org/project/aiocsv/
MIT License

ValueError when reader reads more than requested #24

Closed · dnikolayev closed this 6 months ago

dnikolayev commented 6 months ago

Hello,

First of all, thanks for your work and your time!

Python 3.12. The code below works perfectly with aiofile (not aiofiles) and libaio-1 (Linux async support) on aiocsv 1.2.5; however, 1.3.0 raises a ValueError.

Code example:

    async with aiofile.async_open('support/geo_city_public.csv', 'r', encoding='utf-8-sig') as afp:
        async for row in AsyncDictReader(afp, delimiter=";"):
            ...  # process each row here

Error on Linux (Debian LTS):

        async for row in AsyncDictReader(afp, delimiter=";"):
      File "......./venv312/lib/python3.12/site-packages/aiocsv/readers.py", line 90, in __anext__
        self.fieldnames = await self.reader.__anext__()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "......./venv312/lib/python3.12/site-packages/aiocsv/readers.py", line 43, in __anext__
        return await self._parser.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: reader has read 16381 bytes, which is more than the requested 8192 bytes

On macOS, 1.3.0 works, but I was able to reproduce the error by removing encoding='utf-8-sig' from the async_open call. In both cases the version of aiofile is the same; however, on macOS the reported count was one byte higher (ValueError: reader has read 16382 bytes, which is more than the requested 8192 bytes). Not sure if this reproduction of the ValueError on macOS relates to the original problem.

For now, I have reverted to aiocsv 1.2.5; the code works perfectly on huge CSV files (2-3 GB). Please let me know if you need anything else (to test, for instance).

Thanks!

MKuranowski commented 6 months ago

This looks like an issue with aiofile; the problem is exactly what the raised ValueError indicates.

The documentation for read in Python clearly states:

> Read and return at most size characters

aiofile has returned more than the requested size characters.


aiocsv essentially invokes await reader.read(io.DEFAULT_BUFFER_SIZE) and then copies the result into a C array Py_UCS4 buffer[io.DEFAULT_BUFFER_SIZE], because Python strings don't have a uniform representation in C. If the reader returns more characters than io.DEFAULT_BUFFER_SIZE, the copy would either overflow the buffer or have to be truncated.
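
For illustration, here is a rough Python sketch of the contract being checked; fill_buffer and its shape are illustrative stand-ins, not aiocsv's actual internals:

    import io

    async def fill_buffer(reader) -> str:
        # Per the io documentation, read(size) must return AT MOST size
        # characters; the parser sizes its fixed buffer around that guarantee.
        chunk = await reader.read(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) > io.DEFAULT_BUFFER_SIZE:
            # A longer chunk cannot fit into the fixed-size array, hence the
            # ValueError seen in the traceback above.
            raise ValueError(
                f"reader has read {len(chunk)} bytes, which is more than "
                f"the requested {io.DEFAULT_BUFFER_SIZE} bytes"
            )
        return chunk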

Fortunately, buffer is dynamically allocated, so I can work around the issue by resizing it.
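
A minimal Python analogue of that workaround might look like the following; the real fix operates on the C-level Py_UCS4 array, and the names here are again only illustrative:

    import array
    import io

    # Illustrative stand-in for the parser's Py_UCS4 buffer: one unsigned
    # int slot per code point (Py_UCS4 is a 32-bit code point in CPython).
    buffer = array.array("I", [0]) * io.DEFAULT_BUFFER_SIZE

    async def fill_buffer(reader) -> int:
        global buffer
        chunk = await reader.read(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) > len(buffer):
            # Instead of raising ValueError, reallocate the buffer so an
            # over-long chunk from a misbehaving reader still fits.
            buffer = array.array("I", [0]) * len(chunk)
        for i, character in enumerate(chunk):
            buffer[i] = ord(character)  # copy each code point, Py_UCS4-style
        return len(chunk)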

dnikolayev commented 6 months ago

Thanks for your fast response.

I can mention this issue to the aiofile project too.

aiocsv 1.2.5 didn't have such a check (raising the ValueError), so it worked well? (Did the problem with aiofile exist before?)

MKuranowski commented 6 months ago

> aiocsv 1.2.5 didn't have such a check

Yes.

> so it worked well?

Both yes and no. With 1.3.0 I rewrote the parser in C. This fixed #19, and should make issues like #17 and #10 less likely.

> (Did the problem with aiofile exist before?)

Most likely, yes.

dnikolayev commented 6 months ago

The C rewrite sounds promising - it could improve speed on big files. Please let me know if the issue can be resolved on your side, or whether I should ping the aiofile author to help :)

MKuranowski commented 6 months ago

Just published 1.3.1 with a fix

dnikolayev commented 6 months ago

Hello,

Thanks a lot! Just tested, and I can confirm that 1.3.1 works with aiofile on Linux with caio/libaio for native Linux async file access :) Again, thanks a lot.

I will also do a more extensive run on substantial data files. Do you think there will be a speed improvement from the C part in aiocsv's parse loop?

MKuranowski commented 6 months ago

aiocsv has always used something compiled for parsing CSV. Until 1.2.0 it deferred to the builtin csv module, but that was plagued by #5. 1.2.0 introduced a parser written in Cython, and in 1.3.0 I reworked the parser in pure C.

I did have a quick-and-dirty benchmark (for _ in csv.reader(big_file): pass) and I would say that:
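For reference, a runnable version of that kind of quick-and-dirty benchmark could look like the sketch below; the file path and the aiocsv comparison are assumptions added for illustration, not the author's actual setup:

    import asyncio
    import csv
    import time

    import aiofile
    from aiocsv import AsyncReader

    PATH = "big_file.csv"  # placeholder: any sufficiently large CSV

    def bench_sync() -> float:
        # Time the builtin csv module over the whole file.
        start = time.perf_counter()
        with open(PATH, newline="") as big_file:
            for _ in csv.reader(big_file):
                pass
        return time.perf_counter() - start

    async def bench_async() -> float:
        # Time aiocsv's AsyncReader over the same file, read via aiofile.
        start = time.perf_counter()
        async with aiofile.async_open(PATH, "r") as afp:
            async for _ in AsyncReader(afp):
                pass
        return time.perf_counter() - start

    print(f"builtin csv.reader: {bench_sync():.2f}s")
    print(f"aiocsv AsyncReader: {asyncio.run(bench_async()):.2f}s")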

The codebase also has a Python implementation of the parser, but it is so bad that it should never be used. In case it somehow is, a warning is emitted.

dnikolayev commented 6 months ago

Thanks, Mikołaj! Cool. I appreciate your effort on this module. For some tasks, the async approach is much more helpful, even with slower processing, as it "wins" CPU time for other code.