MKuranowski / aiocsv

Python: Asynchronous CSV reading/writing
https://pypi.org/project/aiocsv/
MIT License

Doesn't handle quoted newlines #5

Closed AudriusButkevicius closed 3 years ago

AudriusButkevicius commented 3 years ago

Given a file test.csv:

h1,h2,h3
foo1,bar1,"baz: 1
nuff: 1
"
foo2,bar2,"baz: 2
nuff: 2
"

and the following test case:

import asyncio
import csv

import aiofiles
import aiocsv

path = "test.csv"


async def main():
    print("SYNC")
    with open(path, encoding="utf-8", mode="r", newline="") as fd:
        for record in csv.DictReader(fd, dialect="unix"):
            print(record)

    print("ASYNC")
    # "async with" and "async for" are only valid inside a coroutine
    async with aiofiles.open(path, encoding="utf-8", mode="r", newline="") as afd:
        async for record in aiocsv.AsyncDictReader(afd, dialect="unix"):
            print(record)


asyncio.run(main())

outputs:

SYNC
OrderedDict([('h1', 'foo1'), ('h2', 'bar1'), ('h3', 'baz: 1\nnuff: 1\n')])
OrderedDict([('h1', 'foo2'), ('h2', 'bar2'), ('h3', 'baz: 2\nnuff: 2\n')])

ASYNC
OrderedDict([('h1', 'foo1'), ('h2', 'bar1'), ('h3', 'baz: 1\n')])
OrderedDict([('h1', 'nuff: 1'), ('h2', None), ('h3', None)])
OrderedDict([('h1', '\n'), ('h2', None), ('h3', None)])
OrderedDict([('h1', 'foo2'), ('h2', 'bar2'), ('h3', 'baz: 2\n')])
OrderedDict([('h1', 'nuff: 2'), ('h2', None), ('h3', None)])
OrderedDict([('h1', '\n'), ('h2', None), ('h3', None)])

The problem stems from the fact that _read_until doesn't handle quoting: it stops at the first line terminator, even inside a quoted field. The buffer therefore does not contain the next line when csv.reader asks for it, so csv.reader assumes the end of the file has been reached.
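
The failure mode described above can be reproduced without aiocsv. The sketch below is an assumption about the mechanism, not aiocsv's actual code: the hypothetical helper rows_line_by_line hands csv.reader exactly one physical line at a time, while rows_whole_stream lets csv.reader pull as many lines as it needs from the stream.

```python
import csv
import io

# Same content as test.csv from the report above.
DATA = (
    'h1,h2,h3\n'
    'foo1,bar1,"baz: 1\nnuff: 1\n"\n'
    'foo2,bar2,"baz: 2\nnuff: 2\n"\n'
)


def rows_line_by_line(text):
    """Parse each physical line in isolation (hypothetical helper)."""
    rows = []
    for line in io.StringIO(text, newline=""):
        # The reader only ever sees one line, so an open quoted field
        # is cut off when its input runs out.
        rows.extend(csv.reader([line], dialect="unix"))
    return rows


def rows_whole_stream(text):
    """Let csv.reader pull further lines while a quoted field is open."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect="unix"))


broken = rows_line_by_line(DATA)
correct = rows_whole_stream(DATA)
print(len(broken), len(correct))
```

With this input the line-by-line variant produces seven rows (header included), split into the same fragments as the ASYNC output above, while the whole-stream variant produces the expected three.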

MKuranowski commented 3 years ago

That's a very interesting issue, and I would say it is caused by the whole line-by-line buffering design.

Since that's wrong I guess I'll have to change the way all readers operate. I have an idea and I will see if it works over the next few days.

whg517 commented 3 years ago

Hi @MKuranowski.

When I read the data with the built-in csv module, I do not set 'dialect'. The newline character in the data is '\n', and the data is read normally.

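# Synchronous read with the built-in csv module: prints every row
with open(data_file, 'r', encoding='GBK') as d_obj:
    for line in csv.DictReader(d_obj):
        print(line)

# Asynchronous read with aiocsv (run inside a coroutine): prints nothing
async with aiofiles.open(data_file, 'r', encoding='GBK') as f_obj:
    async for line in aiocsv.AsyncDictReader(f_obj):
        print(line)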

While debugging, I saw that csv.reader.dialect.lineterminator = "\r\n".

But when I replace the built-in reader with aiocsv, no data is output. While debugging, I found that '_read_until' returns an empty string because there is no '\r\n' in the data.

I can manually set dialect="unix" to make the program run properly. But I think some of the logic is incomplete, including the problem reported above.

Since the built-in csv reader is written in C, I haven't had much time to review its implementation, so I can only give feedback on usage. If I have more time later, I'd like to help refine this part of the logic.

Finally, thank you for making this library. I hope this part can be improved soon.

MKuranowski commented 3 years ago

@whg517

"I did not set 'dialect'. The newline character of the data was set to '\n'."

Well, that's a confusing thing about Python's csv. Its default dialect declares CRLF line terminators, but the reader ignores that setting and recognizes either '\r' or '\n' as end-of-line (https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator).
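
A quick check of that behaviour with only the standard library may help. This is a minimal sketch; the helper name read_rows is hypothetical, and the inputs are made-up two-row samples.

```python
import csv
import io


def read_rows(text):
    """Parse text with csv.reader and the default ("excel") dialect."""
    # newline="" keeps the original line endings, as the csv docs recommend.
    return list(csv.reader(io.StringIO(text, newline="")))


# The default dialect declares CRLF as its line terminator...
default_terminator = csv.excel.lineterminator

# ...yet the reader accepts LF, CRLF and bare CR input alike.
lf_rows = read_rows("a,b\n1,2\n")
crlf_rows = read_rows("a,b\r\n1,2\r\n")
cr_rows = read_rows("a,b\r1,2\r")
print(repr(default_terminator), lf_rows, crlf_rows, cr_rows)
```

All three inputs parse to the same two rows, which is why the built-in reader worked on '\n'-terminated data without a dialect being set.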

Aiocsv actually uses dialect.lineterminator to detect row ends, which is, as noted in this issue, incorrect. I don't want to re-implement CSV parsing on my own, and the issue here sort of stems from the fact that you can't really call async code from sync code.

When csv.reader would like to get more data from the file, it iterates over it (file.__iter__ and __next__), which is synchronous; and I can't really hook that into an asynchronous file-like object.
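
The constraint can be illustrated with the standard library alone. In this sketch, CountingLines and async_lines are hypothetical names introduced for illustration: the first counts how many lines csv.reader pulls synchronously for a single record, the second shows that an asynchronous source is rejected outright.

```python
import csv


class CountingLines:
    """Synchronous line iterator that counts pulls made by csv.reader."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.pulls = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.pulls += 1
        return next(self._lines)


async def async_lines():
    """Asynchronous source: it has __aiter__ but no __iter__."""
    yield "a,b\n"


# One record whose last field contains quoted newlines.
source = CountingLines(['foo1,bar1,"baz: 1\n', 'nuff: 1\n', '"\n'])
first_row = next(csv.reader(source, dialect="unix"))

# csv.reader needs a synchronous iterable, so an async generator fails.
try:
    csv.reader(async_lines())
    rejected = False
except TypeError:
    rejected = True

print(first_row, source.pulls, rejected)
```

The reader pulls three lines inside a single next() call to finish the record. There is no point in that call where an await could be placed, which is the obstacle described above.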

whg517 commented 3 years ago

Hi @MKuranowski.

Thank you for your reply. I understand what you mean.

Functionally, there is nothing wrong with it.

MKuranowski commented 3 years ago

I gave in and implemented my own parser in Cython, which should work exactly like the CPython one - and therefore handle newlines in exactly the same way.

Unfortunately, as a consequence, the line_num attribute doesn't work.

Download aiocsv >= 1.2.0.