MKuranowski / aiocsv

Python: Asynchronous CSV reading/writing
https://pypi.org/project/aiocsv/
MIT License

Decode bytes to unicode if reader receives bytes instead of string #6

Closed ben-pearce closed 3 years ago

ben-pearce commented 3 years ago

This pull request adds support for decoding bytes sent to the asynchronous parser.

This is useful in cases where you have no control over the type of data provided by the async reader object.

For example, when reading a CSV over the network with aiohttp, the content arrives as bytes, which need to be decoded to Unicode:

import aiohttp
from aiocsv import AsyncReader

# REMOTE_FILENAME and VALUES are test fixtures defined elsewhere
async def test_network_read():
    async with aiohttp.ClientSession() as session:
        async with session.get(REMOTE_FILENAME) as response:
            read_rows = [i async for i in AsyncReader(response.content)]
            assert read_rows == VALUES

Currently, this will result in the following TypeError being raised: TypeError: Expected unicode, got bytes

With this pull request, the bytes are decoded as they are read from the stream, which also improves performance: data is parsed while the stream is still being consumed.

MKuranowski commented 3 years ago

Currently, this will result in the following TypeError being raised: TypeError: Expected unicode, got bytes

Yes, that's intentional, and that's how the built-in csv module works.

See issue #2.

I'm sorry, but that's just not the place to do the decoding.

In synchronous Python one can use io.TextIOWrapper to wrap a binary stream into a text stream without any hassle. Writing such a wrapper for async is also a matter of a few lines. That's the reason I'm closing the PR without merging.
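Such an async wrapper could look like the following: a minimal sketch (the class name and its exact interface are hypothetical, not part of aiocsv), wrapping any object with an async read(size) method that returns bytes, and exposing an async read(size) that returns str. It uses an incremental decoder so that multi-byte code points split across chunks are handled correctly.

```python
import codecs


class AsyncTextReaderWrapper:
    """Hypothetical sketch of an async analogue of io.TextIOWrapper.

    Wraps an async binary reader (anything with ``async def read(size) -> bytes``)
    and presents ``async def read(size) -> str``, as expected by consumers of
    text streams.
    """

    def __init__(self, binary_reader, encoding: str = "utf-8", errors: str = "strict"):
        self._reader = binary_reader
        # An incremental decoder buffers trailing bytes of an incomplete
        # code point until the next chunk arrives.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    async def read(self, size: int = 1024) -> str:
        while True:
            chunk = await self._reader.read(size)
            # final=True flushes the decoder's internal buffer at EOF
            text = self._decoder.decode(chunk, final=not chunk)
            # If a chunk decoded to nothing (it consisted entirely of a
            # partial code point), keep reading so an empty return value
            # unambiguously signals end of stream.
            if text or not chunk:
                return text
```

The loop around the decode call is the one subtlety: without it, a chunk made up entirely of a partial code point would decode to an empty string and look like end-of-stream to the caller.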


Also, I have some feedback on the PR itself.

  1. It doesn't allow specifying the encoding or the encoding error handling.
  2. The feature set for reading and writing should be symmetric, and this PR only adds support for reading from a binary stream.
  3. It's unsafe to call decode on arbitrarily sliced chunks. The parser in your PR reads in 1024-byte chunks, and there's no guarantee that it won't stop reading in the middle of a code point.

    # Let's pretend this is our stream, which will be read in chunks of 7.
    a = "いろはにほへとちりぬるをわかよたれそつねならむうゐのおくやまけふこえてあさきゆめみしゑひもせす"
    
    # As a string, there are no issues with chunked reading
    for i in range(0, len(a), 7):
        a[i:i+7]
    
    # Now let's try to read the same data as a binary stream in 7-byte chunks
    b = a.encode("utf-8")
    for i in range(0, len(b), 7):
        b[i:i+7].decode("utf-8")
    # A UnicodeDecodeError is raised: each character is 3 bytes long,
    # so a 7-byte chunk ends in the middle of a code point
  4. Why, oh why, is the internet needed to test these changes? This looks super odd and suspicious. aiocsv doesn't do anything related to network communication, so why would it need aiohttp to test these changes? Hardcoding a GitHub URL in the tests forces everyone to change them when forking, and the tests will fail whenever GitHub is down or if I decide to move this repository.
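For reference, the chunk-splitting problem from point 3 is exactly what the standard library's incremental decoders solve: a decoder from codecs.getincrementaldecoder buffers the trailing bytes of an incomplete code point until the next chunk arrives. A short sketch using the same stream as above:

```python
import codecs

# The same stream as above: 3-byte characters read in 7-byte chunks,
# so almost every chunk boundary falls inside a code point.
a = "いろはにほへとちりぬるをわかよたれそつねならむうゐのおくやまけふこえてあさきゆめみしゑひもせす"
b = a.encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")("strict")
decoded = ""
for i in range(0, len(b), 7):
    # Incomplete trailing bytes are buffered, not raised as errors
    decoded += decoder.decode(b[i:i+7])
decoded += decoder.decode(b"", final=True)  # flush any buffered bytes

assert decoded == a  # round-trips without a UnicodeDecodeError
```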
ben-pearce commented 3 years ago

Alright, I've noted the issues with my PR. Thanks for your feedback.