aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org

Memory efficient encoding detection #4112

Closed decaz closed 2 weeks ago

decaz commented 4 years ago

At #2549 @asvetlov proposed a useful improvement to content encoding detection: avoid reading the whole response into memory by allowing a maximum data size to be specified for sniffing.

UniversalDetector can be used: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally.
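For reference, a minimal sketch of the incremental API from the linked docs; the helper name and the 4 KiB cap are illustrative, not part of aiohttp:

from chardet.universaldetector import UniversalDetector

def sniff_encoding(chunks, max_bytes: int = 4096):
    # Feed data chunk by chunk and stop as soon as the detector is confident
    # or the sniffing budget is exhausted, instead of scanning the whole body.
    detector = UniversalDetector()
    fed = 0
    for chunk in chunks:
        detector.feed(chunk)
        fed += len(chunk)
        if detector.done or fed >= max_bytes:
            break
    detector.close()
    return detector.result['encoding']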

asvetlov commented 4 years ago

Yes, sure. Would you make a PR?

decaz commented 4 years ago

Unfortunately I don't have time right now, maybe later. I've created the issue just so we don't forget about it.

icmpecho commented 4 years ago

@asvetlov @decaz I was taking a quick look into this and have some open questions. As far as I can see, the read method of ClientResponse has already read the entire body and put it into memory, and the get_encoding method just makes use of that in-memory body, right? So what is your expectation for this PR? Is it about changing the behaviour of read so it returns a stream, or just about splitting the already in-memory body into chunks and feeding those into the UniversalDetector?

asvetlov commented 4 years ago

We can replace `resp.read()` with `resp.content.read(2048)` for fetching the first 2 KiB buffer (I have no idea what the best size for sniffing the encoding is, though). The result of `content.read()` is not stored in `resp._body`, so we need a more complex buffering strategy.
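For illustration only, a rough sketch of that call pattern from the caller's side (not the proposed aiohttp change; chardet stands in for whatever detector is used):

import aiohttp
import chardet

async def sniff_response_encoding(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as resp:
        head = await resp.content.read(2048)  # read only the first 2 KiB of the stream
        # This prefix is not kept in resp._body, which is exactly the
        # buffering problem mentioned above.
        return chardet.detect(head)['encoding']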

Transfusion commented 4 years ago

We can replace `resp.read()` with `resp.content.read(2048)` for fetching the first 2 KiB buffer (I have no idea what the best size for sniffing the encoding is, though). The result of `content.read()` is not stored in `resp._body`

At the time of @decaz 's original PR, await self.read() was always called if the encoding was not detected (i.e. there was no charset in the Content-Type response header from the web server), so _content (aka _body) was always populated when chardet.detect was called.

However, as of now, get_encoding() always assumes _body is not None, i.e. read(), json(), or text() must be called first; otherwise it will throw an error if charset isn't in the Content-Type response header.

https://github.com/aio-libs/aiohttp/blob/a54956d94c21bfbea74153867e182fe47cf12c6d/aiohttp/client_reqrep.py#L927-L944

so we need a more complex buffering strategy.

Is this the strategy you are thinking of?
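Side note for illustration: the get_encoding() precondition described above means user code currently has to read the body before asking for the encoding. A hypothetical usage sketch, not part of the linked code:

import aiohttp

async def fetch_encoding(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        await resp.read()           # populates resp._body
        return resp.get_encoding()  # without a body this raises if no charset header is present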

asvetlov commented 4 years ago

Sorry for misleading in my previous message. The change would be the following:

  1. Read the entire body as we do right now.
  2. Pass only the first 4 KiB to chardet. A memoryview can be used (see the sketch below): encoding = chardet.detect(memoryview(self._body)[:4096])['encoding']

Did I miss something?
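A minimal sketch of those two steps, assuming self._body already holds the full response body; the helper name is illustrative, and a .tobytes() copy is used because, as the next comment notes, cchardet rejects memoryview input:

import chardet

def detect_body_encoding(body: bytes, limit: int = 4096) -> str:
    # Hand only the first `limit` bytes to the detector; the memoryview keeps
    # the slice cheap, .tobytes() produces the bytes object cchardet expects.
    head = memoryview(body)[:limit].tobytes()
    return chardet.detect(head)['encoding']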

P.S. As I see it, get_encoding() should actually be a private method, but that is another issue.

Transfusion commented 4 years ago

Memoryview cannot be passed into cchardet (the optimized Cython version):

  File "/home/transfusion/python38/lib/python3.8/site-packages/cchardet/__init__.py", line 35, in feed
    self._detector.feed(data)
TypeError: Argument 'msg' has incorrect type (expected bytes, got memoryview)

I had to use https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes, which has the same effect as directly slicing the bytes object, since it copies the data.

I have done a quick test with memory_profiler by passing some webpages incrementally into UniversalDetector(), with cchardet 2.1.4 and chardet 3.0.4.

from memory_profiler import profile
try:
    import cchardet as chardet
except ImportError:
    import chardet
import urllib.request

_bytes = b""  # filled in below with the raw body of each test page


@profile
def chunked_with_memoryview(chunk_size: int = 2 * 1024):
    # Feed the body to the detector in chunk_size slices of a memoryview,
    # stopping as soon as the detector is confident.
    detector = chardet.UniversalDetector()
    _body_memoryview = memoryview(_bytes)
    # print(len(_bytes) == len(_body_memoryview))
    for i in range((len(_body_memoryview) // chunk_size) + 1):
        _chunk = _body_memoryview[i * chunk_size: (i + 1) * chunk_size]
        _chunk_bytes = _chunk.tobytes()  # cchardet needs bytes, not memoryview
        detector.feed(_chunk_bytes)
        del _chunk_bytes
        if detector.done:
            print("chunk " + str(i) + " reached")
            break

    detector.close()
    print(detector.result)


@profile
def without_chunking():
    # Baseline: pass the entire body to detect() in one call.
    print(chardet.detect(_bytes)['encoding'])


if __name__ == "__main__":
    # SHIFT-JIS, GB18030, KOI8-R, UHC/EUC-KR
    for website in ['https://www.jalan.net/', 'http://ctrip.com',
                    'http://nn.ru', 'https://www.incruit.com/']:
        response = urllib.request.urlopen(website)
        _bytes = response.read()

        chunked_with_memoryview()
        without_chunking()

With cchardet: https://gist.github.com/Transfusion/b2075b7a08863c3e5b5afc96b119c29d
With chardet: https://gist.github.com/Transfusion/be47f38522f0a22e6dce6af95243082b

I see neither a significant decrease nor an increase in memory usage with cchardet. There is some small improvement (< 2 MB) if cchardet is not installed and the code falls back to chardet (which, by the way, can work with a memoryview). However, the same improvement can be had by simply installing cchardet, unless for some reason the platform doesn't support it.

asvetlov commented 4 years ago

I'm curious, what is the performance improvement, if any?

Dreamsorcerer commented 2 weeks ago

The recommended libraries for charset detection seem to outperform these libraries, and charset detection seems to be used only when we already have the full body, so I don't think there's any value in looking into this any further.