Yes, sure. Would you make a PR?
Unfortunately I don't have time for that right now, maybe later. I've created the issue just so I don't forget about it.
@asvetlov @decaz I was taking a quick look into this and have some open questions. As far as I can see, the `read` method of `ClientResponse` has already read the entire body and put it into memory, and the `get_encoding` method just makes use of that in-memory body, right?
So what is your expectation for this PR? Is it about changing the behaviour of `read` so that it returns a stream, or just about splitting the already in-memory body into chunks and feeding them into the `UniversalDetector`?
We can replace `resp.read()` with `resp.content.read(2048)` for fetching the first 2 KB buffer (I have no idea what the best size for sniffing the encoding is, though). `content.read()` is not stored in `resp._body`, so we need a more complex buffering strategy.
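Roughly, that proposal amounts to something like the sketch below (`sniff_encoding` and the 2048-byte default are illustrative names for this discussion, not actual aiohttp API):

```python
import aiohttp

try:
    import cchardet as chardet
except ImportError:
    import chardet


async def sniff_encoding(resp: aiohttp.ClientResponse, sniff_size: int = 2048) -> str:
    # Read only the first couple of KB from the stream instead of the whole body.
    head = await resp.content.read(sniff_size)
    # These bytes are now consumed from the stream and do NOT end up in
    # resp._body, which is exactly the buffering problem mentioned above.
    return chardet.detect(head)["encoding"] or "utf-8"
```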
> We can replace `resp.read()` with `resp.content.read(2048)` for fetching the first 2 KB buffer (I have no idea what the best size for sniffing the encoding is, though). `content.read()` is not stored in `resp._body`
At the time of @decaz's original PR, `await self.read()` was always called if the encoding could not be detected, i.e. when there is no `charset` in the `Content-Type` response header from the web server, so `_content` (aka `_body`) would always be populated by the time `chardet.detect` was called.
However, as of now, `get_encoding()` always assumes `_body is not None`, i.e. `read()`, `json()`, or `text()` must be called first before it will work, or else it will throw an error if `charset` isn't in the `Content-Type` response header.
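For context, a simplified sketch of that current behaviour (paraphrasing the discussion, not the literal aiohttp source):

```python
try:
    import cchardet as chardet
except ImportError:
    import chardet


# Sketch of a ClientResponse method, illustrating the behaviour described above.
def get_encoding(self) -> str:
    ctype = self.headers.get("Content-Type", "").lower()
    if "charset=" in ctype:
        # The server declared a charset, so no sniffing is needed.
        return ctype.split("charset=", 1)[1].split(";")[0].strip()
    # Fall back to sniffing the already-read body; this is the part that fails
    # when read()/text()/json() has not been called and self._body is still None.
    return chardet.detect(self._body)["encoding"] or "utf-8"
```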
> so we need a more complex buffering strategy.
If `ClientResponse.read()` has already been called, then we can just run detection on the whole body, no issue. Otherwise (when `_body is None`), do we call `resp.content.read(n=2048)` and append to a `self._partial_body = bytearray()` in `ClientResponse` until the charset is detected or we reach EOF? And what happens when the user subsequently calls `read()`, `json()`, or `text()`? Inside `read()`, is all we need to do `self._body = self._partial_body.append(self.content.read())`? Because, as the name of `StreamReader` suggests, `read()` will continue where we left off? Is this the strategy you are thinking of?
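A rough sketch of the buffering strategy being asked about here, using hypothetical names (`SniffingResponse`, `_partial_body`, `_sniff_encoding`) and glossing over aiohttp internals:

```python
import aiohttp

try:
    import cchardet as chardet
except ImportError:
    import chardet


class SniffingResponse(aiohttp.ClientResponse):
    """Illustrative only: the partial-body buffering idea, not aiohttp API."""

    async def _sniff_encoding(self, chunk_size: int = 2048) -> str:
        self._partial_body = bytearray()
        detector = chardet.UniversalDetector()
        while not self.content.at_eof():
            chunk = await self.content.read(chunk_size)
            if not chunk:
                break
            self._partial_body.extend(chunk)  # keep sniffed bytes for a later read()
            detector.feed(chunk)
            if detector.done:
                break
        detector.close()
        return detector.result["encoding"] or "utf-8"

    async def read(self) -> bytes:
        # read() continues where the sniffing left off and prepends the buffered prefix.
        if self._body is None:
            rest = await self.content.read()
            self._body = bytes(getattr(self, "_partial_body", b"")) + rest
        return self._body
```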
Sorry for misleading you in my previous message. The change would be the following:

`encoding = chardet.detect(memoryview(self._body)[:4096])['encoding']`

Did I miss something?

P.S. As I see it, `get_encoding()` should actually be a private method, but that is another issue.
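The point of the `memoryview` is that slicing it does not copy the underlying data, whereas slicing the `bytes` object would; a tiny illustration:

```python
body = bytes(10 * 1024 * 1024)          # pretend 10 MB response body
copy_slice = body[:4096]                # slicing bytes copies the 4096 bytes
view_slice = memoryview(body)[:4096]    # zero-copy view into the same buffer
assert bytes(view_slice) == copy_slice
```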
A memoryview cannot be passed into `cchardet` (the optimized Cython version):

File "/home/transfusion/python38/lib/python3.8/site-packages/cchardet/__init__.py", line 35, in feed
    self._detector.feed(data)
TypeError: Argument 'msg' has incorrect type (expected bytes, got memoryview)

I had to use https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes, which has the same effect as directly slicing the `bytes` object, since it copies the data.
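In other words, the workaround ends up looking something like this (the sample body is arbitrary):

```python
import cchardet as chardet  # the optimized Cython implementation

body = "こんにちは".encode("shift_jis") * 1000   # pretend response body
view = memoryview(body)[:4096]

detector = chardet.UniversalDetector()
# detector.feed(view) raises the TypeError above, so the slice is copied
# back into bytes first, losing the zero-copy benefit of the memoryview:
detector.feed(view.tobytes())
detector.close()
print(detector.result)
```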
I have done a quick test with `memory_profiler` by passing some webpages incrementally into `UniversalDetector()`, with cchardet 2.1.4 and chardet 3.0.4.
from memory_profiler import profile

try:
    import cchardet as chardet
except ImportError:
    import chardet

import urllib.request

global _bytes


@profile
def chunked_with_memoryview(chunk_size: int = 2 * 1024):
    # Feed the body into UniversalDetector chunk by chunk via a memoryview.
    detector = chardet.UniversalDetector()
    _body_memoryview = memoryview(_bytes)
    # print(len(_bytes) == len(_body_memoryview))
    for i in range((len(_body_memoryview) // chunk_size) + 1):
        _chunk = _body_memoryview[i * chunk_size: (i + 1) * chunk_size]
        _chunk_bytes = _chunk.tobytes()
        detector.feed(_chunk_bytes)
        del _chunk_bytes
        if detector.done:
            print("chunk " + str(i) + " reached")
            break
    detector.close()
    print(detector.result)


@profile
def without_chunking():
    # Baseline: detect on the whole in-memory body at once.
    print(chardet.detect(_bytes)['encoding'])


if __name__ == "__main__":
    global _bytes
    # SHIFT-JIS, GB18030, KOI8-R, UHC/EUC-KR
    for website in ['https://www.jalan.net/', 'http://ctrip.com',
                    'http://nn.ru', 'https://www.incruit.com/']:
        response = urllib.request.urlopen(website)
        _bytes = response.read()
        chunked_with_memoryview()
        without_chunking()
With cchardet: https://gist.github.com/Transfusion/b2075b7a08863c3e5b5afc96b119c29d
With chardet: https://gist.github.com/Transfusion/be47f38522f0a22e6dce6af95243082b
I see neither a significant decrease nor an increase in memory usage with cchardet. There is a small improvement (< 2 MB) if cchardet is not installed and the code falls back to chardet (which, by the way, can work with `memoryview`). However, the same improvement can be had by simply installing cchardet, unless for some reason the platform doesn't support it.
I'm curious, what is the performance improvement, if any?
The recommended libraries for charset detection seem to outperform these libraries, and charset detection is only used once we already have the full body, so I don't think there's any value in looking at this any further.
At #2549 @asvetlov proposed a useful improvement to content-encoding detection: prevent reading the whole response into memory by allowing a maximum data size for sniffing to be specified. `UniversalDetector` can be used for this: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally.
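For reference, the incremental pattern from the linked chardet documentation looks roughly like this (the URL is just the docs' example):

```python
import urllib.request
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with urllib.request.urlopen("http://yahoo.co.jp/") as response:
    for line in response:
        detector.feed(line)
        if detector.done:      # stop as soon as the detector is confident
            break
detector.close()
print(detector.result)         # e.g. {'encoding': ..., 'confidence': ...}
```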