Closed mementum closed 1 year ago
Hi @mementum
Thank You for the analysis - I must have done a mistake with the requirements file. And Yes, I'll be grateful if you can submit a PR for the cchardet
issue.
Please note anyway that the requirements.txt
file describes the development dependencies. Those include libraries used in integration tests like Flask
, requests
, etc. The setup.py
file describes the actual dependencies of blacksheep. Side note: already have in mind to migrate to pyproject.toml
, and I started doing it for other projects, but I didn´t take the time to look into packing Cython extensions that way, yet.
@mementum
It seems the problem is not only about cchardet
. It's the new version of pip
that doesn't accept packages without wheels. Also, the CI build in Windows doesn't work anymore because the --install-options
has been deprecated.
I'll need to look into this anyway - I was planning to move to pyproject.toml
as I did for other libraries, but I didn't follow the (build) news so much to know all details. I'll be grateful if you submit a PR for the cchardet
issue, I'll look into the other packages and into migrating to pyproject.toml
.
@mementum I hope you didn´t start preparing a PR in the last couple of hours, because I ended up fixing the build issue and upgrading several dependencies. After reading a bit about charset-normalizer
, I decided to replace altogether both chardet
and cchardet
with it. It looks ideal!
Sorry for the three mentions in my comments. Thank You for the heads up about charset-normalizer!
Fixed on main
- still need to fix on v1
branch and publish to PyPi.
Hi @mementum This has been fixed and released with
It wasn´t affecting only Windows, all platforms with Python 3.10 and 3.11, because of a breaking change in the latest pip
.
Thank You for reporting the issue and recommending charset-normalizer!
You could use https://github.com/faust-streaming/cChardet as active fork
After changing to charset-normalizer
, what does the benchmarks say ?
Its a pure Python library compared to cchardet
@techntools in this case performance is not an issue. The charset detection is only used when:
utf-8
(in practice, most content is encoded that way)content-type
request header)utf-8
or ISO-8859-1
BadRequest
) async def text(self):
body = await self.read()
if body is None:
return ""
try:
return body.decode(self.charset)
except UnicodeDecodeError:
# this can happen when the server returned a declared charset,
# but its content is not actually using the declared encoding
# a common encoding is 'ISO-8859-1', so before using chardet, we try with this
if self.charset != 'ISO-8859-1':
try:
return body.decode('ISO-8859-1')
except UnicodeDecodeError:
# fallback to trying to detect the encoding;
return body.decode(charset_normalizer.detect(body)['encoding'])
Issue https://github.com/Neoteroi/BlackSheep/issues/276 was opened for Python 3.11 but this also affects, for example installing under Windows with Python 3.10
The current determination of whether to use
cchardet
orcchardet
happens herehttps://github.com/Neoteroi/BlackSheep/blob/34ae49496fcb6e5809fd132fa63ba4776f900d54/blacksheep/messages.pyx#L7-L10
And it is then used here
https://github.com/Neoteroi/BlackSheep/blob/34ae49496fcb6e5809fd132fa63ba4776f900d54/blacksheep/messages.pyx#L157
The current
requirements.txt
is somehow conflicting, because it allows skipping installation ofcchardet
for Python3.11
and the same time says thatchardet
(the nonC
version, i.e.: pure python) is also requirement for all versions.https://github.com/Neoteroi/BlackSheep/blob/34ae49496fcb6e5809fd132fa63ba4776f900d54/requirements.txt#L3-L8
And
charset-normalizer
, as suggested in issue #276 should be used as a replacement forcchardet
. Surprisingly it is a requirement but it is not used.Suggested actions
Drop
cchardet
as a hard requirement for anything above3.9
. Because there are wheels up to that version. i.e. Changecchardet==2.1.7; python_version < '3.11'
tocchardet==2.1.7; python_version < '3.10'
Change the imports to also import
charset-normalizer
and use it. Thedetect
function ischardet
compatible.This is actually redundant given that
charset-normalizer
has been added a a hard requirement torequirements.txt
. A better approach would beAnd drop the pure python
chardet
altogether fromrequirements.txt
. Bothblacksheep
andcharset-normalizer
are good for python3.8 => 3.11
(which is the minimum common denominator, takingblacksheep
as the restriction)The
detect
portion of the code needs no change.Should you need a pull-request, don't hesitate to say: yes and I will submit one