Closed: Ousret closed this 1 year ago.
Awesome, I'm not going to have a chance to really look at this until the weekend. I'd also like to make the last 1.6 release before making this change, as it's a pretty big change and I do consider this to be a feature.
As an aside, the release this PR is going into does not need to be compatible with python2. The release before this is the last one where we care about python2 support. Not sure if that changes anything for you, but I know there are some changes I made in textract for python2 that I plan on undoing in the next release, as they made the code uglier.
As you requested, this PR no longer cares about the py2 BC.
The tests fail and I am not sure why.
test_png.py FFF..
test_tiff.py FFF..
I have some ideas, I'll take a look this weekend. Seems like some kind of issue related to newline handling.
You can run the tests locally with nosetests3, if memory serves.
There are multiple libraries that perform better than chardet. Some simple code I had lying around allows users to choose among them.
import chardet
import bs4
import charset_normalizer
import ftfy
import magic


def decode(self, text, encoding, encoding_errors="strict"):
    """Decode the text if necessary, using either the encoding detected
    by the desired encoding detection method or the manually provided
    encoding."""
    if isinstance(text, str):
        return text
    if encoding == "chardet":
        result = chardet.detect(text)
        original_encoding = result["encoding"]
    elif encoding == "unicodedammit":
        original_encoding = bs4.UnicodeDammit(text).original_encoding
    elif encoding == "charset_normalizer":
        result = charset_normalizer.detect(text)
        original_encoding = result["encoding"]
    elif encoding == "ftfy":
        original_encoding = ftfy.guess_bytes(text)[1]
    else:
        original_encoding = encoding
    return text.decode(original_encoding, encoding_errors)
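For what it's worth, the dispatch pattern above can be sketched without any of the third-party detectors. This is a minimal, stdlib-only illustration, not the PR's code: the detectors table and the naive_utf8 stub are assumptions made so the snippet runs standalone.

```python
def decode(text, encoding, encoding_errors="strict", detectors=None):
    """Decode bytes using a named detector or an explicit codec name."""
    if isinstance(text, str):
        return text  # already decoded, nothing to do
    detectors = detectors or {}
    if encoding in detectors:
        # a detector is any callable taking bytes and returning a codec name
        original_encoding = detectors[encoding](text)
    else:
        original_encoding = encoding  # treat as a literal codec name
    return text.decode(original_encoding, encoding_errors)


# Stand-in "detector" that always guesses UTF-8, for illustration only.
naive_utf8 = lambda data: "utf-8"

print(decode("déjà vu", "utf-8"))                  # str passes straight through
print(decode("déjà vu".encode("utf-8"), "utf-8"))  # explicit codec name
print(decode("déjà vu".encode("utf-8"), "naive",
             detectors={"naive": naive_utf8}))     # detector lookup
```

The point of routing every branch through a single `original_encoding` variable is that adding or removing a detector only touches the lookup, not the decode step.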
@jpweytjens
I am curious about how you arrived at the following conclusion:
There are multiple libraries that perform better than chardet.
Based on what data?
Why is ftfy on your list? It is not a charset detection library (https://news.ycombinator.com/item?id=8188129, https://ftfy.readthedocs.io/en/latest/detect.html). You import magic without any condition that leads to its usage. Where is cchardet? And so on... You mention UnicodeDammit from bs4, but do you know how it is built? https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/dammit.py#L2682
The only remaining competitors are chardet, cchardet (uchardet), magic (libmagic1), and charset_normalizer.
There aren't many native (pure-Python) solutions here: only chardet and charset_normalizer. These are more convenient to adopt, so users would not be stuck if a whl is not available for their platform.
Letting users choose can be a real end-goal solution, provided the libraries offer the same interfaces, as with brotli decoding (brotli or brotlicffi).
If you have any data available, I would gladly look at them.
Just as an aside, I'm very busy this week and likely won't have time to look at this in-depth until maybe the 4th.
It looks very promising though!
@Ousret
I was not aware of the limitations of ftfy and beautifulsoup. My evidence is anecdotal: I did some limited testing with UTF-8 encoded txt files, and chardet incorrectly detected the encoding in 80%+ of the cases. A test set of documents would be a good addition to check which method would be best for textract. I do think charset_normalizer does a much better job than chardet.
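The accuracy check suggested above could be sketched like this: run a detector over samples whose true encoding is known and count correct guesses. The always_utf8 detector and the two inline samples are assumptions for illustration; in practice you would pass chardet.detect or charset_normalizer.detect and a real corpus of files.

```python
import codecs

# (raw bytes, true encoding) pairs standing in for a real test corpus
SAMPLES = [
    ("déjà vu, naïve café".encode("utf-8"), "utf-8"),
    ("plain ascii text".encode("ascii"), "ascii"),
]


def accuracy(detect, samples):
    """Fraction of samples whose encoding the detector guesses correctly."""
    hits = 0
    for raw, truth in samples:
        guess = (detect(raw) or {}).get("encoding") or ""
        # normalize codec aliases ("UTF-8" vs "utf_8") before comparing
        try:
            same = codecs.lookup(guess).name == codecs.lookup(truth).name
        except LookupError:
            same = False
        hits += same
    return hits / len(samples)


# Stub detector that always answers UTF-8, for illustration only.
always_utf8 = lambda raw: {"encoding": "utf-8"}
print(accuracy(always_utf8, SAMPLES))  # 0.5: right on utf-8, wrong on ascii
```

Normalizing names through codecs.lookup matters because the libraries disagree on capitalization and aliases, which would otherwise skew the score.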
As briefly discussed in https://github.com/deanmalmgren/textract/pull/393, I bring this PR to propose the following: swap chardet for charset-normalizer. I can address any concerns you may have.