deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Switch chardet for charset-normalizer #395

Closed (Ousret closed this 1 year ago)

Ousret commented 3 years ago

As briefly discussed in https://github.com/deanmalmgren/textract/pull/393, this PR proposes replacing chardet with charset-normalizer.

I can address any concerns you may have.
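
For context, charset_normalizer exposes a chardet-compatible detect() helper, so the change is largely a drop-in swap. A minimal sketch of the idea (the actual call sites in textract may differ, and decode_bytes is an illustrative name, not textract's API):

    # charset_normalizer.detect() returns the same shape as chardet.detect():
    # a dict containing an "encoding" key (which may be None).
    from charset_normalizer import detect

    def decode_bytes(raw):
        # Hypothetical helper: fall back to utf-8 if detection fails.
        encoding = detect(raw)["encoding"] or "utf-8"
        return raw.decode(encoding, errors="replace")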

coveralls commented 3 years ago

Coverage decreased (-0.7%) to 91.176% when pulling 2e5b23f5288c5894845ebc227037d2a1e28f3d9e on Ousret:patch-chardet-migration into 902028f5d47ed40180202bbde59d4941863a6281 on deanmalmgren:master.

traverseda commented 3 years ago

Awesome, I won't have a chance to really look at this until the weekend. I'd also like to cut the final 1.6 release before merging this, as it's a fairly big change and I do consider it a feature.

traverseda commented 3 years ago

As an aside, the release this PR goes into does not need to be compatible with Python 2; the release before it is the last one where we care about Python 2 support. Not sure if that changes anything for you, but there are some changes I made in textract for Python 2 that I plan on undoing in the next release, as they made the code uglier.

Ousret commented 3 years ago

As you requested, this PR no longer cares about Python 2 backward compatibility.

Ousret commented 3 years ago

The tests fail and I am not sure why.

    test_png.py FFF..
    test_tiff.py FFF..

traverseda commented 3 years ago

I have some ideas, I'll take a look this weekend. Seems like some kind of issue related to newline handling.

You can run the tests locally with nosetests3 if memory serves.

jpweytjens commented 3 years ago

There are multiple libraries that perform better than chardet.

Some simple code I had lying around lets users choose among them.

import chardet
import bs4
import charset_normalizer
import ftfy
import magic

def decode(self, text, encoding, encoding_errors="strict"):
    """Decode the text if necessary, using either the encoding detected by
    the chosen detection method or a manually provided encoding."""

    # Already decoded: nothing to do.
    if isinstance(text, str):
        return text

    if encoding == "chardet":
        result = chardet.detect(text)
        original_encoding = result["encoding"]

    elif encoding == "unicodedammit":
        original_encoding = bs4.UnicodeDammit(text).original_encoding

    elif encoding == "charset_normalizer":
        result = charset_normalizer.detect(text)
        original_encoding = result["encoding"]

    elif encoding == "ftfy":
        original_encoding = ftfy.guess_bytes(text)[1]

    else:
        # Any other value is treated as the encoding itself.
        original_encoding = encoding

    return text.decode(original_encoding, encoding_errors)
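
For illustration, a hedged usage sketch of the snippet above (it looks like a method pulled out of its class, so a placeholder is passed for self here):

    raw = "héllo wörld".encode("latin-1")

    # Detection path: charset_normalizer guesses the encoding from the bytes.
    text = decode(None, raw, encoding="charset_normalizer")

    # Manual path: any value not matching a detector name is used directly.
    text = decode(None, raw, encoding="latin-1")
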
Ousret commented 3 years ago

@jpweytjens

I am curious about how you arrived at the following conclusion:

There are multiple libraries that perform better than chardet.

Based on what data?

Why is ftfy on your list? It is not a charset detection library (see https://news.ycombinator.com/item?id=8188129 and https://ftfy.readthedocs.io/en/latest/detect.html). You import magic without any branch that actually uses it. Where is cchardet? And so on. You mention UnicodeDammit from bs4, but do you know how it is built? https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/dammit.py#L2682

The only remaining competitors are chardet, cchardet (uchardet), magic (libmagic1), and charset_normalizer. Of these, only chardet and charset_normalizer are pure Python. Pure-Python packages are more convenient to ship, since users are not stuck when no prebuilt wheel is available for their platform.

Letting users choose could be a viable end goal, provided the libraries offer the same interface, as is done for brotli decoding (brotli or brotlicffi).
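
The brotli pattern referred to here is a try/except import fallback; a minimal sketch of the same idea applied to encoding detection, relying on the shared detect() interface both libraries expose:

    # Prefer charset_normalizer, fall back to chardet if it is not installed.
    try:
        from charset_normalizer import detect
    except ImportError:
        from chardet import detect

    result = detect(b"\xe9\xe8")   # both return a dict
    encoding = result["encoding"]  # a codec name, or None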

If you have any data available, I would gladly look at it.

traverseda commented 3 years ago

Just as an aside, I'm very busy this week and likely won't have time to look at this in-depth until maybe the 4th.

It looks very promising though!

jpweytjens commented 3 years ago

@Ousret

I was not aware of the limitations of ftfy and BeautifulSoup. My evidence is anecdotal: in some limited testing with UTF-8 encoded .txt files, chardet detected the wrong encoding in more than 80% of cases. A test set of documents would be a good addition to check which method works best for textract. I do think charset_normalizer does a much better job than chardet.
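
A minimal sketch of the kind of test harness this suggests: given a directory of text files whose true encodings are known (the samples/ layout and filenames here are assumptions, not something from this thread), count how often each detector gets it right:

    import pathlib

    import chardet
    import charset_normalizer

    # Assumed layout: samples/<true-encoding>/<file>, e.g. samples/utf-8/a.txt
    SAMPLES = pathlib.Path("samples")

    DETECTORS = {
        "chardet": lambda data: chardet.detect(data)["encoding"],
        "charset_normalizer": lambda data: charset_normalizer.detect(data)["encoding"],
    }

    for name, detector in DETECTORS.items():
        total = correct = 0
        for path in SAMPLES.rglob("*.txt"):
            truth = path.parent.name.lower()
            guess = detector(path.read_bytes())
            total += 1
            # Normalize so "UTF-8", "utf_8" and "utf-8" compare equal.
            if guess and guess.lower().replace("_", "-") == truth:
                correct += 1
        print(f"{name}: {correct}/{total} correct")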