deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Change error mode to "ignore" and decrease reliance on chardet.detect() in decoding #285

Closed mevers303 closed 3 years ago

mevers303 commented 5 years ago

Made it decode with utf-8 unless chardet.detect() reports a confidence above 80%. chardet is often wrong for large PDFs: it picks the wrong charset, and decoding then raises an error. Also changed the error mode to 'ignore'. This makes it much more robust when you know all your documents are going to be in English.
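
For readers following along, the approach described above can be sketched roughly as below. This is not the PR's actual diff: the function name `decode_bytes`, the injectable `detect` parameter, and the handling of unknown codec names are assumptions made for illustration. It relies only on `chardet.detect()` returning a dict with `encoding` and `confidence` keys.

```python
def decode_bytes(raw, detect=None, threshold=0.8, fallback="utf-8"):
    """Decode bytes, trusting the detector only above `threshold`.

    Hypothetical helper, not textract's real code.
    """
    if detect is None:
        # Third-party dependency; imported lazily so a custom detector
        # can be passed in instead.
        import chardet
        detect = chardet.detect

    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    # Use the detected charset only when the detector is confident enough.
    if not encoding or confidence <= threshold:
        encoding = fallback

    try:
        # errors="ignore" silently drops bytes that cannot be decoded.
        return raw.decode(encoding, errors="ignore")
    except LookupError:
        # The detector named a codec Python does not know.
        return raw.decode(fallback, errors="ignore")
```

Note that `errors="ignore"` trades correctness for robustness: undecodable bytes are dropped silently, which is mostly harmless for English text but can lose characters in other languages.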

traverseda commented 4 years ago

I suspect this would fix #337 and #338

Would it make sense to try the chardet.detect() method first and then fall back to utf-8? Or maybe the other way around?
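
To make the "other way around" option concrete, here is a minimal sketch: try strict utf-8 first, and consult the detector only when that fails. The name `decode_utf8_first` and the `detect` parameter are hypothetical, and this is only one possible ordering, not what was merged.

```python
def decode_utf8_first(raw, detect=None):
    """Try strict utf-8, then the detected charset, then lossy utf-8.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if detect is None:
        # Third-party dependency; only needed when utf-8 fails.
        import chardet
        detect = chardet.detect

    encoding = detect(raw).get("encoding")
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass

    # Last resort: drop whatever bytes cannot be decoded.
    return raw.decode("utf-8", errors="ignore")
```

With this ordering, detection is skipped entirely for input that is already valid utf-8, and the detector's guess matters only for input that is not.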

traverseda commented 4 years ago

@jpweytjens @deanmalmgren

Any thoughts on this? Right now textract is failing with the error on a huge amount of content, and it looks like this fixes it. Would be nice to get some insight from someone with commit access.

traverseda commented 3 years ago

@mevers303 I'm now a maintainer for textract. If you'd take a second look at this, I'd appreciate it.

deanmalmgren commented 3 years ago

Sorry I missed the dialog on this one, @traverseda. This seems like a reasonable approach to me. Thanks for taking care of that one.

On Sun, Aug 15, 2021 at 11:26 AM traverseda @.***> wrote:

Closed #285 https://github.com/deanmalmgren/textract/pull/285.

traverseda commented 3 years ago

No worries, I've merged in equivalent code in #393