deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars 599 forks

Remove EbookLib dependency #411

Closed: thehale closed this 2 years ago

thehale commented 2 years ago

EbookLib is licensed under the AGPL, which makes it incompatible with textract's MIT license.

This commit replaces EbookLib with a BSD-3-licensed library that parses ebook contents just as easily.

Note: an ePub is essentially a ZIP archive of HTML documents, so extracting its text could technically be done without any ePub-specific dependency.
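
To illustrate the note above, here is a minimal sketch that pulls text out of an ePub using only the Python standard library. It is not the code from this PR: `epub_to_text` and `_TextCollector` are hypothetical names, the content documents are assumed to be UTF-8, and files are read in archive order rather than the reading order declared in the ePub's package document.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def epub_to_text(source):
    """Return the text of every (X)HTML file in an ePub.

    `source` is a path or a binary file-like object. Assumption: the
    content documents are UTF-8; undecodable bytes are replaced.
    """
    parts = []
    with zipfile.ZipFile(source) as archive:
        # Archive order, not necessarily the book's reading order.
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                parts.extend(parser.chunks)
    return "\n".join(parts)
```

A more faithful extractor would read `META-INF/container.xml` to find the package document and follow its spine, so that chapters come out in reading order.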

Fixes deanmalmgren/textract#409

coveralls commented 2 years ago

Coverage increased (+0.2%) to 92.15% when pulling e81913bca80c98cca396a021f96d77e90409c287 on jhale1805:non_agpl_epub_extractor into 902028f5d47ed40180202bbde59d4941863a6281 on deanmalmgren:master.

thehale commented 2 years ago

Note: an ePub is essentially a ZIP archive of HTML documents, so extracting its text could technically be done without any ePub dependency.

UPDATE: This has been accomplished in the latest passing commits on this PR. More details are in the commit messages for anyone interested.

thehale commented 2 years ago

@deanmalmgren Just following up to make sure you saw this PR...

deanmalmgren commented 2 years ago

Thanks for this PR @jhale1805