deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Transient AGPL Dependency `EbookLib` #409

Closed by thehale 2 years ago

thehale commented 2 years ago

Describe the bug

I currently work on a project that uses textract to help parse PDFs and Word documents. It looks great and works well!

However, when I ran an analysis of its license dependencies, I discovered that textract depends on EbookLib which is AGPL licensed. Legally, that means that textract should also carry an AGPL license, as should all programs which use textract, unless textract switches to a different dependency that doesn't use a strict copyleft license.

One alternative epub parsing package that shows promise is epub2txt which carries an MIT license.

To Reproduce

Steps to reproduce the behavior:

  1. Create a new project with a requirements.txt containing textract as the only dependency.
  2. Install the license checker liccheck and follow its documentation for creating a liccheck.ini file, but with no approved licenses. (This will force the tool to list out all dependencies with their licenses.)
  3. Review the report on the license dependencies (shown in the Screenshots section).
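
For readers who want a quick cross-check without liccheck, the same idea can be sketched with the Python standard library alone. This is not how liccheck works internally; it is a rough approximation that reads the `License` field and the `License ::` trove classifiers from each installed distribution's metadata. The helper names and the `COPYLEFT_MARKERS` list are assumptions made for this sketch, and it prints no dependency chains.

```python
from importlib.metadata import distributions

# Substrings treated as AGPL markers. Illustrative assumption, not an
# authoritative list of copyleft license names.
COPYLEFT_MARKERS = ("affero", "agpl")


def declared_licenses(meta):
    """Return the license strings declared in one distribution's metadata."""
    found = []
    license_field = meta.get("License")
    if license_field and license_field.strip().upper() != "UNKNOWN":
        found.append(license_field.strip())
    # Trove classifiers look like "License :: OSI Approved :: MIT License".
    for classifier in meta.get_all("Classifier") or []:
        if classifier.startswith("License ::"):
            found.append(classifier.split("::")[-1].strip())
    return found


def is_flagged(licenses):
    """True if any declared license string contains an AGPL marker."""
    return any(
        marker in lic.lower() for lic in licenses for marker in COPYLEFT_MARKERS
    )


def flagged_distributions():
    """Map each installed distribution with a flagged license to its licenses."""
    flagged = {}
    for dist in distributions():
        licenses = declared_licenses(dist.metadata)
        if is_flagged(licenses):
            name = dist.metadata.get("Name") or "(unnamed)"
            flagged[name] = licenses
    return flagged


if __name__ == "__main__":
    for name, licenses in sorted(flagged_distributions().items()):
        print(f"{name}: {licenses}")
```

Run inside the virtual environment where textract is installed, this lists every installed package whose declared license mentions AGPL. Package metadata is self-reported, so an empty result is not proof of a clean dependency tree.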

Expected behavior

An MIT-licensed library like textract should not (in fact, legally cannot) depend on an AGPL-licensed library.

Screenshots

gathering licenses...
14 packages and dependencies.
check unknown packages...
14 packages.
    argcomplete (1.8.2): ['Apache Software']
      dependency:
          argcomplete << textract
    beautifulsoup4 (4.5.3): ['MIT']
      dependency:
          beautifulsoup4 << textract
    chardet (2.3.0): ['GNU Library or Lesser General Public License (LGPL)']
      dependency:
          chardet << textract
    docx2txt (0.6): UNKNOWN
      dependency:
          docx2txt << textract
    EbookLib (0.15): ['GNU Affero General Public']     // *** NOTE THE AGPL LICENSE HERE ***//
      dependency:
          EbookLib << textract
    lxml (4.7.1): ['BSD']
      dependencies:
          lxml << EbookLib << textract
          lxml << python-pptx << textract
    Pillow (9.0.0): ['Historical Permission Notice and Disclaimer (HPND)']
      dependency:
          Pillow << python-pptx << textract
    pocketsphinx (0.1.3): ['BSD']
      dependency:
          pocketsphinx << textract
    python-pptx (0.6.5): ['MIT']
      dependency:
          python-pptx << textract
    six (1.10.0): ['MIT']
      dependencies:
          six << EbookLib << textract
          six << textract
    SpeechRecognition (3.6.3): ['BSD']
      dependency:
          SpeechRecognition << textract
    textract (1.6.1): ['MIT']
      dependency:
          textract
    xlrd (1.0.0): ['BSD']
      dependency:
          xlrd << textract
    XlsxWriter (3.0.2): ['BSD']
      dependency:
          XlsxWriter << python-pptx << textract

Additional context

N/A

deanmalmgren commented 2 years ago

This is an excellent catch, @jhale1805. Thank you. Any chance you (or anyone else in this community) might be able to put together a PR for this?

Looking at epub2txt, it's probably pretty straightforward to update the code here.
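
To give a sense of why swapping the EPUB backend is a small change: an EPUB file is a ZIP archive of XHTML documents, so plain-text extraction comes down to reading those documents and stripping the markup. The sketch below uses only the standard library. It is not epub2txt and not what any textract release does; `epub_to_text` and `_TextCollector` are hypothetical names, and chapters are read in sorted filename order as a simplifying assumption rather than in the reading order declared by the book's package file.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def epub_to_text(path):
    """Return the visible text of every (X)HTML document in an EPUB archive."""
    parts = []
    with zipfile.ZipFile(path) as archive:
        # Assumption: sorted filename order stands in for the real reading order.
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                collector.feed(archive.read(name).decode("utf-8", errors="replace"))
                collector.close()
                parts.extend(collector.chunks)
    return "\n".join(parts)
```

A call such as `epub_to_text("book.epub")` returns one text chunk per line. A dedicated library is still the better choice in practice, since it handles reading order, encodings, and malformed files.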

thehale commented 2 years ago

As long as I can figure out how to run the test suite to make sure I don't break anything, that doesn't look too bad.

I'll see if I can get that done within the next week or so.

deanmalmgren commented 2 years ago

Amazing; TYTYTYTY! If the test suite causes you any problems, just let me know. You may need to update the test data in this directory

thehale commented 2 years ago

@deanmalmgren I just opened PR #411 which fixes the AGPL dependency issue.

SIDE NOTE: I did have a lot of trouble with the tests. For anyone else contributing to the project, I recommend ignoring the setup instructions in the Contributing.rst (they're outdated) and instead following the commands used in .travis.yml to configure your environment and run the unit test suite.

deanmalmgren commented 2 years ago

Thanks so much for your work on this, @jhale1805, and for your patience while I merged your PR. This is now released in 1.6.5.