deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Drop python2 support #433

Open tehabstract opened 2 years ago

tehabstract commented 2 years ago

Dropping Python 2 support and loosening up dependencies. Please comment if you want the dependencies specified in a different format, or want any other changes, and I will adjust.

Introduced openpyxl for xlsx files. Updated 2 test files:

Updated the Travis, Vagrant, and Dockerfile setup in tests.

Bumped the version to 1.7.0 and added it to the changelog.

Thanks

twolfvb commented 1 year ago

@deanmalmgren Any chance this could get looked into? Python 2 reached end of life on January 1, 2020, and the older packages that textract pins to keep working with 2.7 cause dependency conflicts. In particular, our team would appreciate bumping pdfminer.six to a newer version.

pdfminer.six >= 20200726 is required for using unstructured, which is required by langchain!

thehunmonkgroup commented 1 year ago

Quick note: I've tested this patch lightly, and the only problem I've found so far relates to a change in Python's subprocess module, where the `mswindows` attribute is now the private `_mswindows`:

diff --git a/textract/parsers/utils.py b/textract/parsers/utils.py
index 11ec8a1..efb0d9c 100755
--- a/textract/parsers/utils.py
+++ b/textract/parsers/utils.py
@@ -83,7 +83,7 @@ class ShellParser(BaseParser):
         """

         # run a subprocess and put the stdout and stderr on the pipe object
-        if subprocess.mswindows:
+        if subprocess._mswindows:
             startupinfo = subprocess.STARTUPINFO()
             startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
         else:

Otherwise it's been working well for me.
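
For anyone who would rather not depend on a private attribute of `subprocess`, here is a minimal sketch of the same check using the public `sys.platform` value instead. This is not part of the patch: the names `IS_WINDOWS`, `build_startupinfo`, and `run` are hypothetical and only illustrate the idea, and the `Popen` call is a simplified stand-in for what `ShellParser` does.

```python
import subprocess
import sys

# Public, documented way to detect Windows; avoids subprocess._mswindows.
IS_WINDOWS = sys.platform == "win32"


def build_startupinfo():
    """Return a STARTUPINFO that sets STARTF_USESHOWWINDOW on Windows, else None."""
    if not IS_WINDOWS:
        return None
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    return startupinfo


def run(args):
    """Run a command and return (returncode, stdout, stderr) as bytes."""
    pipe = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        startupinfo=build_startupinfo(),  # None is the default on every platform
    )
    stdout, stderr = pipe.communicate()
    return pipe.returncode, stdout, stderr
```

Something like `run([sys.executable, "-c", "print('ok')"])` should then behave the same on Windows and elsewhere, with the Windows-only attributes touched only inside the guarded branch.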