Can't process docx files even though docx2txt is installed

deanmalmgren / textract

extract text from any document. no muss. no fuss.

http://textract.readthedocs.io

MIT License

3.9k stars 602 forks

Can't process docx files even though docx2txt is installed #521

Open ShvetsIvan opened 3 months ago

ShvetsIvan commented 3 months ago

Hello,

I am trying to use textract to extract text from docx files in an AWS Lambda function using Python. The textract library is included in the package, as is its dependency, docx2txt. When I try to get the text out of the file, I still get an ExtensionNotSupported error stating that docx is not supported. I also tried putting the docx2txt library in the parsers folder, but that didn't help.

Using:

Textract version 1.6.3
Python version 3.11
AWS Lambda function
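
One thing worth ruling out in a setup like this is that a module fails to resolve inside the Lambda runtime even though it was added to the package. The sketch below is only a diagnostic, not part of textract: `missing_modules` is a hypothetical helper, and the module names passed to it are assumptions based on the description above. It uses the standard library's `importlib.util.find_spec` to report which names cannot be located on the import path.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located on the import path.

    Hypothetical helper for diagnosis only; it is not part of textract.
    """
    missing = []
    for name in names:
        try:
            # find_spec returns None when a top-level module cannot be found.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing,
            # or when an already-imported module has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Module names assumed from the issue description.
print(missing_modules(["textract", "docx2txt"]))
```

Running this at the top of the Lambda handler and checking the logs would show whether both packages are visible at runtime. Note that `find_spec` only locates a module; an import can still fail afterwards if one of the module's own dependencies is absent.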

phil-scholarcy commented 2 months ago

Is the file definitely a docx file and not a .doc file masquerading as one? I find there can be issues with the following scenarios:

A .docx file is given a .doc extension
A .doc file is given a .docx extension
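
Both scenarios can be checked from the file contents rather than the file name. The following is a minimal sketch using only the standard library: a .docx file is a ZIP archive that contains `word/document.xml`, while a legacy .doc file starts with the OLE2 compound file signature. The function name `sniff_word_format` is made up for this example and is not part of textract or docx2txt.

```python
import zipfile

# Signature of an OLE2 compound file, the container used by legacy .doc files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Local file header signature found at the start of a typical ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(path):
    """Return 'docx', 'doc', or 'unknown' based on the bytes in the file.

    Hypothetical helper: the file extension is deliberately ignored.
    """
    path = str(path)
    with open(path, "rb") as fh:
        header = fh.read(8)
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC) and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # The main document part of a Word .docx lives at this path.
            if "word/document.xml" in zf.namelist():
                return "docx"
    return "unknown"
```

If the detected format disagrees with the extension, renaming the file (or a copy of it) to the matching extension before handing it to textract is one way to test whether the mismatch is the cause.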