deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.9k stars 602 forks

Can't process docx files even though docx2txt is installed #521

Open ShvetsIvan opened 3 months ago

ShvetsIvan commented 3 months ago

Hello,

I am trying to use textract to extract text from docx files in an AWS Lambda function using Python. The textract library is included in the package, as is its dependency, docx2txt. When I try to get the text out of the file, I still get an ExtensionNotSupported error stating that docx is not supported. I also tried putting the docx2txt library in the parsers folder, but that didn't help.


Using:

phil-scholarcy commented 2 months ago

Is the file definitely a docx file and not a .doc file masquerading as one? I find there can be issues with the following scenarios:

  1. A .docx file is given a .doc extension
  2. A .doc file is given a .docx extension
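
One way to check for either scenario is to look at the file's contents rather than its name. The sketch below is not part of textract; `sniff_word_format` is a hypothetical helper that uses only the standard library. It relies on two facts: legacy .doc files are OLE2 compound documents that start with the bytes D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives that contain a `word/document.xml` member.

```python
import sys
import zipfile

# Signature at the start of every OLE2 compound file (legacy .doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess 'doc', 'docx', or 'unknown' from file contents, ignoring the extension.

    Hypothetical helper for illustration; not a textract API.
    """
    with open(path, "rb") as fh:
        head = fh.read(len(OLE2_MAGIC))

    # Legacy binary Word documents are OLE2 containers.
    if head.startswith(OLE2_MAGIC):
        return "doc"

    # A .docx is a ZIP archive whose main body lives in word/document.xml.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            if "word/document.xml" in zf.namelist():
                return "docx"

    return "unknown"


if __name__ == "__main__":
    # Usage: python sniff.py file1.docx file2.doc ...
    for name in sys.argv[1:]:
        print(name, "->", sniff_word_format(name))
```

If the sniffed type disagrees with the extension, renaming the file so the two match is a quick way to test this theory. textract's documentation also describes an `extension` argument to `textract.process` for overriding the type inferred from the filename, which may be worth trying here.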