deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.84k stars, 585 forks

Replace Antiword with a Python alternative #468

Open SMillerDev opened 1 year ago

SMillerDev commented 1 year ago

Is your feature request related to a problem? Please describe. Antiword hasn't been updated for a while, and its source has now disappeared completely. It would be good to have an alternative way to parse Word files.

Which filetype should textract support? docx

Which external software (python or command line tool), can parse the requested file type https://pypi.org/project/docx-parser/

Describe alternatives you've considered Doing nothing, in which case package managers drop antiword and everything that depends on it, including textract.

Additional context Relates to https://github.com/Homebrew/homebrew-core/pull/131387

michelemaroni commented 1 year ago

According to the documentation, antiword is used for parsing old MS Word binary doc files (Word 97-2003), while newer MS Word docx files are parsed with python-docx2txt. It is not clear how docx-parser would help with the older Word 97-2003 files.

One issue to consider is that a file with the doc extension can be either a Word 97-2003 file or a newer Word file. Maybe abiword could be a better alternative in this regard.
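To illustrate the extension ambiguity, here is a minimal sketch (not part of textract) that checks a file's leading bytes instead of trusting its extension. It relies on two well-known signatures: Word 97-2003 files are OLE2 compound files, and docx files are ZIP containers. The function name `sniff_word_format` is hypothetical, and a match only identifies the container type; an OLE2 file could equally be an old Excel file, and a ZIP file is not necessarily a docx.

```python
# Well-known file signatures (magic bytes).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (Word 97-2003 container)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header (docx container)


def sniff_word_format(path):
    """Hypothetical helper: return 'ole2', 'zip', or 'unknown' from the first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A dispatcher could use the result to route an "ole2" file to a binary doc parser and a "zip" file to a docx parser, regardless of what the extension claims.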

SMillerDev commented 1 year ago

Thanks for pointing that out, I must have misread what antiword was actually used for. I don't actually use textract, so unfortunately I can't help much with evaluating abiword. I just wanted to make sure that the team here was aware of the disappearance of antiword.