Yafaa closed this issue 7 months ago.
Hello!
As you can see in the documentation, there is the DocxToTextConverter to extract text from docx.
If this doesn't work for you, I suggest trying the TikaConverter class: Apache Tika can extract text from a myriad of different formats.
Hi, this feature request is not about .docx files but about .doc files, which Haystack does not support yet. The DocxToTextConverter cannot convert .doc files.
@Yafaa You're right that the DocxToTextConverter currently cannot handle the older .doc format. Trying our tutorial 8 (preprocessing) on Colab with an older .doc file, I can confirm that I get the error message:
ValueError: file 'data/tutorial8/sample_doc.doc' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.themeManager+xml'
A quick workaround for you could be to convert your .doc to .docx and then use the DocxToTextConverter.
https://github.com/python-openxml/python-docx/issues/229#issuecomment-430611713
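One possible way to script that conversion is sketched below. It assumes LibreOffice is installed and its `soffice` binary is on the PATH; the helper names `build_convert_command` and `convert_doc_to_docx` are hypothetical and are not part of Haystack or python-docx.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts one file to .docx.

    Hypothetical helper for illustration; `soffice` is assumed to be on PATH.
    """
    return [
        soffice,
        "--headless",             # run without opening the GUI
        "--convert-to", "docx",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx and return the expected output path."""
    doc_path, out_dir = Path(doc_path), Path(out_dir)
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice ('soffice') was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the file stem and swaps the extension
    return out_dir / (doc_path.stem + ".docx")
```

The path returned by `convert_doc_to_docx` can then be passed to the DocxToTextConverter. The import path for that class depends on the installed Haystack version, so it is left out here.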
I will move this issue to our backlog so that we can work on a better solution. Community contributions welcome! 👍
Just saw your PR. 👍
Is your feature request related to a problem? Please describe. Haystack provides many file converters, but a converter for .doc files is not implemented yet.
Describe the solution you'd like. Add a .doc converter so that whenever I have documents in .doc format, I can use Haystack to convert them.