Add doc file converter to haystack

deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

https://haystack.deepset.ai

Apache License 2.0

16.81k stars 1.84k forks

Add doc file converter to haystack #2695

Closed. Yafaa closed this issue 7 months ago

Yafaa commented 2 years ago

Is your feature request related to a problem? Please describe. Haystack provides many file converters, but a converter for .doc files is not implemented yet.

Describe the solution you'd like: Add a .doc converter so that whenever I have documents in .doc format, I can use Haystack to convert them.

anakin87 commented 2 years ago

Hello! As you can see in the documentation, there is the DocxToTextConverter to extract text from docx. If this doesn't work for you, I suggest trying the TikaConverter class: Apache Tika can extract text from a myriad of different formats.

Yafaa commented 2 years ago

Hi! This feature request is not about .docx files but about .doc files, which are not yet supported by Haystack. DocxToTextConverter cannot convert .doc files.

julian-risch commented 2 years ago

@Yafaa You're right that the DocxToTextConverter currently cannot handle the older .doc format. Trying our Tutorial 8 (preprocessing) on Colab with an older .doc file, I can confirm that I get the error message: ValueError: file 'data/tutorial8/sample_doc.doc' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.themeManager+xml'

A quick workaround for you could be to convert your .doc to .docx and then use the DocxToTextConverter. https://github.com/python-openxml/python-docx/issues/229#issuecomment-430611713
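For reference, the workaround could look roughly like the sketch below. It assumes LibreOffice is installed and its `soffice` binary is on the PATH, which is not necessarily the approach described in the linked comment. The helper names `build_convert_command` and `convert_doc_to_docx` are hypothetical, and the commented-out usage assumes the Haystack 1.x import path `haystack.nodes`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(doc_path: Path, out_dir: Path, soffice: str = "soffice") -> List[str]:
    """Build the LibreOffice headless command that converts one file to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path: Path, out_dir: Path) -> Path:
    """Convert a legacy .doc file to .docx and return the expected output path.

    Assumption: LibreOffice is installed locally; this is not part of Haystack.
    """
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # LibreOffice writes the result as <original stem>.docx into out_dir.
    return out_dir / (doc_path.stem + ".docx")


# Usage sketch (assumes Haystack 1.x is installed):
# from haystack.nodes import DocxToTextConverter
# docx_path = convert_doc_to_docx(Path("data/tutorial8/sample_doc.doc"), Path("converted"))
# docs = DocxToTextConverter().convert(file_path=docx_path)
```

Building the command in its own function keeps the LibreOffice invocation easy to inspect or adapt, for example when the binary lives at a non-standard location.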

I will move this issue to our backlog so that we can work on a better solution. Community contributions welcome! 👍

julian-risch commented 2 years ago

Just saw your PR. 👍