PromptEngineer48 / Ollama

This repo brings numerous use cases from the Open Source Ollama
Apache License 2.0

Unable to ingest .docx files #15

Open arehan opened 6 months ago

arehan commented 6 months ago

I've been following the steps in the README and the video tutorial, but I can't get a .docx file to ingest successfully. It works fine with .pdf. Is there anything I need to look into? This is what I get when I run python3 ingest.py:

`Creating new vectorstore
Loading documents from source_documents
Loading new documents: 0%| | 0/2 [00:02<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/rehan.arif/Documents/Chat with docs/Ollama/2-ollama-privateGPT-chat-with-docs/ingest.py", line 84, in load_single_document
    return loader.load()
           ^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/document_loaders/word_document.py", line 122, in _get_elements
    from unstructured.partition.docx import partition_docx
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/unstructured/partition/docx.py", line 6, in <module>
    import docx
ModuleNotFoundError: No module named 'docx'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rehan.arif/Documents/Chat with docs/Ollama/2-ollama-privateGPT-chat-with-docs/ingest.py", line 161, in <module>
    main()
  File "/Users/rehan.arif/Documents/Chat with docs/Ollama/2-ollama-privateGPT-chat-with-docs/ingest.py", line 151, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/rehan.arif/Documents/Chat with docs/Ollama/2-ollama-privateGPT-chat-with-docs/ingest.py", line 113, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rehan.arif/Documents/Chat with docs/Ollama/2-ollama-privateGPT-chat-with-docs/ingest.py", line 102, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
ModuleNotFoundError: No module named 'docx'`

DocMinus commented 1 day ago

For .docx you would need `pip install python-docx`, if I am not mistaken. The same goes for .pptx: `pip install python-pptx`.
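To check this before re-running ingest.py, something like the sketch below could help. It is not part of the repo; `OPTIONAL_LOADER_DEPS` and `missing_packages` are names made up for illustration. It only tests whether the `docx` and `pptx` modules can be found by the interpreter you run it with, and prints the matching pip command if not.

```python
import importlib.util

# Import name -> PyPI package name (the two differ for both libraries).
# This mapping is a hypothetical helper, not something ingest.py defines.
OPTIONAL_LOADER_DEPS = {
    "docx": "python-docx",
    "pptx": "python-pptx",
}


def missing_packages(deps=OPTIONAL_LOADER_DEPS):
    """Return the PyPI names of packages whose import module cannot be found."""
    return [
        package
        for module, package in deps.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Using "python3 -m pip" targets the same interpreter that runs ingest.py.
        print("Missing; try: python3 -m pip install " + " ".join(missing))
    else:
        print("python-docx and python-pptx are both importable.")
```

Run it with the same python3 you use for ingest.py; if several Python installs are present, a package installed into a different one would still show up as missing here.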