Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] Word docx failing embedding #435

Open vap0rtranz opened 5 days ago

vap0rtranz commented 5 days ago

Description

Embedding fails for Word .docx files.

The unstructured loader/reader raises an error (see Logs).

This is using nomic-embed-text as the embedding model.

Reproduction steps

1. In the UI, select "Click to Upload" and attach a local Word .docx file
2. Select "Upload and Index"
3. See the error shown under Logs

Logs

Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f984bfba020>
No module named 'unstructured'
Traceback (most recent call last):
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/ktem/index/file/pipelines.py", line 795, in stream
    file_id, docs = yield from pipeline.stream(
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/ktem/index/file/pipelines.py", line 642, in stream
    docs = self.loader.load_data(file_path, extra_info=extra_info)
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/kotaemon/loaders/unstructured_loader.py", line 70, in load_data
    from unstructured.partition.auto import partition
ModuleNotFoundError: No module named 'unstructured'

Browsers

No response

OS

Linux

Additional information

No response

KKenny0 commented 1 day ago

The 'unstructured' module might not be installed. You can install it with pip: `pip install unstructured`
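
Before reinstalling anything, it may help to confirm which modules the active virtual environment can actually see. The following is a minimal sketch, not part of kotaemon: `missing_modules` is a hypothetical helper name, and the assumption that .docx parsing also needs the `docx` module (from the python-docx package) is mine, not something the traceback states. Run it with the same interpreter that launches the app.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'unstructured' comes from the traceback in this issue.
    # 'docx' (python-docx) is an assumption about what .docx parsing needs.
    missing = missing_modules(["unstructured", "docx"])
    print("interpreter:", sys.executable)
    print("missing:", ", ".join(missing) if missing else "none")
```

If the printed interpreter path is not inside the venv from the traceback, pip probably installed into a different environment. If `docx` turns out to be missing, installing the format-specific extra (for example `pip install "unstructured[docx]"`, assuming your version of unstructured provides it) may be needed in addition to the base package.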

vap0rtranz commented 3 hours ago

Hmm, OK I installed unstructured. It was indeed not installed. Now there's a different error that blocks the indexing.

It may be faster to reinstall but I've had installation issues: https://github.com/Cinnamon/kotaemon/issues/425