Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] - When uploading more than one PDF, only one PDF gets indexed #176

Closed aug2umbc closed 2 months ago

aug2umbc commented 2 months ago

Description

I have tried this multiple times, both with and without "Forced Index file" checked. Each time the outcome is the same: the first PDF gets indexed fine, but the second one causes a crash. This occurs even if I upload and index the files one at a time.

Any help would be appreciated and allow me to use this application.

I am using Ollama for both embedding and chat. Below are my screenshots. Ollama works great for me, but uploading more than one PDF causes issues:


Reproduction steps

I am booting up the application using "run_windows.bat" and using Ollama LLMs. 

This is an incredible tool with tremendous potential. I would greatly appreciate any help troubleshooting the problem below so that I can keep using it.

I get a "Connection Errored Out" message when uploading more than one file. The first file gets uploaded and indexed fine, but uploading more than one file generates the above message.

I have uploaded a GIF of the error message and a screen recording here: https://drive.google.com/drive/folders/1o_JQxq-Qp8FZMz4q5Tp1BLYjYKmz7MOQ?usp=sharing

Screenshots

No response

Logs

No response

Browsers

Firefox, Chrome, Microsoft Edge

OS

Windows

Additional information

A GIF and an MP4 screen recording of the error are at the following link: https://drive.google.com/drive/folders/1o_JQxq-Qp8FZMz4q5Tp1BLYjYKmz7MOQ?usp=sharing

phv2312 commented 2 months ago

Hi, does this happen with all PDF files? Can you try multiple but different PDF files? I suspect there may be something wrong with the second PDF. Would you mind sending us the second file so we can investigate further?

aug2umbc commented 2 months ago

I have uploaded PDFs to the same Google Drive folder: https://drive.google.com/drive/folders/1fdHT5cKTsxkLpF3aXuv27VhxoY07fSgt?usp=sharing

There are a total of 4 PDFs that I was doing trial and error with.

aug2umbc commented 2 months ago

Hi, does this happen with all PDF files? Can you try multiple but different PDF files? I suspect there may be something wrong with the second PDF. Would you mind sending us the second file so we can investigate further?

Yes, the issue happens with all PDF files. I have just confirmed that with Ollama it occurs regardless of the size of the PDFs or the order in which they are uploaded. I have tried both nomic-embed-text:latest and mxbai-embed-large:latest as the embedding model. Same outcome as shown in my screen recordings (https://drive.google.com/drive/folders/1o_JQxq-Qp8FZMz4q5Tp1BLYjYKmz7MOQ?usp=sharing)
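In case it is useful for debugging, one way to rule out the embedding model itself (assuming a default Ollama install listening on localhost:11434) is to call the embeddings endpoint directly:

curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "hello world"}'

If that returns an embedding vector, the failure is more likely in the indexing pipeline than in Ollama itself.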

Thank you for your quick reply.

matiasdev30 commented 2 months ago

@phv2312 can I fix this issue?

taprosoft commented 2 months ago

It seems this issue happens with Ollama on Windows. Please check if this is reproducible @phv2312. Thanks for your report @aug2umbc.

matiasdev30 commented 2 months ago

I can check @phv2312

phv2312 commented 2 months ago

Sure @matiasdev30, your help is more than welcome.

phv2312 commented 2 months ago

Hi @aug2umbc, sorry for the late reply. Can you try installing the following: pip install chromadb==0.5.0? I have found some similar problems with Chroma here: https://github.com/chroma-core/chroma/issues/2513. It suggests that downgrading to chromadb 0.5.0 and chroma-hnswlib 0.7.3 will work. I have tried it on my machine and it works. Can you try it on your machine too?
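For completeness, a sketch of the full downgrade suggested in that thread (the chroma-hnswlib pin comes from the linked issue and may not be strictly required):

pip install chromadb==0.5.0 chroma-hnswlib==0.7.3
pip show chromadb chroma-hnswlib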

aug2umbc commented 2 months ago

pip install chromadb==0.5.0

Installing chromadb==0.5.0 worked!

Thank you so much. This will help others as well.

Niko-La commented 2 months ago

Installing chromadb==0.5.0 worked!

I bashed into the Docker container to verify:

root@3e9c3d102a33:/app# pip show chromadb | grep Version
Version: 0.5.0

Dockerfile

RUN --mount=type=ssh pip install --no-cache-dir -e "libs/kotaemon[all]" \
    && pip install --no-cache-dir -e "libs/ktem" \
    && pip install --no-cache-dir graphrag future \
    && pip install --no-cache-dir "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements" \
    && pip install --no-cache-dir llama-index-vector-stores-milvus \
    && pip install --no-cache-dir chromadb==0.5.0


@phv2312 I think this is embedding as expected? It would still be nice to have an easier way to let the user know a file has been indexed properly, perhaps a simple checkmark.

The error

❌ | fall21-bs-knowlgeandskills.pdf: RetryError[<Future at 0x7054283420b0 state=finished raised APIConnectionError>]

I suppose this is more of a local Ollama issue.
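If it helps, the APIConnectionError above can also be a Docker networking issue: the container cannot reach an Ollama server that is only listening on the host's localhost. Assuming Docker Desktop, one quick check from inside the container (hostname and port are the Ollama defaults, adjust if yours differ) is:

curl http://host.docker.internal:11434/api/tags

If that fails, pointing the Ollama base URL used by the app at host.docker.internal instead of localhost is worth trying.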