Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0
11.61k stars 847 forks

Issue: Error in create_base_entity_graph Step During Indexing #140

Open Tom1009840152 opened 2 weeks ago

Tom1009840152 commented 2 weeks ago

Description

Hi, First of all, thank you for this great project! I have been experimenting with it and encountered an issue that I hope you can help me with.

I cloned the repository and followed the instructions to upload and index a PDF file. The file was successfully uploaded and processed into chunks, but the process failed at the create_base_entity_graph step. Below are the details:

Reproduction steps

1. Uploaded a PDF file.
2. The file was converted to text and processed into 432 chunks.
3. The indexing process started and created base text units and extracted entities.
4. The process failed at the create_base_entity_graph step.

Screenshots

No response

Logs

Indexing [1/1]: ewfacve.pdf
 => Converting ewfacve.pdf to text
 => Converted ewfacve.pdf to text
 => [ewfacve.pdf] Processed 432 chunks
 => Finished indexing ewfacve.pdf
[GraphRAG] Creating index... This can take a long time.
Logging enabled at /app/ktem_app_data/user_data/files/graphrag/56bbcd91-b07f-4ca4-a904-cce71bdf4571/output/20240828-110528/reports/indexing-engine.log

🚀 create_base_text_units
                                   id  ... n_tokens
0    b85337bedc1f1de271961c7251d46b5c  ...      918
1    0b9e954415cd9c6376528587903a9a96  ...      748
2    d03cbffc5b626210553dd466e022d2ee  ...      891
3    d28fa2ce915c137677eaaa2aabdaa304  ...      732
4    3a7d2941f7ba32940874d0f06f9f62f4  ...     1141
..                                ...  ...      ...
132  54d5b9a2b71895341def789d88113ef4  ...     1200
133  53f5ed0fb3870cf8e59a8f55c1271f3a  ...      398
134  ea40a2366f7ad18819cfa804827eb4d4  ...     1183
135  f1e686b8519e7d93ce6e461d9fa8d1a0  ...       83
136  9eac5a3d46b4b013fef67b5c78330d89  ...      886

[275 rows x 5 columns]

🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...

🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...

❌ create_base_entity_graph
None

⠋ GraphRAG Indexer
├── Loading Input (text) - 216 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

Browsers

Chrome

OS

Windows

Additional information

No response

mohdyasa commented 2 weeks ago

I am also facing the same issue. It would help if someone could share a working .env file, as I believe we also need to add the GraphRAG config.

taprosoft commented 2 weeks ago

Hi, there is a high chance that these environment variables are not properly set.

# settings for GraphRAG
GRAPHRAG_API_KEY=openai_key
GRAPHRAG_LLM_MODEL=gpt-4o-mini
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

They go in the .env file. On Windows, you can try following this procedure: https://stackoverflow.com/questions/48607302/using-env-files-to-set-environment-variables-in-windows
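
One way to check whether these variables are actually visible to the Python process is a small preflight check like the sketch below. The helper name `missing_graphrag_vars` is hypothetical and not part of kotaemon; the three variable names are the ones from the sample settings above.

```python
import os

# Variable names taken from the sample GraphRAG settings in this thread.
REQUIRED_VARS = (
    "GRAPHRAG_API_KEY",
    "GRAPHRAG_LLM_MODEL",
    "GRAPHRAG_EMBEDDING_MODEL",
)


def missing_graphrag_vars(env=None):
    """Return the required GraphRAG variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_graphrag_vars()
    if missing:
        print("Not set in this process:", ", ".join(missing))
    else:
        print("All GraphRAG variables are visible to this process.")
```

Running this in the same shell (or container) that starts the app shows whether the values from .env were really exported.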

Tom1009840152 commented 2 weeks ago

Thank you for your answer. I have made the configuration changes and can use RAG normally, but I cannot use GraphRAG. I guess I need to configure GraphRAG?


taprosoft commented 2 weeks ago

Yes, you need to set the environment variables above. Due to a limitation of our current implementation, these GraphRAG environment variables are not read automatically from .env. We will work on an easier way to set GraphRAG parameters in the UI in the next release.

mohdyasa commented 2 weeks ago

What should go in GRAPHRAG_API_KEY=openai_key? Should I use the same key that I am using for Azure OpenAI?

Laksh-star commented 2 weeks ago

Got the same error on Mac. Setting up the environment variables solved the issue, using: dotenv run -- python app.py
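
For readers unfamiliar with `dotenv run`, the idea is roughly this: read the KEY=VALUE pairs from .env, add them to the environment, then start the app with that environment. The sketch below illustrates the idea with the standard library only. It is not how python-dotenv is implemented, it handles only simple `KEY=VALUE` lines, and the helper names `parse_env_file` and `run_with_env` are hypothetical.

```python
import os
import subprocess


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes; escapes are not handled.
            values[key.strip()] = value.strip().strip("'\"")
    return values


def run_with_env(command, env_path=".env"):
    """Run command with the .env values layered over the current environment."""
    env = {**os.environ, **parse_env_file(env_path)}
    return subprocess.run(command, env=env, check=False).returncode
```

Under these assumptions, `run_with_env([sys.executable, "app.py"])` called from the repository root (with `import sys`) would start the app with the GraphRAG variables present in its environment.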

mohdyasa commented 2 weeks ago

@Laksh-star what value did you provide in GRAPHRAG_API_KEY? I am using Azure OpenAI

Laksh-star commented 2 weeks ago

I used OpenAI, so I gave that key value for GRAPHRAG_API_KEY. Logically you should use your Azure OpenAI key, but the sample env file here gives this:

# settings for GraphRAG
GRAPHRAG_API_KEY=openai_key
GRAPHRAG_LLM_MODEL=gpt-4o-mini
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

You can try with Azure first.

mohdyasa commented 2 weeks ago

I put my Azure OpenAI key in place of openai_key in GRAPHRAG_API_KEY=openai_key, but it didn't work.

taprosoft commented 2 weeks ago

For Azure OpenAI, please follow https://microsoft.github.io/graphrag/posts/config/env_vars/ (it is a bit more complicated than with normal OpenAI).


taprosoft commented 2 weeks ago

Also use the command suggested here https://github.com/Cinnamon/kotaemon/issues/140#issuecomment-2315706967 to load the .env file at startup.

taprosoft commented 2 weeks ago

Or, alternatively: https://pypi.org/project/python-dotenv-run/

ryansh1x commented 2 weeks ago

Hi - can we use a local / Ollama embedding model instead of using something requiring an API key?

lone17 commented 2 weeks ago

@ryansh1x yes you can; anything that exposes an OpenAI-compatible API will work. You just need to add your embedding endpoint and then create a new file collection that uses that endpoint. Also, please consider opening a separate issue so that others with the same question can refer to it.

ryansh1x commented 2 weeks ago

> @ryansh1x yes you can; anything that exposes an OpenAI-compatible API will work. You just need to add your embedding endpoint and then create a new file collection that uses that endpoint. Also, please consider opening a separate issue so that others with the same question can refer to it.

Will do shortly; I appreciate your input. Once I've created the new issue, I'd appreciate a more noob-level pointer. If I can get the graph portion of this working locally, I'll be able to argue for switching to this for the majority of my use cases.

lone17 commented 2 weeks ago

@ryansh1x sure thing, feel free to try it out and submit any issues or questions. I'm not familiar with setting up the graph portion, as I'm also a noob myself, but I'm sure other folks can help you out. Cheers!

taprosoft commented 2 weeks ago

> Hi - can we use a local / Ollama embedding model instead of using something requiring an API key?

Note that GraphRAG with the Ollama OpenAI endpoint is also a bit tricky. Please wait while we streamline this process for mainstream users.

Neptoos commented 2 weeks ago

Had similar issues with Ollama and GraphRAG. Pinning the graphrag version to 0.3.2 fixed them for me; I can now use Ollama with GraphRAG. Edit this line: "RUN pip install graphrag future" to "RUN pip install graphrag==0.3.2 future", then rebuild the image.
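
After rebuilding, it may be worth confirming that the environment really contains the pinned version. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and 0.3.2 is simply the version reported to work in the comment above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    expected = "0.3.2"  # version reported to work in this thread
    found = installed_version("graphrag")
    if found is None:
        print("graphrag is not installed in this environment")
    elif found != expected:
        print(f"graphrag {found} is installed, expected {expected}")
    else:
        print(f"graphrag {found} matches the pin")
```

Running this inside the rebuilt container shows whether the image picked up the change or is still using a cached layer with a different version.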

bookandlover commented 1 week ago

GraphRAG chat does not work.

ikros98 commented 1 week ago

> Had similar issues with Ollama and GraphRAG. Pinning the graphrag version to 0.3.2 fixed them for me; I can now use Ollama with GraphRAG. Edit this line: "RUN pip install graphrag future" to "RUN pip install graphrag==0.3.2 future", then rebuild the image.

Could you please share your GraphRAG variables from the .env?

I have set the following, but it is still not working for me:

# settings for GraphRAG
GRAPHRAG_API_KEY=openai_key
GRAPHRAG_LLM_MODEL=llama3:70b
GRAPHRAG_EMBEDDING_MODEL=nomic-embed-text

TILKPROD commented 1 week ago

Hello, I also have the same problem. Replacing the line "RUN pip install graphrag future" with "RUN pip install graphrag==0.3.2 future" and rebuilding the image did not resolve it; I get the same error as Tom1009840152. Do you know of a solution, please?

clark874 commented 1 week ago

> Got the same error on Mac. Setting up the environment variables solved the issue, using: dotenv run -- python app.py

I encountered the same problem on a MacBook Pro M1, and the above method is very effective. The issue lies in the env parameters not being read correctly. I hope the next version can fix this bug. Also, I found that some of the reasoning methods cannot read the GraphRAG information.