Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] Cannot find highlights in the reference doc #469

Open Ruoyu-y opened 2 weeks ago

Ruoyu-y commented 2 weeks ago

Description

After I asked a question about the document I uploaded, the answer was relevant and accurate. However, no highlights appear on the reference in the information panel, which makes it hard for me to find the exact passage. The log also shows errors like this:

CitationPipeline: {"evidences":"[\"CAGRA stands for Center of Analysis and Graphics Research.\", \"It focuses on advanced research in computer graphics, visualization, and related fields.\"]"}
1 validation error for CiteEvidence
evidences
  Input should be a valid list [type=list_type, input_value='["CAGRA stands for Cente..., and related fields."]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/list_type

Any suggestions?
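
The validation error above suggests that the LLM returned `evidences` as a JSON-encoded string rather than as a list, so the value fails the list check. The following is a minimal sketch of one possible workaround, not kotaemon's actual code: `coerce_evidences` is a hypothetical helper that decodes the value a second time when it arrives as a string. Where such a step would belong in `CitationPipeline` is an assumption and has not been checked against the source.

```python
import json
from typing import Any


def coerce_evidences(arguments: str) -> list[str]:
    """Return `evidences` as a list of strings (hypothetical helper).

    `arguments` is the raw JSON text printed after "CitationPipeline:" in the log.
    """
    payload: Any = json.loads(arguments)
    evidences = payload.get("evidences", [])
    if isinstance(evidences, str):
        # The list was double-encoded as a JSON string; try decoding it again.
        try:
            evidences = json.loads(evidences)
        except json.JSONDecodeError:
            # Not valid JSON: keep the whole string as a single evidence.
            evidences = [evidences]
    if not isinstance(evidences, list):
        # A scalar came back (for example a number); wrap it in a list.
        evidences = [evidences]
    return [str(item) for item in evidences]


# The exact payload from the log, kept as a raw string so the backslashes survive.
raw = r'{"evidences":"[\"CAGRA stands for Center of Analysis and Graphics Research.\", \"It focuses on advanced research in computer graphics, visualization, and related fields.\"]"}'

for evidence in coerce_evidences(raw):
    print(evidence)
```

With the logged payload, this yields two separate evidence strings instead of one string that fails validation. An alternative under the same assumption would be a "before" validator on the `evidences` field of the Pydantic model.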

Reproduction steps

1. Set up Kotaemon following the guide
2. Upload your own files
3. Ask a question related to an uploaded file
4. Observe that no highlights appear in the reference shown in the information panel

Screenshots

![DESCRIPTION](LINK.png)

Logs

User-id: 1, can see public conversations: True
Session reasoning type None
Session LLM None
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x748972b54310>, FSPath=PosixPath('/home/sdp/kotaemon/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x748972b56470>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x74894d3fef20>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x74894d3fd6c0>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x74894d3fce20>), mmr=False, rerankers=[CohereReranking(cohere_api_key='<COHERE_API_KEY>', model_name='rerank-multilingual-v2.0')], retrieval_mode='hybrid', top_k=10, user_id=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x748af1dd25c0>, FSPath=<theflow.base.unset_ object at 0x748af1dd25c0>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x748af1dd25c0>, VS=<theflow.base.unset_ object at 0x748af1dd25c0>, file_ids=[], user_id=<theflow.base.unset_ object at 0x748af1dd25c0>)]
searching in doc_ids ['9f0e4d1f-2f61-4f7a-8e3b-dab5ababf92f', '47f769f5-a12e-4543-9e99-9b05b2a1fd5e']
retrieval_kwargs: dict_keys(['do_extend', 'scope', 'filters'])
Number of requested results 100 is greater than number of elements in index 43, updating n_results = 43
Got 43 from vectorstore
Got 43 from docstore
Cohere API key not found. Skipping rerankings.
Got raw 10 retrieved documents
thumbnail docs 3 non-thumbnail docs 7 raw-thumbnail docs 0
retrieval step took 1.082975149154663
Got 10 retrieved documents
len (original) 24156
len (trimmed) 24156
Got 3 images
Trying LLM streaming
CitationPipeline: invoking LLM
CitationPipeline: finish invoking LLM
CitationPipeline: {"evidences":"[\"CAGRA stands for Center of Analysis and Graphics Research.\", \"It focuses on advanced research in computer graphics, visualization, and related fields.\"]"}
1 validation error for CiteEvidence
evidences
  Input should be a valid list [type=list_type, input_value='["CAGRA stands for Cente..., and related fields."]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/list_type
LLM rerank scores [1.0, 1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.7, 0.7]
Got 0 cited docs

Browsers

Chrome

OS

Linux

Additional information

No response