Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0
17.77k stars 1.38k forks

[BUG] Cannot retrieve anything after indexing a PDF file with AdobeReader #389

Closed a652 closed 1 month ago

a652 commented 1 month ago

Description

Indexing step output

use_quick_index_mode False
reader_mode adobe
Using reader <kotaemon.loaders.adobe_loader.AdobeReader object at 0x15ff5dc30>
Got 0 page thumbnails
Adding documents to doc store
Getting embeddings for 66 nodes
Adding embeddings to vector store
indexing step took 6.741843223571777

Chat step output

User-id: 1, can see public conversations: True
Session reasoning type None
Session LLM None
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x1616b26e0>, FSPath=PosixPath('/Users/zhangcheng/code/python/kotaemon/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x1616b2710>, get_extra_table=True, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x1722b8700>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x1722bbdf0>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x1722b8730>), mmr=True, rerankers=[CohereReranking(cohere_api_key='', model_name='rerank-multilingual-v2.0')], retrieval_mode='hybrid', top_k=10, userid=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset object at 0x1055aee90>, FSPath=<theflow.base.unset object at 0x1055aee90>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset object at 0x1055aee90>, VS=<theflow.base.unset_ object at 0x1055aee90>, file_ids=[], userid=<theflow.base.unset object at 0x1055aee90>)]
searching in doc_ids ['8e51f681-7544-4979-95c8-e423667a1107']
retrieval_kwargs: dict_keys(['do_extend', 'scope', 'filters', 'mode', 'mmr_threshold'])
Got 0 from vectorstore
Got 0 from docstore
Cohere API key not found. Skipping rerankings.
Got raw 0 retrieved documents
thumbnail docs 0 non-thumbnail docs 0 raw-thumbnail docs 0
retrieval step took 1.1629290580749512
Got 0 retrieved documents
len (original) 0
Got 0 images
Trying LLM streaming
Got 0 cited docs
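For anyone triaging this, the mismatch between the two logs can be checked mechanically: indexing reports 66 nodes embedded, while the chat step gets 0 hits from both the vector store and the doc store. Below is a minimal sketch that pulls those counts out of the log text with regular expressions. The helper names `extract_counts` and `diagnose` and the result labels are hypothetical and not part of kotaemon; only the log phrases are taken from the output above.

```python
import re

# Log excerpts copied from the indexing and chat output in this issue.
INDEX_LOG = "Getting embeddings for 66 nodes Adding embeddings to vector store"
CHAT_LOG = (
    "Got 0 from vectorstore Got 0 from docstore "
    "Got raw 0 retrieved documents Got 0 retrieved documents"
)


def extract_counts(index_log: str, chat_log: str) -> dict:
    """Hypothetical helper: read node and hit counts from the two log texts."""
    patterns = {
        "indexed_nodes": (index_log, r"Getting embeddings for (\d+) nodes"),
        "vectorstore_hits": (chat_log, r"Got (\d+) from vectorstore"),
        "docstore_hits": (chat_log, r"Got (\d+) from docstore"),
    }
    counts = {}
    for name, (text, pattern) in patterns.items():
        match = re.search(pattern, text)
        # None means the expected log phrase was not found at all.
        counts[name] = int(match.group(1)) if match else None
    return counts


def diagnose(counts: dict) -> str:
    """Hypothetical helper: label the situation; the labels are illustrative only."""
    if not counts["indexed_nodes"]:
        return "nothing-indexed"
    if counts["vectorstore_hits"] == 0 and counts["docstore_hits"] == 0:
        return "indexed-but-not-retrieved"
    return "retrieval-returned-results"


if __name__ == "__main__":
    counts = extract_counts(INDEX_LOG, CHAT_LOG)
    print(counts)
    print(diagnose(counts))
```

With the logs from this report the label comes out as "indexed-but-not-retrieved". That only narrows the problem to somewhere between storing the nodes and looking them up; it does not say which step is at fault.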

Logs

No response

Browsers

Chrome

OS

macOS

Additional information

With the same steps, content indexed using PDFThumbnailReader can be searched correctly, but when the file is indexed with AdobeReader, nothing can be retrieved. Any suggestions?

a652 commented 1 month ago

The PDF contains Chinese content. Does the language matter?