Teddy-XiongGZ / MedRAG

Code for the MedRAG toolkit
https://teddy-xionggz.github.io/benchmark-medical-rag/

Pre-computed embeddings not downloading? #20

Open ddofer opened 2 weeks ago

ddofer commented 2 weeks ago

Pre-computed embeddings of Contriever/MedCPT/SPECTER on PubMed/Textbooks/Wikipedia will now be automatically downloaded when initializing a MedRAG object for the first time.

When running on MedCorp, the pre-computed embeddings are not downloaded. Instead, they are computed locally. (I can see them being computed one by one for PubMed, never mind the other sources.)

medrag = MedRAG(
    llm_name=LL_NAME,
    rag=True,
    retriever_name="MedCPT",
    corpus_name="MedCorp",  # 3.5 hours
    corpus_cache=True,
    HNSW=True,
)
answer, snippets, scores = medrag.answer(question=question, options=options, k=12)

[In progress] Embedding the pubmed corpus with the ncbi/MedCPT-Article-Encoder retriever...
No sentence-transformers model found with name ncbi/MedCPT-Article-Encoder. Creating a new one with CLS pooling.
  4%|███▍                                                                                        | 44/1166 [02:46<5:35:22, 17.93s/it]

Environment: WSL2. The corpora were already downloaded and indexed with BM25 in an earlier run (including MedCorp, Textbooks, and StatPearls).

multydoffer commented 2 weeks ago

Where can I download the corpora?

ddofer commented 2 weeks ago

The issue may be related to an interrupted download or a previous partial indexing run, but I'm not sure.

Teddy-XiongGZ commented 2 weeks ago

Please remove the current embedding folder and try the code again. The pre-computed embeddings won't be downloaded if there is an embedding directory already (see https://github.com/Teddy-XiongGZ/MedRAG/blob/main/src/utils.py#L164).
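
For anyone hitting the same problem, here is a minimal sketch of the cleanup step suggested above. It is not part of MedRAG: the helper `remove_embedding_dirs` is hypothetical, and both the directory name `embedding` and the corpus root you pass in are assumptions about the local layout, so check them against your own files first. It defaults to a dry run that only lists what would be deleted.

```python
import shutil
from pathlib import Path


def remove_embedding_dirs(corpus_root, dir_name="embedding", dry_run=True):
    """Find (and optionally delete) every directory named `dir_name` under `corpus_root`.

    Hypothetical helper, not part of MedRAG. The default `dir_name` is an
    assumption about how the embedding folder is named on disk.
    Returns the list of matching paths, whether or not they were deleted.
    """
    root = Path(corpus_root)
    # Collect the matches before deleting so removal does not disturb the traversal.
    matches = sorted(p for p in root.rglob(dir_name) if p.is_dir())
    if not dry_run:
        for path in matches:
            # A nested match may already be gone if its parent was removed first.
            if path.exists():
                shutil.rmtree(path)
    return matches
```

A possible way to use it is to call `remove_embedding_dirs("./corpus")` first and inspect the returned paths, then call it again with `dry_run=False` once the list looks right (`./corpus` is assumed here; substitute your actual corpus directory). After that, re-running the MedRAG initialization should no longer find an existing embedding directory.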