Open edoyango opened 2 months ago
Tried prepending each chunk with the article's title and authors, but it didn't help at all.
When querying the database for papers authored by Julie Iskander, the Chroma DB similarity search failed to notice that "Julie Iskander" was in the prepended author list.
Changing the surrounding text didn't change anything either, e.g. printing a dict rather than a sentence didn't really help.
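For reference, the prepending step can be sketched roughly as below. This is a minimal illustration, not the code in the repo: the function name `prepend_header`, the header wording, the placeholder title, and the second author name are all made up for the example.

```python
def prepend_header(chunk_text, title, authors):
    """Return the chunk with a one-sentence title/author header in front."""
    header = f"This text is from the article '{title}' by {', '.join(authors)}."
    return f"{header}\n\n{chunk_text}"


# Hypothetical inputs: the title, second author and chunk texts are placeholders.
title = "Example Paper Title"
authors = ["Julie Iskander", "A. N. Other"]
chunks = ["First chunk of the paper body.", "Second chunk of the paper body."]

# Every chunk carries the same header, so the author names get embedded
# together with the body text of each chunk.
annotated = [prepend_header(c, title, authors) for c in chunks]
print(annotated[0])
```

The header is only a small fraction of each chunk, which may be part of why the similarity search does not weight it much.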
Alternatively, I could probably use metadata filtering instead: https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/#filtering-on-metadata
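To show what the metadata-filtering idea would buy, here is a small pure-Python stand-in: restrict the candidates by author first, then rank what is left by similarity. It does not use Chroma's API (see the link above for the real filter syntax); the record layout, the `search` function, and the toy 2-d vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, author=None, k=2):
    """Rank records by similarity, optionally keeping only one author's chunks."""
    candidates = [r for r in records if author is None or author in r["authors"]]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


# Toy records: the vectors stand in for real embeddings.
records = [
    {"id": "A", "vector": [1.0, 0.0], "authors": ["Someone Else"]},
    {"id": "B", "vector": [0.6, 0.8], "authors": ["Julie Iskander", "Someone Else"]},
    {"id": "C", "vector": [0.0, 1.0], "authors": ["Julie Iskander"]},
]
query = [1.0, 0.0]

unfiltered = search(query, records, k=1)
filtered = search(query, records, author="Julie Iskander")
print([r["id"] for r in unfiltered], [r["id"] for r in filtered])
```

With the filter applied, the author match is exact and no longer depends on the embedding model noticing the name inside the chunk text.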
I wanted to try other models that might perform better with publication documents (e.g. specter2).
This led me to try langchain_huggingface.embeddings.HuggingFaceEmbeddings instead of langchain_community.embeddings.ollama.OllamaEmbeddings, because Ollama has pretty limited compatibility with models (ref), whereas HuggingFaceEmbeddings works for more models. These changes are on a separate branch (https://github.com/WEHI-ResearchComputing/rag/tree/ollama-to-hf).
Looking at the MTEB leaderboard, I tried Alibaba-NLP/gte-large-en-v1.5, which gave me better results.
Unlike with mxbai-embed-large-v1, a relevant chunk with the added author information was pulled. But there were two papers that I annotated, so it was only half right (better than the previous models though).
Salesforce/SFR-Embedding-Mistral didn't do too well, despite being higher ranked and larger than Alibaba-NLP/gte-large-en-v1.5.
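The embedding swap itself could look roughly like the sketch below. It is not the code on the branch: `build_embeddings` is a hypothetical helper, and the `trust_remote_code` switch is an assumption based on some Hugging Face models (gte-large-en-v1.5 among them, as far as I know) shipping custom model code.

```python
def build_embeddings(model_name, trust_remote_code=False):
    """Create a HuggingFaceEmbeddings object for the given Hugging Face model id."""
    # Imported lazily so this file can be loaded without the package installed.
    from langchain_huggingface import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(
        model_name=model_name,
        # Forwarded to the underlying sentence-transformers model constructor.
        model_kwargs={"trust_remote_code": trust_remote_code},
    )


# Usage (downloads the model on first call, so left commented out here):
# embeddings = build_embeddings("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
# vector = embeddings.embed_query("papers authored by Julie Iskander")
```

The returned object can be passed wherever the Ollama embeddings object was used before, since both implement LangChain's embeddings interface.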
OK, I now understand that to use HuggingFaceEmbeddings, a model has to have a sentence-transformers version available. If it doesn't, langchain will convert the model to a sentence-transformer, but the converted model still needs to be trained (i.e., it will produce nonsense as-is).
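One way to check this before picking a model is to look for a `modules.json` file in the model repo, which, as far as I know, is the file sentence-transformers uses to recognise its own models. Treat that as an assumption; both function names below are hypothetical helpers.

```python
def has_sentence_transformers_config(repo_files):
    """True if the file listing contains the sentence-transformers module config."""
    return "modules.json" in repo_files


def check_repo(repo_id):
    """Look up a model repo on the Hugging Face Hub (needs network access)."""
    # Imported lazily so the pure check above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return has_sentence_transformers_config(list_repo_files(repo_id))


# Offline illustration with made-up file listings:
print(has_sentence_transformers_config(["config.json", "modules.json"]))
print(has_sentence_transformers_config(["config.json", "pytorch_model.bin"]))
```

A model that fails this check is likely to fall back to the automatic conversion described above.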
Currently, the database doesn't seem to pull papers based on author(s), e.g.
Which is partially right, as it's one of the papers included in the dataset. Interestingly, data/1-s2.0-S0266352X20300379-main-1.pdf (my other paper included in the dataset) was thought to be more relevant, but was not mentioned by the LLM, probably because the database returned a chunk from later in the paper.
Another example:
Need to figure out how to get the database to return/recognise author information.