epiverse-connect / epiverse-search


Design an Epiverse-aware LLM #4

Open Bisaloo opened 2 months ago

Bisaloo commented 2 months ago

Considerations:

paulkorir commented 2 months ago

We can download an offline LLM and tune it using tools like llama-index.
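A rough sketch of what this could look like, purely for illustration: the directory name and the query are hypothetical, and note that out of the box LlamaIndex talks to a hosted model, so a truly offline setup would additionally need its embedding model and LLM swapped for locally-run ones via the library's settings.

```python
# Hedged sketch: retrieval over package documentation with llama-index.
# "package_docs/" is a hypothetical directory of documentation files.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load plain-text/markdown documentation files into Document objects
documents = SimpleDirectoryReader("package_docs/").load_data()

# Build an in-memory vector index over the documents and query it
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Which package estimates the reproduction number?"))
```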

Bisaloo commented 1 month ago

As discussed last week with the WHO PEI team, I wonder if we should have a generative AI engine under the hood.

A simpler alternative would be to use a language model to compute embeddings for each package based on its documentation, and then find the nearest neighbour(s) of the search-query embedding.
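As a concrete illustration of the idea (everything here is a placeholder, not a decision: the model all-MiniLM-L6-v2 via sentence-transformers, and the two package descriptions):

```python
# Minimal sketch: embed package descriptions and a search query,
# then pick the package whose embedding is nearest to the query's.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder descriptions; real input would be each package's documentation
package_docs = {
    "incidence2": "Compute, handle and visualise incidence from linelist data.",
    "epiparameter": "Library of epidemiological parameters such as serial intervals.",
}

names = list(package_docs)
# Normalised embeddings, so a plain dot product equals cosine similarity
doc_emb = model.encode(list(package_docs.values()), normalize_embeddings=True)
query_emb = model.encode("how do I plot an epidemic curve?", normalize_embeddings=True)

scores = doc_emb @ query_emb  # cosine similarity of the query vs each package
print(names[int(np.argmax(scores))])  # nearest-neighbour package
```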

There are a couple of open questions, such as the impact of longer or shorter documentation, and the length difference between package embeddings and the query embedding.

What do you think of this idea?

paulkorir commented 1 month ago

I don't know too much about embedding spaces, but from what I know, the dimensionality of the embedding space will be a function of the embedding model we choose. The higher the dimension of the space, the sparser it will be, meaning that any nearest-neighbour search would require considerable calibration. Just my intuition. From what I've read, embedding vectors make the most sense when used in the context of some LLM and may be hard to work with independently.

I'm not averse to the stochasticity of the search results; I think savvy users will not be surprised by it. I've read about some frameworks for preventing hallucinations, so we should not be too worried about that either.

@Bisaloo I wrote to you asking if it made sense to merge both into one project. What do you think? If both are tightly integrated, then I would propose treating them as one tool with two perspectives on the same data: tool discoverability.

Bisaloo commented 1 month ago

> I don't know too much about embedding spaces, but from what I know, the dimensionality of the embedding space will be a function of the embedding model we choose. The higher the dimension of the space, the sparser it will be, meaning that any nearest-neighbour search would require considerable calibration. Just my intuition.

I don't think this is an issue. This has been a known need and a very active area of research for at least the past 15 years. There are many very good tools to deal with it, such as annoy, among many others.
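For example, a minimal annoy sketch (random vectors stand in for real package embeddings, and the 384 dimension assumes a small sentence-embedding model):

```python
# Approximate nearest-neighbour search with Annoy over dummy embeddings.
import random
from annoy import AnnoyIndex

DIM = 384  # must match the chosen embedding model's output dimension

index = AnnoyIndex(DIM, "angular")  # angular distance ~ cosine similarity
for i in range(100):  # pretend we have 100 package embeddings
    index.add_item(i, [random.gauss(0, 1) for _ in range(DIM)])
index.build(10)  # 10 trees; more trees -> better recall, bigger index

query = [random.gauss(0, 1) for _ in range(DIM)]
print(index.get_nns_by_vector(query, 5))  # ids of the 5 nearest packages
```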

> From what I've read, embedding vectors make the most sense when used in the context of some LLM and may be hard to work with independently.

I confirmed in #9 that early iterations with this idea produced sensible results.

> @Bisaloo I wrote to you asking if it made sense to merge both into one project. What do you think? If both are tightly integrated, then I would propose treating them as one tool with two perspectives on the same data: tool discoverability.

Let's discuss it at our next catch-up.