Investigate limitations and lessons learned from previous work

Bisaloo commented 2 months ago

Related work:

Bisaloo commented 1 month ago

Here is my summary, with fingers crossed I didn't misunderstand anything.

pkgsearch / https://www.r-pkg.org/

Powered by Elasticsearch

Pros

Two interfaces to the same engine (R package + website)
Returns a numeric value for relevance

Cons

Only compares query to the package description from DESCRIPTION
Limited ability to understand synonyms & context

llm-guidance

Pros

Specific to epi / epiverse packages
Matches both on docs and source code
Uses latest advances in NLP

Cons

Front-end as a shiny app
Database & embedding updates not automated
Relies on a paid & closed source language model

Bisaloo commented 1 month ago

I asked Adam for his advice and feedback and he confirmed he was using the approach suggested in https://github.com/epiverse-connect/epiverse-search/issues/4#issuecomment-2133422997 and it was giving satisfactory results for the most part.

Detailed blog post on the approach: https://kucharski.substack.com/p/transforming-language-models-into

The main weakness was answering very short queries, which don't have enough content & context to position the query sensibly in the embedding space.

This caveat is also discussed in a recent blog post:

Nonetheless, while embeddings are undoubtedly a powerful tool, they are not the be all and end all. First, while they excel at capturing high-level semantic similarity, they may struggle with more specific, keyword-based queries, like when users search for names (e.g., Ilya), acronyms (e.g., RAG), or IDs (e.g., claude-3-sonnet). Keyword-based search, such as BM25, are explicitly designed for this. And after years of keyword-based search, users have likely taken it for granted and may get frustrated if the document they expect to retrieve isn’t being returned.
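To make the keyword-based side of that trade-off concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not taken from pkgsearch or llm-guidance; the function names, the whitespace tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration, and the example descriptions are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical package descriptions, for illustration only.
descriptions = [
    "tools for outbreak analytics",
    "RAG pipeline helpers",
    "estimate serial interval from outbreak data",
]
scores = bm25_scores("RAG", descriptions)
```

With an exact-match scheme like this, a one-word query such as an acronym only scores documents that literally contain it, which is the behaviour the quote describes users expecting from keyword search.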

chartgerink commented 1 month ago

Thanks for this great input @Bisaloo!

If I may add to the pkgsearch option: they also provide an open API that already does a lot of heavy lifting. We can build our scrapers on top of this. The only downside is that the API is not well documented, as far as I can find. I was able to figure out the API path for one package:

https://crandb.r-pkg.org/<package>
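As a rough sketch of how a scraper could use that path: the snippet below builds the URL for a package and fetches it with the Python standard library. The helper names (`crandb_url`, `fetch_package`) are hypothetical, and it assumes the endpoint returns a JSON document; since the API is not well documented, I have not relied on any particular field in the response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CRANDB_BASE = "https://crandb.r-pkg.org"


def crandb_url(package):
    # Build the per-package path noted above: <base>/<package>.
    return f"{CRANDB_BASE}/{quote(package, safe='')}"


def fetch_package(package, timeout=10):
    # Assumption: the endpoint responds with JSON metadata for the package.
    with urlopen(crandb_url(package), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_package("pkgsearch")` would then return the parsed response as Python objects, which a scraper could store or index as needed.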

For the llm-guidance option, this is a great conceptual example. Adding to the limitations, I would propose that the Shiny app merges a lot of front-end and back-end operations in one place and requires server-side operations. In other words: it is not modular.
