BoxcarsAI / boxcars

Building applications with composability using Boxcars with LLMs. Inspired by LangChain.
MIT License
434 stars, 39 forks

Embeddings with hnswlib #48

Closed jaigouk closed 1 year ago

jaigouk commented 1 year ago

There are multiple options for embeddings, but most of them are paid. The recommended requirement for Milvus is 32 GB of RAM for the standalone version, so I wanted to start with hnswlib.
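For context on what hnswlib provides: it approximates the exact nearest-neighbour search that a brute-force scan performs, but in sub-linear time via an HNSW graph. A minimal pure-Ruby sketch of the exact search being approximated (toy 3-dimensional vectors and hypothetical file names, not the gem's API):

```ruby
# Exact k-nearest-neighbour search by brute force over squared L2 distance.
# hnswlib approximates this lookup without scanning every vector.
# The vectors and document names below are made up for illustration.
DOCS = {
  'hr.md'     => [0.9, 0.1, 0.0],
  'office.md' => [0.8, 0.2, 0.1],
  'jobs.md'   => [0.1, 0.9, 0.3]
}.freeze

def squared_l2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def nearest(query, k: 2)
  # min_by(k) returns the k [name, vector] pairs with the smallest distance
  DOCS.min_by(k) { |_name, vec| squared_l2(query, vec) }.map(&:first)
end

nearest([1.0, 0.0, 0.0]) # => ["hr.md", "office.md"]
```

With a real embedding store the vectors would be 1000+ dimensions and there would be far too many documents for a linear scan, which is exactly the gap hnswlib fills.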


jaigouk commented 1 year ago

I want to add integration specs and confirm the search. After this, I'd like to add pgvector or Redis embeddings in separate PRs, so I want to clean up the embeddings directory.

jaigouk commented 1 year ago

Ready for review. @francis, when you have time, please take a look.

francis commented 1 year ago

This has the same problem as the other PR that you fixed. Can you bring that work over here to support the other Ruby versions so that these tests pass, please?

francis commented 1 year ago

@jaigouk - did you get a chance to add support for the other Ruby versions?

Also, I'm trying to come up with a nice demo of this ability. Maybe something along the lines of this - https://github.com/hwchase17/notion-qa?

jaigouk commented 1 year ago

I will rebase and check

jaigouk commented 1 year ago

Trying to resolve https://github.com/BoxcarsAI/boxcars/issues/13 with hnswlib:

```ruby
# The Notion_DB data is from https://github.com/hwchase17/notion-qa

require 'dotenv/load'
lib_path = File.expand_path('../../lib', __dir__)
$LOAD_PATH.unshift(lib_path) unless $LOAD_PATH.include?(lib_path)
require 'boxcars'

# Build (or load, if force_rebuild is false and the file exists)
# the hnswlib index over the markdown corpus
store = Boxcars::Embeddings::Hnswlib::BuildVectorStore.call(
  training_data_path: './Notion_DB/**/*.md',
  index_file_path: './hnswlib_notion_db_index.bin',
  force_rebuild: false
)
openai_client = OpenAI::Client.new(access_token: ENV.fetch('OPENAI_API_KEY', nil))

# OpenAI embeds the query; the hnswlib index answers the nearest-neighbour lookup
similarity_search = Boxcars::Embeddings::SimilaritySearch.new(
  embeddings: "#{File.dirname('./')}/hnswlib_notion_db_index.json",
  vector_store: store[:vector_store],
  openai_connection: openai_client
)
```

Running the search returns:

```ruby
{
  :document =>
    "we provide you with a laptop that suits your job. Ask HR for further info.\n" \
    "- **Workplace**: \nwe've built a pretty nice office to make sure you like being at Blendle HQ. " \
    "Feel free to sit where you want. Even better: dare to switch your workplace every once in a while.\n\n" \
    "# Work at Blendle\n\n---\n\n" \
    "If you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/).\n\n" \
    "If you want to be kept in the loop about Blendle, you can sign up for " \
    "[our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).",
  :distance => 120
}
```
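On the `:distance => 120` value: assuming the index was built with hnswlib's `l2` space (a guess, not confirmed by this thread), the reported distance is the squared Euclidean distance between the query embedding and the stored document embedding, so smaller means more similar. A toy pure-Ruby illustration of that metric (the short vectors are made up; real OpenAI embeddings have ~1500 dimensions):

```ruby
# Squared Euclidean (L2) distance, the metric hnswlib reports for an
# index built with the 'l2' space. Hypothetical 3-d vectors for illustration.
def squared_l2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

query_vec = [1.0, 2.0, 3.0]
doc_vec   = [4.0, 6.0, 3.0]
squared_l2(query_vec, doc_vec) # => 25.0
```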

I guess we can refine the interface further in other PRs later. For pgvector, I found https://github.com/ankane/neighbor and am trying to make it work with rom.rb: https://github.com/jaigouk/rom-neighbor

If people don't want to call OpenAI for the query vector, then we'd need a package like https://github.com/facebookresearch/fastText, but that is a Python package, and I don't know of a Ruby gem for it.

obie commented 1 year ago

hey folks, just wanted to point you in the direction of Marqo, which I've been using in my own prototypes as a vector DB with built-in embedding generation and the ability to run easily as a Docker instance. https://www.marqo.ai/