chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars, 72 forks

Vector search similar movies with embedding generated by item2vec and CBOW model #150

Daniel-Robbins closed 9 months ago

Daniel-Robbins commented 9 months ago

This demo shows how to train vector representations of items using Word2Vec and make item recommendations based on the similarity of the item vectors. It consists of 5 parts:

  1. Prepare item sequences based on user behavior.
  2. Train a CBOW model using the Word2Vec module of the gensim library.
  3. Extract all embedding data and write it to chDB.
  4. Query chDB by cosine distance to find movies similar to the input movie.
  5. A simple unit test for vector data insertion and querying.
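
For readers who want the idea without opening the diff, here is a minimal pure-Python sketch of steps 1 and 4. It is not the code in this PR: the event tuples, the toy embeddings, and the helper names `build_sequences`, `cosine_distance`, and `most_similar` are all made up for illustration. In the actual demo the vectors come from a gensim Word2Vec model trained in CBOW mode and the ranking is done by a query in chDB; the sketch only mirrors the arithmetic such a cosine-distance query would perform.

```python
import math
from collections import defaultdict


def build_sequences(events):
    """Step 1 (sketch): group (user, movie, timestamp) events into
    one time-ordered movie sequence per user."""
    by_user = defaultdict(list)
    for user, movie, ts in events:
        by_user[user].append((ts, movie))
    # Sorting the (timestamp, movie) pairs orders each user's history by time.
    return {user: [movie for _, movie in sorted(pairs)]
            for user, pairs in by_user.items()}


def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(embeddings, movie_id, top_k=2):
    """Step 4 (sketch): rank all other movies by cosine distance
    to the input movie, closest first."""
    target = embeddings[movie_id]
    scored = [(other, cosine_distance(target, vec))
              for other, vec in embeddings.items()
              if other != movie_id]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Hypothetical user behavior and toy 2-d embeddings (assumed data).
events = [("u1", "m2", 20), ("u1", "m1", 10), ("u2", "m3", 5)]
embeddings = {
    "m1": [1.0, 0.0],
    "m2": [0.9, 0.1],
    "m3": [0.0, 1.0],
    "m4": [-1.0, 0.0],
}

sequences = build_sequences(events)
neighbours = most_similar(embeddings, "m1", top_k=2)
```

The sequences from `build_sequences` play the role of "sentences" for Word2Vec training, and `most_similar` corresponds to ordering rows by cosine distance and taking the first few.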

lmangani commented 9 months ago

Thanks @Daniel-Robbins for all the amazing contributions 🤟