SAP / credential-digger

A GitHub scanning tool that identifies hardcoded credentials and filters out false positives with machine learning models :lock:
Apache License 2.0

[similarity] use a vector db for embeddings #294

Open marcorosa opened 8 months ago

marcorosa commented 8 months ago

Postgres now offers an extension for storing vectors (pgvector). We could leverage it to store the embeddings for our similarity feature, since storing and searching vectors is exactly what vector dbs are built for.

Why pgvector rather than another vector db? We already have Postgres in place, so it would be reasonable not to add another component (our stack is already complicated). However, pgvector would require all users to install Postgres, including those who prefer SQLite. So we have two options here: (i) integrate vector db capabilities only for PgClient users, leaving SqliteClient users to store the embeddings as text in SQLite, or (ii) add a local vector db (such as chromadb or FAISS).

Note to self: option (i) is the more conservative choice and could be the starting point.
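To make option (i) more concrete, here is a rough sketch of what the SQLite side could look like if embeddings stay stored as text and similarity is computed in Python. This is not the project's actual code: the table name `embeddings`, the column `discovery_id`, and the helpers `create_table`, `add_embedding`, and `most_similar` are all made up for illustration. The `PGVECTOR_QUERY` string shows roughly how the same lookup might be delegated to pgvector (its `<=>` operator is cosine distance), assuming a table with the same hypothetical names; it is not executed here.

```python
import json
import math
import sqlite3

# Hypothetical pgvector counterpart (not executed in this sketch):
# with pgvector, ordering by cosine distance happens inside Postgres.
PGVECTOR_QUERY = (
    "SELECT discovery_id FROM embeddings "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def create_table(conn):
    # Hypothetical schema: the embedding is kept as a JSON-encoded text column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "discovery_id INTEGER PRIMARY KEY, embedding TEXT NOT NULL)"
    )


def add_embedding(conn, discovery_id, vector):
    conn.execute(
        "INSERT INTO embeddings (discovery_id, embedding) VALUES (?, ?)",
        (discovery_id, json.dumps(vector)),
    )


def most_similar(conn, query, limit=5):
    """Brute-force scan: decode every stored embedding and rank by similarity."""
    rows = conn.execute("SELECT discovery_id, embedding FROM embeddings").fetchall()
    scored = [
        (discovery_id, cosine_similarity(query, json.loads(text)))
        for discovery_id, text in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    add_embedding(conn, 1, [1.0, 0.0])
    add_embedding(conn, 2, [0.0, 1.0])
    add_embedding(conn, 3, [1.0, 1.0])
    print(most_similar(conn, [1.0, 0.1], limit=2))
```

The SQLite path scans every row on each query, which is probably fine for small repos but is the part a vector db would replace. That trade-off is why starting with PgClient only seems reasonable.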

marcorosa commented 8 months ago

Does this issue require #246 to be resolved first?