Unstructured-IO / unstructured-ingest

Apache License 2.0

Implementation of hybrid search chunking strategy for Pinecone, extra-metadata fields for chunks #224

Open jaisir-shadai opened 2 weeks ago

jaisir-shadai commented 2 weeks ago

This PR adds hybrid search support to the Pinecone connector:

  1. Create sparse vectors using SPLADE
  2. Upsert dense + sparse vectors into Pinecone
  3. Allow upserting to a specific namespace
  4. Allow extra metadata fields to be saved with each chunk
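For illustration only, here is a minimal sketch of what a combined dense + sparse record could look like. It is not the code in this PR: `to_sparse_values` and `build_record` are hypothetical helper names, the token weights stand in for real SPLADE output, and the `id` / `values` / `sparse_values` / `metadata` layout follows the record shape Pinecone documents for hybrid upserts.

```python
from typing import Any, Optional


def to_sparse_values(weights: dict[int, float]) -> dict[str, list]:
    """Convert {token_id: weight} into an indices/values pair, dropping zero weights."""
    items = sorted((idx, w) for idx, w in weights.items() if w != 0.0)
    return {
        "indices": [idx for idx, _ in items],
        "values": [w for _, w in items],
    }


def build_record(
    chunk_id: str,
    dense: list[float],
    sparse_weights: dict[int, float],
    extra_metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble one record carrying both the dense and the sparse vector."""
    record: dict[str, Any] = {
        "id": chunk_id,
        "values": dense,
        "sparse_values": to_sparse_values(sparse_weights),
    }
    if extra_metadata:
        # Copy so later changes to the caller's dict do not leak into the record.
        record["metadata"] = dict(extra_metadata)
    return record


# Made-up weights standing in for SPLADE output (token id -> weight).
record = build_record(
    "chunk-0",
    [0.1, 0.2, 0.3],
    {2054: 1.7, 1012: 0.0, 101: 0.4},
    {"source": "report.pdf"},
)

# With the Pinecone Python client, records like this would typically be sent as:
#   index.upsert(vectors=[record], namespace="my-namespace")
```

The namespace is passed on the upsert call rather than stored in the record, so the same record builder can serve any target namespace.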

This solves #199.

rbiseck3 commented 1 week ago

Would this replace the existing vector generated by the embedder step, or does Pinecone take in two different vectors to support hybrid search? I noticed that in this PR the embedding is now done inline with the upload, which we want to avoid. If a second vector is needed, this might require adding a new embedder and using it as part of the pipeline.
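To make the suggested separation concrete, here is a rough sketch, not the project's actual interfaces. The function names `sparse_embed_step` and `upload_step`, the `sparse_embeddings` field, and the toy encoder are all hypothetical, and the sketch assumes each element already carries its dense vector under `embeddings`. The point is only that the sparse vector is computed in an embedder stage and the uploader just reshapes what it receives.

```python
from typing import Any, Callable

# A sparse encoder maps text to {token_id: weight}; SPLADE would be one such encoder.
SparseEncoder = Callable[[str], dict[int, float]]


def toy_encoder(text: str) -> dict[int, float]:
    """Deterministic stand-in for a real sparse model: counts tokens by a toy id."""
    counts: dict[int, float] = {}
    for token in text.lower().split():
        token_id = sum(ord(ch) for ch in token) % 1000
        counts[token_id] = counts.get(token_id, 0.0) + 1.0
    return counts


def sparse_embed_step(elements: list[dict[str, Any]], encode: SparseEncoder) -> list[dict[str, Any]]:
    """Embedder stage: attach sparse vectors to each element; performs no upload."""
    embedded = []
    for element in elements:
        items = sorted(encode(element["text"]).items())
        sparse = {
            "indices": [idx for idx, _ in items],
            "values": [weight for _, weight in items],
        }
        embedded.append({**element, "sparse_embeddings": sparse})
    return embedded


def upload_step(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Uploader stage: reshape already-embedded elements into records; no model calls."""
    return [
        {
            "id": element["id"],
            "values": element["embeddings"],
            "sparse_values": element["sparse_embeddings"],
        }
        for element in elements
    ]


elements = [{"id": "a", "text": "hi hi", "embeddings": [0.5]}]
records = upload_step(sparse_embed_step(elements, toy_encoder))
```

Keeping the encoder behind a plain callable means the uploader never needs to import or load a model, which is the property the inline approach loses.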