Storia-AI / sage

Chat with any codebase in under two minutes | Fully local or via third-party APIs
https://sage.storia.ai
Apache License 2.0

Feature request: Implement late chunking #57

Open mihail911 opened 1 month ago

mihail911 commented 1 month ago

Is your feature request related to a problem? Please describe.
We should explore alternative chunking strategies that may outperform our current one. Empirically, this late chunking strategy seems to do well:

- https://arxiv.org/pdf/2409.04701
- https://colab.research.google.com/drive/15vNZb6AsU7byjYoaEtXuNu567JWNzXOz?usp=sharing
- https://jina.ai/news/late-chunking-in-long-context-embedding-models/
- https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii/
- https://github.com/jina-ai/late-chunking

Describe the solution you'd like
Implement a new chunker and then experiment with it.

LuciAkirami commented 1 week ago

Late chunking requires token-level embeddings, right? But with closed-source models we don't have the flexibility to obtain token-level embeddings. Right now, this can only be done with open-source embedding models.
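To make the token-level requirement concrete, here is a minimal sketch of the pooling step (this is not sage's actual chunker API, and the span boundaries and embedding dimensions are toy assumptions): late chunking embeds the full document in one pass, then mean-pools the resulting token-level embeddings over each chunk's token span, so every chunk vector carries whole-document context. A closed-source API that only returns one vector per input can't expose the intermediate token embeddings this needs.

```python
import numpy as np

def late_chunk(token_embeddings, chunk_spans):
    """Pool token-level embeddings over each chunk span (late chunking).

    token_embeddings: (num_tokens, dim) array produced by embedding the
    FULL document in a single forward pass, so each token vector already
    reflects whole-document context.
    chunk_spans: list of (start, end) token index pairs, end exclusive.
    """
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in chunk_spans]

# Toy stand-in for the token-level output of a long-context
# open-source embedding model: 10 "tokens", 4-dim embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))

# Hypothetical chunk boundaries chosen for illustration.
spans = [(0, 4), (4, 7), (7, 10)]
chunks = late_chunk(tokens, spans)

assert len(chunks) == 3
assert chunks[0].shape == (4,)
```

In a real implementation the `tokens` array would come from an open-source long-context model (e.g. the `last_hidden_state` of a transformer encoder), and `spans` would come from the same boundary detection a naive chunker uses; only the order of embedding and splitting changes.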

Saksham1387 commented 1 week ago

So we could switch to an open-source model and test whether it outperforms?

LuciAkirami commented 4 days ago

Well, Jina hasn't compared it against closed-source models, so there aren't any benchmarks to compare against yet.