Open mihail911 opened 1 month ago
Late chunking requires token-level embeddings, right? With closed-source models we don't have the flexibility to obtain token-level embeddings, so right now this can only be done with open-source embedding models.
So we could switch to an open-source model and check whether it outperforms?
Well, Jina has not compared it against closed-source models, so there aren't any benchmarks to compare against yet.
Is your feature request related to a problem? Please describe.
We should explore alternative chunking strategies that may outperform the current one. Empirically, the late chunking strategy seems to do well:
https://arxiv.org/pdf/2409.04701
https://colab.research.google.com/drive/15vNZb6AsU7byjYoaEtXuNu567JWNzXOz?usp=sharing
https://jina.ai/news/late-chunking-in-long-context-embedding-models/
https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii/
https://github.com/jina-ai/late-chunking
Describe the solution you'd like
Implement a new chunker and then experiment with it.
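For reference, a minimal sketch of the pooling step such a chunker would need. It assumes the whole document has already been run through an open-source long-context embedding model that exposes per-token embeddings, and that chunk boundaries are given as token index spans. The function name `late_chunk_pool` and the toy data are hypothetical, not taken from the Jina repo or from this codebase.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings over each [start, end) token span.

    token_embeddings is assumed to come from ONE forward pass over the
    full document, so every token vector already carries document context.
    """
    chunk_embeddings: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span: ({start}, {end})")
        window = token_embeddings[start:end]
        dim = len(window[0])
        # Average each dimension across the tokens in this chunk.
        pooled = [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        chunk_embeddings.append(pooled)
    return chunk_embeddings


# Toy stand-in for the model output: 4 tokens, 2 dimensions.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunks = late_chunk_pool(tokens, [(0, 2), (2, 4)])
print(chunks)  # [[2.0, 1.0], [1.0, 3.0]]
```

The difference from the usual approach is only in ordering: the document is embedded once and pooled per chunk afterwards, instead of splitting first and embedding each chunk in isolation. That is why per-token output from the model is a hard requirement.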