Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com

[FEAT]: Max. chunk size should be overridable #2633

Open · TeaAlc opened 1 week ago

TeaAlc commented 1 week ago

What would you like to see?

Feature: Instead of supporting only the one model that happened to be selected at development time, or maintaining limit lists for every embedding model, it would be great to have an option to override the hardcoded `embeddingMaxChunkLength`.
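Something along these lines would already be enough. This is only a sketch: the env var name `EMBEDDING_MAX_CHUNK_LENGTH` and the surrounding constructor shape are made up, not the actual AnythingLLM code:

```js
// Sketch only: env var name and constructor shape are hypothetical.
const HARDCODED_DEFAULT = 1000; // stand-in for the current hardcoded value

class AzureOpenAiEmbedder {
  constructor() {
    // A user who knows their model's real limit can override the default.
    const override = Number(process.env.EMBEDDING_MAX_CHUNK_LENGTH);
    this.embeddingMaxChunkLength =
      Number.isFinite(override) && override > 0 ? override : HARDCODED_DEFAULT;
  }
}
```

That way nobody has to maintain a per-model list: the default stays as it is today, and only users who set the variable change anything.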

Explanation: In Text splitting & Chunking Preferences, the max chunk size seems to depend on the embedding provider rather than on the embedding model. This leads to a maximum character length that does not fit every model.

E.g. the value set for the Azure OpenAI embedding provider (see the attached screenshot). The chunk size seems to come from https://github.com/Mintplex-Labs/anything-llm/blob/da3d0283ffee9c592e5b81d2be6a848722df298f/server/utils/EmbeddingEngines/azureOpenAi/index.js#L22C10-L22C34. The model that was used as the baseline seems to be text-embedding-ada-002, but there are already newer models like text-embedding-3-large.

Also, the AnythingLLM embedder seems to count characters rather than tokens, which reduces the amount of data that ends up in a vector even further.
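A quick way to see the difference (just a sketch, assuming the js-tiktoken package; cl100k_base is the encoding used by text-embedding-ada-002 and the text-embedding-3-* models):

```js
// Compare what a character cap measures vs. what the model limit is counted in.
const { getEncoding } = require("js-tiktoken");

const enc = getEncoding("cl100k_base");
const chunk = "Some document text that is about to be embedded ...";

console.log("characters:", chunk.length);         // what the current cap counts
console.log("tokens:", enc.encode(chunk).length); // what the model actually limits
```

For typical English text a token is roughly 4 characters, so a cap of N characters fills only about a quarter of an N-token context window.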