Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License
17.06k stars 1.82k forks

Chunking and Text Splitter customization #490

Closed timothycarambat closed 3 months ago

timothycarambat commented 6 months ago

We should build out the text-splitting and chunking strategy to be more configurable per workspace. This should be surfaced as an advanced setting, since in general most users should not need to change it. Those who understand the implications should be allowed to modify it.

More on splitters

It would make sense for the UI to have a way to set file extensions => splitter settings with a way to select a document to "preview" how a document would be split in real-time.
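The extension-to-splitter mapping and the live preview could be sketched roughly like this. This is only an illustrative sketch, not AnythingLLM code: `SplitterConfig`, `EXTENSION_SPLITTERS`, and `preview_split` are hypothetical names, and the splitter here is a naive separator-based one.

```python
from dataclasses import dataclass

@dataclass
class SplitterConfig:
    chunk_size: int = 1000   # max characters per chunk
    separator: str = "\n\n"  # preferred split boundary

# Workspace-level mapping: file extension -> splitter settings (hypothetical)
EXTENSION_SPLITTERS = {
    ".md": SplitterConfig(chunk_size=800, separator="\n## "),
    ".py": SplitterConfig(chunk_size=1200, separator="\ndef "),
    ".txt": SplitterConfig(),  # fall back to defaults
}

def preview_split(text: str, ext: str) -> list[str]:
    """Split `text` with the config for `ext`, e.g. for a real-time UI preview."""
    cfg = EXTENSION_SPLITTERS.get(ext, SplitterConfig())
    chunks, current = [], ""
    for piece in text.split(cfg.separator):
        candidate = (current + cfg.separator + piece) if current else piece
        if len(candidate) > cfg.chunk_size and current:
            chunks.append(current)   # flush the full chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The preview pane would simply call `preview_split` on the selected document as the user edits the settings.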

j-loquat commented 6 months ago

Consider using intelligent chunking approach like Unstructured: https://medium.com/@unstructured-io

timothycarambat commented 6 months ago

It is likely not wise for us to spin up an embedded container running Unstructured inside of AnythingLLM to rebuild the document parser, and/or rely on a third-party API for this. Their tool appears robust and powerful, but I am not sure the dependency is worth the overhead at this time. Will investigate further though.

j-loquat commented 6 months ago

Agree it would be a challenge. No question, however, that more intelligent "chunking" of content during ingestion would really help the quality of future search results. I wonder if there would be a way to offer a choice to send the documents to a second, optional container that runs a different parser and then returns the chunks (JSON format?) back to your main container for embedding? Sort of like an effects loop for a guitar amp.
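The "effects loop" idea above implies a small JSON contract between the main container and the optional chunker sidecar. A minimal sketch of what that round trip could look like, with entirely made-up field names and no real AnythingLLM or sidecar API:

```python
import json

def make_chunk_request(doc_id: str, text: str, strategy: str = "semantic") -> str:
    """Payload the main container might POST to the chunker sidecar (hypothetical)."""
    return json.dumps({"docId": doc_id, "text": text, "strategy": strategy})

def parse_chunk_response(body: str) -> list[dict]:
    """Turn the sidecar's JSON reply into chunk records ready for embedding."""
    payload = json.loads(body)
    return [
        {"docId": payload["docId"], "index": i, "text": chunk}
        for i, chunk in enumerate(payload["chunks"])
    ]
```

The main container would embed each returned record exactly as it does for its built-in splitter, so the sidecar stays a drop-in stage in the pipeline.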

j-loquat commented 6 months ago

txtai looks very good as a one-stop-shop way of doing embeddings and different types of search:

https://neuml.github.io/txtai

junxu-ai commented 6 months ago

> We should build out the text-splitting and chunking strategy to be more configurable per workspace. This should be delineated as an advanced setting since in general most should not need to change it. For those who understand the implications should be allowed to modify it.
>
> More on splitters
>
> It would make sense for the UI to have a way to set file extensions => splitter settings with a way to select a document to "preview" how a document would be split in real-time.

Strongly agree with this point, as I noticed that with the default embedding it is a nightmare to get the right input for LLM generation.

timothycarambat commented 6 months ago

@junxu-ai do you mean the splitting or the embedding?

Stihotvor commented 1 month ago

I would like to re-open this issue. There is still no way to choose the splitting/chunking strategy. I would like to switch to code-specific chunking, sentence-based, semantic, etc.

As of now, the only way to do it is to chunk outside into separate files and push them into the UI 😄
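The sentence-based strategy requested above can be sketched in a few lines. This is a naive illustration, not an AnythingLLM feature: regex sentence detection like this mis-splits on abbreviations such as "e.g.", and real splitters use smarter boundary detection.

```python
import re

def sentence_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters."""
    # Split after sentence-ending punctuation followed by whitespace (naive).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = f"{current} {s}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)  # flush: adding s would exceed the budget
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Code-specific and semantic strategies would follow the same shape, differing only in how boundaries are chosen (syntax nodes vs. embedding-similarity dips instead of sentence punctuation).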