Closed timothycarambat closed 3 months ago
Consider using an intelligent chunking approach like Unstructured: https://medium.com/@unstructured-io
It is likely not wise for us to spin up an embedded container running Unstructured inside of AnythingLLM to rebuild the document parser, or to rely on a third-party API for this. Their tool appears robust and powerful, but I am not sure the dependency is worth the overhead at this time. Will investigate further, though.
Agree it would be a challenge. No question that more intelligent "chunking" of content during ingestion will really help the quality of the future search results, however. I wonder if there would be a way to offer a choice to send the documents to a second optional container that could run a different parser and then return the chunks (JSON format?) back to your main container for embedding? Sort of like an effects loop for a guitar amp.
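To make the "effects loop" idea a bit more concrete, here is a minimal sketch of what the JSON handed back by an optional parser container could look like. Everything in it is an assumption for illustration: the `Chunk` fields, the `{"chunks": [...]}` envelope, and the paragraph splitter standing in for a real parser are hypothetical and not part of AnythingLLM or Unstructured.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Chunk:
    """Hypothetical shape of one chunk returned by the external parser."""
    text: str
    index: int
    source: str
    metadata: dict


def parse_to_chunks(source: str, text: str) -> list[Chunk]:
    # Stand-in for the external parser: split on blank lines.
    # A real parser container would do something smarter here.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        Chunk(text=p, index=i, source=source, metadata={"parser": "paragraph"})
        for i, p in enumerate(paragraphs)
    ]


def to_json(chunks: list[Chunk]) -> str:
    # What the parser container would send back to the main container.
    return json.dumps({"chunks": [asdict(c) for c in chunks]})


def from_json(payload: str) -> list[Chunk]:
    # What the main container would decode before embedding.
    return [Chunk(**c) for c in json.loads(payload)["chunks"]]
```

The main container would only need to agree on this envelope; which parser produced the chunks stays the other container's business.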
Txtai looks very good as a one stop shop way of doing embeddings and different types of search:
We should build out the text-splitting and chunking strategy to be more configurable per workspace. This should be marked as an advanced setting, since most users should not need to change it, but those who understand the implications should be allowed to modify it.
It would make sense for the UI to have a way to set file extensions => splitter settings with a way to select a document to "preview" how a document would be split in real-time.
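A rough sketch of how an extension-to-splitter mapping with a preview could work is below. The names (`SplitterSettings`, `SETTINGS_BY_EXTENSION`, `preview`) and all the numbers are made up for illustration, and the splitter is a plain fixed-size character window with overlap, not whatever AnythingLLM actually uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitterSettings:
    chunk_size: int     # maximum characters per chunk
    chunk_overlap: int  # characters shared between neighbouring chunks


# Hypothetical defaults; the values are placeholders, not recommendations.
DEFAULT_SETTINGS = SplitterSettings(chunk_size=1000, chunk_overlap=20)
SETTINGS_BY_EXTENSION = {
    ".md": SplitterSettings(chunk_size=800, chunk_overlap=50),
    ".py": SplitterSettings(chunk_size=600, chunk_overlap=0),
}


def settings_for(filename: str) -> SplitterSettings:
    # Look up settings by lower-cased extension, falling back to the default.
    ext = os.path.splitext(filename)[1].lower()
    return SETTINGS_BY_EXTENSION.get(ext, DEFAULT_SETTINGS)


def split_text(text: str, settings: SplitterSettings) -> list[str]:
    step = settings.chunk_size - settings.chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + settings.chunk_size])
        if start + settings.chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def preview(filename: str, text: str, width: int = 40) -> list[tuple[int, int, str]]:
    # One (index, length, leading snippet) row per chunk, for display in a UI.
    chunks = split_text(text, settings_for(filename))
    return [(i, len(c), c[:width]) for i, c in enumerate(chunks)]
```

A preview pane could call `preview(filename, text)` again whenever a setting changes and re-render the rows.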
Strongly agree with this point. I noticed that with the default embedding it is a nightmare to get the right input for LLM generation.
@junxu-ai do you mean the splitting or the embedding?
I would like to re-open the issue. There is still no way to choose the splitting/chunking strategy. I would like to be able to switch to code-specific, sentence-based, or semantic chunking, etc.
As of now, the only way to do it is to chunk the documents externally into separate files and push them into the UI 😄
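For anyone using that workaround, a minimal sketch of the external step could look like this. It assumes plain-text input and a naive sentence splitter based on end-of-sentence punctuation; the function names, the `max_chars` limit, and the output file naming are all hypothetical.

```python
import re
from pathlib import Path


def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def write_chunks(source: Path, out_dir: Path, max_chars: int = 500) -> list[Path]:
    # Write each chunk to its own file so it can be uploaded separately.
    out_dir.mkdir(parents=True, exist_ok=True)
    text = source.read_text(encoding="utf-8")
    paths = []
    for i, chunk in enumerate(sentence_chunks(text, max_chars)):
        path = out_dir / f"{source.stem}_chunk{i:03d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

A single sentence longer than `max_chars` still ends up as one oversized chunk, so this keeps sentences whole rather than enforcing a hard limit.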