After using Danswer for a while at BrightInsight, we propose adding a feature to customize chunk sizes when creating a connector.
Main Goals:
Customize Chunk Size: Allow increasing or decreasing the vector database chunk size. Currently, this is set by DOC_EMBEDDING_CONTEXT_SIZE.
Customize Chunk Overlap: Allow increasing or decreasing the vector database chunk overlap. Currently, this is set by CHUNK_OVERLAP.
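To make the two knobs concrete, here is a minimal sketch of how a chunk size and a chunk overlap interact in a sliding-window splitter. It is not Danswer's actual chunker: the function name `chunk_tokens` is hypothetical, and a plain list stands in for whatever tokenization is really used.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int, chunk_overlap: int) -> List[List[T]]:
    """Split tokens into windows of chunk_size that share chunk_overlap items.

    Hypothetical helper for illustration only; not part of Danswer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Ten tokens, chunks of four, one token shared between neighbours.
    print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1))
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same document, which is why both values are worth tuning per connector.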
Specific Details:
This modification will be off by default. To turn it on, we will use the environment variable ENABLE_VECTOR_DB_SETTINGS. This way, Danswer will continue working as usual unless this variable is enabled.
If ENABLE_VECTOR_DB_SETTINGS is true, when adding a new connector, two new fields will appear: one for DOC_EMBEDDING_CONTEXT_SIZE and another for CHUNK_OVERLAP.
Update the connector_credential_pair table to save the values of DOC_EMBEDDING_CONTEXT_SIZE and CHUNK_OVERLAP. This way, we can reuse these settings when re-syncing the connector.
Modify the chunking logic to check if a connector has DOC_EMBEDDING_CONTEXT_SIZE and CHUNK_OVERLAP in the database. If not, use the existing logic.
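The fallback behaviour described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `ConnectorChunkSettings` class, the function names, the treatment of the flag as the string "true", and the placeholder default values are hypothetical and do not reflect Danswer's actual code or schema.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Placeholder global defaults for this sketch (assumed values, not Danswer's).
DEFAULT_DOC_EMBEDDING_CONTEXT_SIZE = 512
DEFAULT_CHUNK_OVERLAP = 0


@dataclass
class ConnectorChunkSettings:
    """Hypothetical per-connector values stored with connector_credential_pair."""
    doc_embedding_context_size: Optional[int] = None
    chunk_overlap: Optional[int] = None


def vector_db_settings_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if ENABLE_VECTOR_DB_SETTINGS is set to 'true'."""
    env = os.environ if env is None else env
    return env.get("ENABLE_VECTOR_DB_SETTINGS", "").strip().lower() == "true"


def resolve_chunk_settings(
    stored: Optional[ConnectorChunkSettings], enabled: bool
) -> Tuple[int, int]:
    """Pick per-connector values when available, else the global defaults."""
    if not enabled or stored is None:
        # Feature off or nothing stored: behave exactly as before.
        return DEFAULT_DOC_EMBEDDING_CONTEXT_SIZE, DEFAULT_CHUNK_OVERLAP

    size = stored.doc_embedding_context_size
    overlap = stored.chunk_overlap
    return (
        size if size is not None else DEFAULT_DOC_EMBEDDING_CONTEXT_SIZE,
        overlap if overlap is not None else DEFAULT_CHUNK_OVERLAP,
    )


if __name__ == "__main__":
    custom = ConnectorChunkSettings(doc_embedding_context_size=256, chunk_overlap=32)
    print(resolve_chunk_settings(custom, vector_db_settings_enabled()))
```

Resolving each field independently means a connector that stores only one of the two values still picks up the global default for the other, and connectors created before the feature existed keep their current behaviour.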