Semantic Chunking Chunk Size Bug

seankim658 commented 2 months ago

LlamaIndex's SemanticSplitterNodeParser can sometimes produce chunks that are too large for the embedding model. Unfortunately, there is no max length option for semantic chunking to avoid this issue.

Will eventually have to subclass SemanticSplitterNodeParser and add a two-level safety net that naively splits oversized chunks into sub-chunks to stay under the embedding model's input token limit.
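
A rough sketch of what that two-level fallback could look like, written as a standalone post-processing step on chunk text rather than an actual SemanticSplitterNodeParser subclass. The names `enforce_max_tokens` and `whitespace_tokenize` are hypothetical, and the whitespace tokenizer is only a stand-in: a real version would count tokens with the embedding model's own tokenizer.

```python
import re
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): swap in the embedding model's tokenizer.
    return text.split()


def enforce_max_tokens(
    chunks: List[str],
    max_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Split any chunk over max_tokens into smaller sub-chunks."""
    safe: List[str] = []
    for chunk in chunks:
        if len(tokenize(chunk)) <= max_tokens:
            safe.append(chunk)
            continue

        # Level 1: split on sentence boundaries and greedily pack sentences.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            candidate = f"{current} {sentence}".strip()
            if len(tokenize(candidate)) <= max_tokens:
                current = candidate
                continue
            if current:
                safe.append(current)
                current = ""
            if len(tokenize(sentence)) <= max_tokens:
                current = sentence
            else:
                # Level 2: a single sentence is still too long, so hard-split
                # it into fixed-size token windows. Joining with spaces is
                # only valid for the whitespace stand-in tokenizer.
                tokens = tokenize(sentence)
                for i in range(0, len(tokens), max_tokens):
                    safe.append(" ".join(tokens[i : i + max_tokens]))
        if current:
            safe.append(current)
    return safe


if __name__ == "__main__":
    oversized = ["short one.", "a b c. d e f g h i j. k l."]
    for sub_chunk in enforce_max_tokens(oversized, max_tokens=4):
        print(repr(sub_chunk))
```

In practice this would run over the text of the nodes the semantic splitter returns, with `max_tokens` set a bit below the embedding model's limit to leave some headroom.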

Reference: https://github.com/run-llama/llama_index/issues/12270

a-gorczew commented 1 month ago

I'm observing the same issue, and not just sometimes but for every library I try to parse with it. Unless this is fixed, the node parser seems useless. The error I'm observing:

\venv\lib\site-packages\openai\_base_client.py", line 993, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8193 tokens (8193 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

seankim658 commented 1 month ago

@a-gorczew yeah, I haven't played around with it much since initially running into the chunk size issue. I think I tried a few different breakpoint_percentile_threshold values but not much else, as it's been low priority.

biocompute-objects / bco-rag

Semantic Chunking Chunk Size Bug #11