Open zkx06111 opened 5 months ago
The solution is a bit over- and under-engineered at the same time :sweat_smile: The idea was to group small chunks (classes and methods) into the same vector. The different token limits are a bit vague, as you already noticed. The idea was that `max_chunk_size` and `chunk_size` would be the soft limits, and to avoid errors when embedding I added `hard_token_limit`. But as you also noticed in another issue, this is not always handled properly.
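For reference, my reading of the intended limit semantics (an illustrative sketch only; the names match the EpicSplitter settings, but the logic here is mine, not the actual implementation):

```python
# Illustrative semantics of the three limits (not the actual EpicSplitter code):
# chunk_size is the target, max_chunk_size the soft cap, and hard_token_limit
# the cap that must never be exceeded when embedding.

def classify_chunk(tokens: int, chunk_size: int = 750, max_chunk_size: int = 1500,
                   hard_token_limit: int = 2000) -> str:
    if tokens <= chunk_size:
        return "ok"
    if tokens <= max_chunk_size:
        return "over target, still acceptable"
    if tokens <= hard_token_limit:
        return "over soft limit, should be split if possible"
    return "over hard limit, must be truncated"
```

Under this reading, a 3333-token chunk should never reach the embedding step.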
I plan to simplify this by doing the chunking at the method level and truncating methods larger than `hard_token_limit`. If the parent is a class, the class signature, instance variables and constructors could go in a separate chunk.
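A rough sketch of what that plan could look like (purely illustrative; `count_tokens` here is a whitespace stand-in for the real embedding tokenizer, and none of these helpers exist in the repo):

```python
# Hypothetical sketch of method-level chunking with hard-limit truncation.
# count_tokens/truncate are whitespace-based stand-ins; the real splitter
# would use the embedding model's tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def truncate(text: str, limit: int) -> str:
    return " ".join(text.split()[:limit])

def chunk_class(class_header: str, methods: list[str], hard_token_limit: int) -> list[str]:
    """One chunk for the class signature / instance variables / constructor,
    one chunk per method, truncating anything over hard_token_limit."""
    chunks = [truncate(class_header, hard_token_limit)]
    for method in methods:
        chunks.append(truncate(method, hard_token_limit))
    return chunks
```

This way every chunk is guaranteed to stay within `hard_token_limit`, which would avoid the embedding errors mentioned above.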
Happy to help with that and chat further! We can talk on Zoom or whatever works.
I looked further into the implementation of EpicSplitter. It seems that the chunking process does not actually ensure all chunks are smaller than `max_chunk_size`. Is this intentional, or is it an error?

For one instance, when building the index for `sympy__sympy-11870`, in the `_chunk_block` function of EpicSplitter, when the input is `file_path == 'sympy/combinatorics/permutations.py'` and `codeblock.content == 'class Permutation(Basic):'`, the first chunk appended to `chunks` has 3333 tokens, even though `self.max_chunk_size` is 1500, `self.hard_token_limit` is 2000, and `self.chunk_size` is 750.

Specifically, this 3333-token chunk is appended by this line: https://github.com/aorwall/moatless-tools/blob/bda099dfc2dbab34a36bf587f27f4ae75d620d3e/moatless/index/epic_split.py#L266
It seems to me that this part of the code is recursively chunking the child. If that's correct, do we still need the parent's content when the child will be indexed separately?
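To illustrate the question: if children are chunked recursively, one option (just a sketch of the idea, not the current implementation) is to keep only a one-line stub of each child in the parent's chunk, so the full body is indexed exactly once:

```python
# Hypothetical sketch: keep only a stub of each recursively-chunked child in
# the parent's chunk, so the child's body is indexed exactly once in its own
# chunk and not duplicated inside the parent.

def parent_chunk_with_stubs(parent_header: str, children: list[tuple[str, str]]) -> str:
    """children is a list of (signature, body) pairs; the parent chunk keeps
    only the signatures, while the bodies go into separate chunks."""
    lines = [parent_header]
    for signature, _body in children:
        lines.append(f"    {signature} ...")  # stub instead of the full body
    return "\n".join(lines)
```

The parent chunk would then still carry the class context for retrieval without double-counting the method bodies against the token limits.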