aorwall / moatless-tools

MIT License
216 stars 20 forks source link

EpicSplitter not following max_chunk_size #8

Open zkx06111 opened 2 months ago

zkx06111 commented 2 months ago

I further looked into the implementation of EpicSplitter. It seems that the chunking process is not really ensuring all chunks are smaller than max_chunk_size. Is this supposed to be like this or is this an error?

For one instance, when building the index for sympy__sympy-11870, in the _chunk_block function of EpicSplitter, when the input isfile_path == 'sympy/combinatorics/permutations.py' and codeblock.content == 'class Permutation(Basic):', the first chunk being appended to chunks has 3333 tokens, even though self.max_chunk_size is 1500, self.hard_token_limit is 2000, self.chunk_size is 750.

Specifically, this 3333-token chunk is appended by this line: https://github.com/aorwall/moatless-tools/blob/bda099dfc2dbab34a36bf587f27f4ae75d620d3e/moatless/index/epic_split.py#L266

It seems to me that this part of code is recursively chunking the child. If that's correct, do we still need the parent when the child will be indexed separately?

aorwall commented 2 months ago

The solution is a bit over and underengineered at the same time :sweat_smile: The idea was to group small chunks (classes and methods) into the same vector. The different token limits are a bit vague as you already noticed. The idea was that max_chunk_size and chunk_size would be the soft limits and to avoid errors when embedding I added hard_token_limit. But as you also noticed this in another issue this is not always handled properly.

I plan to simplify this by do chunking on method level and truncate methods larger than the hard_token_limit. If the parent is a class the class signature, instance variables and constructors could be in a separate chunk.

zkx06111 commented 2 months ago

happy to help with that and chat further! we can talk on zoom or whatever