Open suptejas opened 2 weeks ago
Hi @suptejas it makes me really happy to hear that this crate has been helpful for you! Thanks for reaching out.
I haven't tested this myself, but with a chunk size of 8192 and the GPT tokenizer, it is highly likely that this entire code file fits inside a single chunk.
It is important to note that this is a greedy algorithm by default: it tries to pack as many elements as it can within the chunk size. Maybe you were expecting similar functionality to this user? https://github.com/benbrandt/text-splitter/discussions/226 Where it would return all top-level items and only split them if they are too big?
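To illustrate what "greedy by default" means, here is a simplified, std-only sketch (a hypothetical `greedy_pack` helper, not the crate's actual implementation, which walks the tree-sitter syntax tree): consecutive semantic units are merged into one chunk as long as the combined size stays within the limit, so a large limit can swallow the whole file.

```rust
/// Simplified illustration of greedy chunk packing: merge consecutive
/// units into a chunk while the combined length stays within `max_size`.
/// (Hypothetical helper for illustration only.)
fn greedy_pack(units: &[&str], max_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for unit in units {
        // Close the current chunk if adding this unit would overflow it.
        if !current.is_empty() && current.len() + unit.len() > max_size {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(unit);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Three "top-level items" of 40 bytes each.
    let units = ["a".repeat(40), "b".repeat(40), "c".repeat(40)];
    let refs: Vec<&str> = units.iter().map(String::as_str).collect();
    // With a large limit, everything packs into a single chunk...
    assert_eq!(greedy_pack(&refs, 8192).len(), 1);
    // ...while a smaller limit forces one chunk per item.
    assert_eq!(greedy_pack(&refs, 64).len(), 3);
    println!("greedy packing behaves as described");
}
```

This is why a chunk size of 8192 can return the whole file as one chunk even though smaller sizes (512, 1024) appear to split it sensibly.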
You might be able to do something similar by using the range syntax with a lower desired size and a high max size (something like `256..8192`), which might return similar results. But I may need to add a way to not be greedy by default.
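As a rough sketch of how a range capacity like `256..8192` could change the behavior (again a hypothetical std-only helper, not the crate's actual logic): a chunk can be closed once it reaches the desired lower bound at a unit boundary, instead of being packed all the way up to the max.

```rust
/// Simplified sketch of a range-based capacity (e.g. `256..8192`): a chunk
/// is closed as soon as it reaches the desired lower bound at a unit
/// boundary, and never exceeds `max`. (Hypothetical helper for illustration.)
fn range_pack(units: &[&str], desired: usize, max: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for unit in units {
        // Close the chunk if it already meets the desired size, or if
        // adding this unit would exceed the hard maximum.
        if !current.is_empty()
            && (current.len() >= desired || current.len() + unit.len() > max)
        {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(unit);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Three top-level items of 300 bytes each.
    let units = ["a".repeat(300), "b".repeat(300), "c".repeat(300)];
    let refs: Vec<&str> = units.iter().map(String::as_str).collect();
    // Each 300-byte item already satisfies the lower bound of 256, so each
    // top-level item becomes its own chunk despite the 8192 max.
    assert_eq!(range_pack(&refs, 256, 8192).len(), 3);
    println!("range-based packing keeps top-level items separate");
}
```

Under this model, items larger than the desired size stay as individual chunks, which is closer to the "one chunk per function or class" output described below.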
Feel free to let me know more about what you are looking for and I will see if I can accommodate it, because I think your use case is a common one. Thanks again!
Thanks for the quick response! That's right, my expectation was to receive top-level items such as functions and classes as chunks, since I felt it would be helpful to pass more semantically cohesive information into the embeddings model.
I'm also currently working on benchmarking code search results with different embedding methods, so if it's useful I'm happy to report back on which method (greedy or semantic splitting) works better for a code search use case (which is probably one of the most common ones for using `CodeSplitter`, I presume).
OK, this was the second request for this in a short time, so I think I need to find a better way to support this flow :)
And I would love to hear any results you want to share for your use case and what you find beneficial, and whether there is anything I can do with this crate to help support what you are trying to do.
Thanks again!
Hey Ben,
First off, thanks for your work on this incredible library. It's enabled us to achieve substantially better embedding results in our search pipeline.
When trying to embed a file, say the below one:
When using `CodeSplitter` as follows, I get the following output:
Is this expected? It works as it should for smaller chunk sizes like 512 and 1024, but it doesn't seem to do any chunking for a larger chunking token limit.
As you can see, it's not actually chunking the code in this case; it's basically just returning the entire file's contents. I would somewhat expect the output to still separate functions, imports, etc., but I'm not sure if that matches the goal you had in mind.
I'd really appreciate clarification on this. Thanks so much.