bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License
1.68k stars 60 forks

Reconstruction Test #48

Closed mrmps closed 5 days ago

mrmps commented 6 days ago

This test checks whether concatenating all the chunks reconstructs the original text.

Currently 4 out of 5 chunkers fail this test, but the Token chunker passes!
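
For illustration, the property being tested can be sketched as below. This is a minimal sketch, not the test from this PR: `chunk_by_chars` is a hypothetical stand-in chunker and `reconstructs` is a hypothetical helper, neither is part of Chonkie's API.

```python
def chunk_by_chars(text: str, size: int) -> list[str]:
    """Hypothetical stand-in chunker: fixed-size character slices."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def reconstructs(text: str, chunks: list[str]) -> bool:
    """Return True when joining the chunks gives back the input exactly."""
    return "".join(chunks) == text


# A small markdown-like sample with blank lines and list items.
sample = "# Title\n\nFirst paragraph.\n\n- item one\n- item two\n"
chunks = chunk_by_chars(sample, 16)
```

A chunker that slices the text without dropping characters passes the check, while anything that discards whitespace between pieces (for example `sample.split()`) does not.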

bhavnicksm commented 6 days ago

Thanks for the tests, @mrmps!

Looking at why it fails now, I will let you know once I have a concrete fix...

Thanks!

bhavnicksm commented 6 days ago

Hey @mrmps,

I looked at the test and it's a very good test!

To handle complex markdown like the one in the test, we would need a lot of fixes that slow Chonkie down by a factor of 4 to 5, even when it is not handling such complex markdown scenarios!

But of course, we can't neglect such scenarios in Chonkie, so I would add an alternate path that enables advanced features to increase accuracy on complex scenarios, while keeping a quick path that does fast chunking with decent accuracy.

From what I see, the differences come mostly from whitespace and newline characters that are present in the input but missing from the chunked output; they get removed during the splitting process. That would be fixed in the fast path itself, so you would see improvements there too.
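
As an illustration of how splitting can drop whitespace, here is a minimal sketch. It is not Chonkie's actual splitting code: `lossy_split` and `lossless_split` are hypothetical helpers, and the regex approach is just one assumed way to keep every character.

```python
import re


def lossy_split(text: str) -> list[str]:
    # str.split() with no arguments discards every run of whitespace,
    # so spaces and newlines cannot be recovered from the pieces.
    return text.split()


def lossless_split(text: str) -> list[str]:
    # Each piece is a run of non-whitespace plus its trailing whitespace;
    # leading whitespace at the start of the text becomes its own piece.
    return re.findall(r"\S+\s*|\s+", text)


text = "Line one.\n\n  Indented line.\nLast line\n"
lossy = lossy_split(text)
lossless = lossless_split(text)
```

Joining `lossless` with an empty string gives back `text` exactly, whereas no single separator can restore the mix of spaces and newlines removed from `lossy`.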

What do you think about the idea?

bhavnicksm commented 5 days ago

@mrmps,

I just added #53, which adds the required tests in various files and refactors the chunkers for the fix.

Please check and let me know if it works for you!

Closing the PR for now, thanks!