Closed mrmps closed 5 days ago
Thanks for the tests, @mrmps!
Looking at why it fails now, I will let you know once I have a concrete fix...
Thanks!
Hey @mrmps,
I looked at the test and it's a very good test!
To handle complex Markdown like the one in the test, we would have to make a lot of fixes that would slow Chonkie down by a factor of 4 to 5, even when it is not handling such complex Markdown!
But of course, we can't neglect such scenarios in Chonkie, so I would add an alternate path that enables advanced features to increase accuracy on complex scenarios, while keeping a quick path that does fast chunking with decent accuracy.
From what I see, the differences come mostly from whitespace and newline characters that are present in the input but missing from the chunked output; they get removed during the splitting process. This would be fixed in the fast path itself, so you would see improvements there too.
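A minimal illustration of how a splitting step can drop whitespace and newlines. This is plain Python, not Chonkie's actual code, and the sample text and variable names are made up for the example. Splitting on whitespace throws the separators away, whereas `re.split` with a capturing group keeps them, so the pieces can be joined back into the input:

```python
import re

# Hypothetical sample input with a blank line and a double space.
text = "First line.\n\nSecond  line with  double spaces.\n"

# str.split() discards the separators, so the exact whitespace is lost.
lossy_pieces = text.split()
lossy_rebuilt = " ".join(lossy_pieces)

# A capturing group makes re.split return the separators as pieces too.
lossless_pieces = re.split(r"(\s+)", text)
lossless_rebuilt = "".join(lossless_pieces)

print(lossy_rebuilt == text)     # whitespace was collapsed
print(lossless_rebuilt == text)  # separators were preserved
```

The lossless variant costs a little extra work per split, which is the kind of trade-off a fast path would need to weigh.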
What do you think about the idea?
@mrmps,
I just added #53, which adds the required tests in various files and refactors the chunkers for the fix.
Please check and let me know if it works for you!
Closing the PR for now, thanks!
This test checks whether concatenating all the chunks reconstructs the original text.
Currently 4 out of 5 chunkers fail this test, but the Token chunker passes!
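For readers who want the shape of the check, here is a rough sketch of a reconstruction test. It does not use Chonkie's API: `fixed_size_chunker`, `naive_word_chunker`, `reconstructs`, and `SAMPLE` are all hypothetical stand-ins, assuming a chunker can be modelled as a function from a string to a list of chunk strings.

```python
from typing import Callable, List

# Hypothetical Markdown sample; the real test uses a more complex document.
SAMPLE = "# Title\n\nSome *markdown* text.\n\n- item one\n- item two\n"


def fixed_size_chunker(text: str, size: int = 16) -> List[str]:
    """Slice the text into consecutive character windows (lossless)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def naive_word_chunker(text: str, words_per_chunk: int = 3) -> List[str]:
    """Group whitespace-separated words; the original whitespace is lost."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def reconstructs(chunker: Callable[[str], List[str]], text: str) -> bool:
    """Return True if joining the chunks gives back the exact input."""
    return "".join(chunker(text)) == text


print(reconstructs(fixed_size_chunker, SAMPLE))
print(reconstructs(naive_word_chunker, SAMPLE))
```

In this sketch the slicing chunker round-trips because it never discards characters, while the word-based one fails as soon as the input contains newlines or repeated spaces.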