Closed Brucewuzhang closed 4 years ago
I see. If you can send a PR, I can quickly merge it and push an update. That would be really helpful.
If you need me to take a look, it might take some time.
Thanks for pointing this out.
Thanks for your reply.
I just sent you a pull request. I didn't optimize the code for speed; I only fixed this bug by going through the corpus twice.
This checks sanity twice for each sentence, which is fine when the corpus is small but adds processing time for a big corpus. This can be avoided by checking sanity for the whole corpus first (that step can use multiple threads; in fact, the whole training process can). If I have time, I will implement this and send another pull request.
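To illustrate the idea of checking sanity once up front, here is a minimal sketch. The `is_sane` rule and the function names are hypothetical stand-ins, not the actual checks in Trainer.py; the point is only that the filtered corpus can be reused by both passes.

```python
def is_sane(sentence):
    # Hypothetical sanity rule (an assumption, not the library's check):
    # keep non-empty sentences that are not entirely upper-case.
    return bool(sentence) and not all(tok.isupper() for tok in sentence)


def filter_corpus(corpus):
    # Run the sanity check exactly once per sentence and keep the result,
    # so later passes over the corpus do not have to repeat it.
    return [sentence for sentence in corpus if is_sane(sentence)]


clean = filter_corpus([["HELLO", "WORLD"], ["Hello", "world"], []])
```

Both training passes would then iterate over `clean` instead of re-validating each sentence.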
Thanks a lot. I've merged your PR and pushed a new version to PyPI.
Thanks for making this repo.
Bugs in __function_one and __function_two of Trainer.py.
This is a logical bug. I checked the original implementation at
https://github.com/nreimers/truecaser
He first goes through the whole corpus to collect all casing information. You, however, collect casing information on the fly, which means that when a casing of a lowercased token appears for the first time, its 2-gram and 3-gram statistics are never counted. This is not what the algorithm intends.
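The difference can be shown with a small, simplified sketch. It assumes that n-gram counts are only kept for tokens with more than one known casing, and it tracks bigrams only; the function names and data layout are hypothetical and do not mirror Trainer.py or the original truecaser code.

```python
from collections import Counter, defaultdict


def train_two_pass(corpus):
    # Pass 1: collect every casing variant of each lowercased token.
    casings = defaultdict(set)
    for sentence in corpus:
        for token in sentence:
            casings[token.lower()].add(token)
    # Pass 2: count bigrams for tokens known to have several casings.
    bigrams = Counter()
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            if len(casings[cur.lower()]) > 1:
                bigrams[(prev, cur)] += 1
    return bigrams


def train_on_the_fly(corpus):
    # Single pass: casing info is still incomplete while counting,
    # so early occurrences of an ambiguous token are skipped.
    casings = defaultdict(set)
    bigrams = Counter()
    for sentence in corpus:
        for i, token in enumerate(sentence):
            casings[token.lower()].add(token)
            if i > 0 and len(casings[token.lower()]) > 1:
                bigrams[(sentence[i - 1], token)] += 1
    return bigrams


corpus = [["I", "like", "Apple"], ["an", "apple", "pie"]]
two_pass = train_two_pass(corpus)
single_pass = train_on_the_fly(corpus)
```

With this toy corpus, the two-pass version counts the bigram ending in "Apple" from the first sentence, while the single-pass version misses it because "apple" had only one known casing at that point.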