daltonfury42 / truecase

A python true casing utility that restores case information for texts
Apache License 2.0

bug report #16

Closed Brucewuzhang closed 4 years ago

Brucewuzhang commented 4 years ago

Thanks for making this repo.

There are bugs in __function_one and __function_two of Trainer.py.

This is a logic bug. I checked the original implementation at

https://github.com/nreimers/truecaser

The author first goes through the whole corpus to collect all casing information. Your implementation collects casing information on the fly, which means that when a casing appears for the first time for a given lowercased token, its 2-gram and 3-gram statistics are never counted. This is not the intended behavior of the algorithm.
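To make the difference concrete, here is a minimal sketch, not the actual Trainer.py code. The function names are hypothetical, only bigrams are counted, and I am assuming a token's n-grams are recorded only once its lowercased form has been seen with at least two casings.

```python
from collections import Counter, defaultdict


def count_bigrams_single_pass(sentences):
    """Hypothetical on-the-fly variant: casing info is built while counting."""
    casings = defaultdict(set)
    bigrams = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            casings[tok.lower()].add(tok)
            # Assumption: count only tokens already known to be ambiguous.
            if i > 0 and len(casings[tok.lower()]) >= 2:
                bigrams[(tokens[i - 1], tok)] += 1
    return bigrams


def count_bigrams_two_pass(sentences):
    """Hypothetical two-pass variant: collect all casings, then count."""
    casings = defaultdict(set)
    for tokens in sentences:  # pass 1: casing info for the whole corpus
        for tok in tokens:
            casings[tok.lower()].add(tok)
    bigrams = Counter()
    for tokens in sentences:  # pass 2: n-gram statistics
        for i in range(1, len(tokens)):
            if len(casings[tokens[i].lower()]) >= 2:
                bigrams[(tokens[i - 1], tokens[i])] += 1
    return bigrams


corpus = [
    ["I", "ate", "an", "apple"],
    ["I", "bought", "an", "Apple", "laptop"],
]
single = count_bigrams_single_pass(corpus)
two_pass = count_bigrams_two_pass(corpus)
```

In this toy corpus the single-pass version misses the bigram ("an", "apple"), because "apple" had only one known casing when the first sentence was processed. The two-pass version counts it.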

daltonfury42 commented 4 years ago

I see. If you can send a PR, I can quickly merge and update. It would be really helpful.

If you need me to take a look, it might take some time.

Thanks for pointing this out.

Brucewuzhang commented 4 years ago

Thanks for your reply.

I just sent you a pull request. I didn't optimize the code for speed. I just fixed this bug by going through the corpus twice.

This checks sanity twice for each sentence, which is fine when the corpus is small but adds processing time for a large corpus. This can be handled by checking sanity for the whole corpus first (which can be multi-threaded; in fact, the whole training process can be). If I have time, I will implement this and send another pull request.
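A rough sketch of the "check sanity once, up front" idea, assuming a thread pool from the standard library. `is_sane` here is a hypothetical placeholder, not the check used in Trainer.py, and `filter_sane` is likewise an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def is_sane(tokens):
    """Hypothetical placeholder check: reject empty and all-uppercase sentences."""
    text = " ".join(tokens)
    return bool(tokens) and not text.isupper()


def filter_sane(sentences, max_workers=4):
    """Run the sanity check once per sentence and keep only the sane ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so flags line up with sentences.
        flags = list(pool.map(is_sane, sentences))
    return [s for s, ok in zip(sentences, flags) if ok]


corpus = [
    ["BREAKING", "NEWS"],
    ["I", "bought", "an", "Apple", "laptop"],
    [],
]
clean = filter_sane(corpus)
```

Both training passes could then iterate over `clean` without re-checking each sentence. For a CPU-bound check in CPython, a process pool may be a better fit than threads.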

daltonfury42 commented 4 years ago

Thanks a lot. I've merged your PR and pushed a new version to PyPI.