caltechlibrary / iga

IGA is the InvenioRDM GitHub Archiver, a standalone program as well as a GitHub Action that lets you automatically archive GitHub software releases in an InvenioRDM repository.
https://caltechlibrary.github.io/iga/

`is_person` may produce misleading results for strings containing CJK characters #14

Closed mhucka closed 4 months ago

mhucka commented 1 year ago

`is_person()` in `name_utils.py` returns `False` if a name string consists entirely of CJK characters. I wrote it this way because name checkers such as ProbablePeople can't handle CJK. However, that result is obviously wrong if the string really is a human name.
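
For illustration only, here is a minimal sketch of how a string might be recognized as all-CJK before deciding how to handle it. The helper names (`is_cjk_char`, `is_all_cjk`) and the choice of code point ranges are assumptions made for this example. They are not taken from `name_utils.py`, and the ranges cover only the most common blocks.

```python
# Hypothetical helpers for illustration; not the code in IGA's name_utils.py.
# The ranges below are a deliberately small, assumed subset of CJK blocks.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk_char(ch):
    """Return True if the single character ch falls in one of CJK_RANGES."""
    code = ord(ch)
    return any(low <= code <= high for low, high in CJK_RANGES)


def is_all_cjk(text):
    """Return True if text has at least one non-space character and
    every non-space character is a CJK character."""
    chars = [ch for ch in text if not ch.isspace()]
    return bool(chars) and all(is_cjk_char(ch) for ch in chars)
```

A check like this only says which script the string is written in. It says nothing about whether the string is a person's name, which is the hard part of this issue.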

mhucka commented 4 months ago

A partial fix is now in the dev branch and will be in the upcoming 1.3.0 release. The new implementation of is_person() is not very accurate when it comes to names in CJK scripts, but it is still better than the current situation (which is that it always returns False for CJK names).

Solving this problem properly turns out to be very difficult. I wish I could do something better than the current weak, home-grown heuristics. Unfortunately, this appears to be a research-grade problem that no one has solved. Even the best AI systems today can't reliably tell you if, say, a given 1-3 character sequence in Chinese is the name of a person.

The current solution may be as good as we can get for now. I'm going to close this issue because it is unlikely that I can devote more time to this matter.