arthurpsmith / author-disambiguator

Wikidata service to help create or link author items to published articles
GNU General Public License v3.0

Korean names are not handled properly #190

Closed: Daniel-Mietchen closed this issue 8 months ago

Daniel-Mietchen commented 1 year ago

Here is what I am currently getting for 이상호: [Screenshot: Author Disambiguator, 2023-10-26 at 08-02-06]

So somehow, 이상호 is mapped onto ???, and that paper has nobody named LEE Sang-Ho (or Sang-Ho Lee) on it.

Other Korean name strings, while surfacing some correct results, are also being mapped to those same three question marks from that very paper, e.g. 백승호 (Seung-Ho Park) or 류연규 (Yeon-Gu Ryu):

[Screenshot: Author Disambiguator, 2023-10-26 at 08-15-46]

[Screenshot: Author Disambiguator, 2023-10-26 at 08-15-57]

In addition, some of the suggested matches may also be completely off-target, e.g. 현철 김 (Hyeon-Cheol Kim) surfaces suggestions of unrelated Chinese names, along with further question mark suggestions: [Screenshot: Author Disambiguator, 2023-10-26 at 08-25-14]

arthurpsmith commented 8 months ago

Thanks for raising this issue. Until now, the name handling has been very Latin-script-centric. I've made some small changes to recognize non-Latin scripts and then search for matches based on the likely associated language tags in Wikidata. However, it's rather crude: for example, it will only match when the exact same string appears as an author string or name. Maybe that's OK for CJK names; probably not for Cyrillic, Arabic, or others, but I don't have a good grasp of how to do better on that for now. Anyway, I think this particular problem is now fixed; please let me know if you are still seeing issues here.
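
For readers following along, here is a rough Python sketch of the approach described above: guess the script of a name, pick likely language tags, and accept only exact string matches. It is not the tool's actual code. The function names, the script-to-language table, and the candidate format are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical script-to-language-tag table (an assumption for this sketch,
# not the mapping used by the tool).
SCRIPT_LANGUAGE_TAGS = {
    "HANGUL": ["ko"],
    "HIRAGANA": ["ja"],
    "KATAKANA": ["ja"],
    "CJK": ["zh", "ja", "ko"],
    "CYRILLIC": ["ru", "uk", "bg"],
    "ARABIC": ["ar", "fa"],
}


def detect_script(name):
    """Return the script of the first non-Latin letter in name, else 'LATIN'.

    The script is read from the first word of the Unicode character name,
    e.g. 'HANGUL SYLLABLE I' -> 'HANGUL'.
    """
    for ch in name:
        if not ch.isalpha():
            continue
        script = unicodedata.name(ch, "").split(" ")[0]
        if script in SCRIPT_LANGUAGE_TAGS:
            return script
    return "LATIN"


def language_tags_for(name):
    """Return the language tags to search under; empty for Latin-script names."""
    return SCRIPT_LANGUAGE_TAGS.get(detect_script(name), [])


def exact_matches(name, candidates):
    """Return ids of candidates whose label equals name exactly.

    candidates is a list of (item_id, label) pairs (a hypothetical format).
    Both sides are NFC-normalized and stripped so that composed and
    decomposed Hangul compare equal; there is no fuzzy matching.
    """
    target = unicodedata.normalize("NFC", name).strip()
    return [
        item_id
        for item_id, label in candidates
        if unicodedata.normalize("NFC", label).strip() == target
    ]
```

With this sketch, `language_tags_for("이상호")` yields `["ko"]`, while a Latin-script string such as `"Sang-Ho Lee"` yields an empty list and would fall back to the existing Latin-script handling. The exact-match step is the crude part: it cannot connect 이상호 to a romanized form, and it would not cope with spelling or inflection variants in Cyrillic or Arabic names.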