codeforkjeff / conciliator

OpenRefine reconciliation services for VIAF, ORCID, and Open Library + framework for creating more.
GNU General Public License v3.0

Full stop after some VIAF person headings prevents automatic matching #23

Open ChristianeKlaes opened 3 years ago

ChristianeKlaes commented 3 years ago

Hi,

I'm using your VIAF recon service to reconcile scholars' names from the field of Lexicography and Dictionary Research, in order to construct a domain bibliography and person registry in the Linked Open Data environment.

After reconciling and manually validating 200 person names with VIAF (and getting very good results in general!), I came across a peculiar feature in VIAF that seems to prevent automatic matching in many cases and leads to more tedious manual validation. Apparently, one of the VIAF contributors, NUKAT, appends a full stop to its person name headings. This introduces an edit distance of 1 where there would otherwise be none and causes the score to drop below 1. Even with the OpenRefine option to auto-match high-confidence candidates selected during reconciliation, the score is often below the threshold.

Typical example from my data:

Name literal: Quasthoff, Uwe

VIAF candidate: Quasthoff, Uwe. (score: 0.933)

VIAF URI: https://viaf.org/viaf/22741331/

As far as I can see, NUKAT is the only VIAF contributor that puts a full stop after a person's name, and yet this particular heading is always ranked highest in the VIAF cluster. As we have no way to anticipate whether a matching VIAF cluster includes NUKAT headings or not, is there a way to modify the matching algorithm so that it chops off the trailing full stop (if present) from the candidates returned by VIAF?
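To illustrate the idea, here is a minimal Python sketch. It assumes the score is a normalized Levenshtein similarity (1 minus the edit distance divided by the length of the longer string). That assumption reproduces the 0.933 above (1 - 1/15), but I have not confirmed that it is what the service actually computes. The function names are hypothetical and not taken from the conciliator code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def strip_trailing_full_stop(heading: str) -> str:
    """Hypothetical normalizer: drop one trailing full stop, if present."""
    heading = heading.strip()
    return heading[:-1].rstrip() if heading.endswith(".") else heading


query = "Quasthoff, Uwe"
candidate = "Quasthoff, Uwe."

raw_score = similarity(query, candidate)
normalized_score = similarity(strip_trailing_full_stop(query),
                              strip_trailing_full_stop(candidate))

print(round(raw_score, 3))   # 0.933
print(normalized_score)      # 1.0
```

Applying the same normalization to both the query and the candidate keeps names that legitimately end in an initial (for example "Smith, John A.") comparable, since both sides lose the full stop.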

This would really help to improve your VIAF recon service even further. Thanks for all the work you've already done!

Regards, Christiane

codeforkjeff commented 3 years ago

Hi! I hope to look into this and the other issue this weekend.

ChristianeKlaes commented 3 years ago

Hi,

I've done some more reconciling to VIAF and have come across an additional, related issue:

Some person name headings include disambiguating information, such as the birth year or an occupation, as a qualifier. These seem to be treated as part of the name literal, resulting in low scores:

[image]

In my database, I've got "Josselin-Leray, Amélie". The VIAF recon service returns the correct match as a candidate, but with a score of only 0.786 when it should be 1.00.

Is there any way to eliminate those qualifiers before computing a matching score? At least in the MARC21 format, the qualifiers of a name are kept in their own subfields (in this instance, subfield code "d" - see the full MARC21 specification for personal name headings here: https://www.loc.gov/marc/authority/ad100.html).
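A rough Python sketch of what I mean follows. It assumes the heading is available either as a list of (subfield code, value) pairs from a MARC21 100 field or only as a flat string. The data layout, the function names, and the "Doe, Jane" sample values are all made up for illustration; none of this is taken from the conciliator code or from a real VIAF record.

```python
import re
from typing import Iterable, Tuple

Subfield = Tuple[str, str]  # (subfield code, value), e.g. ("a", "Doe, Jane,")


def name_from_subfields(subfields: Iterable[Subfield],
                        keep: Tuple[str, ...] = ("a",)) -> str:
    """Build the name from the kept subfields only.

    In MARC21 field 100, subfield "a" is the personal name, while "c"
    (titles and other words) and "d" (dates) carry qualifiers.
    """
    parts = [value.strip() for code, value in subfields if code in keep]
    # Subfield "a" often ends with a comma that introduces the next subfield.
    return " ".join(parts).rstrip(" ,")


# Fallback for flat strings: remove a trailing date qualifier such as
# ", 1900-" or " 1900-1980". This is a heuristic, not a full parser.
_TRAILING_DATES = re.compile(r"[,\s]+\d{4}-(?:\d{4})?\.?\s*$")


def strip_date_qualifier(heading: str) -> str:
    """Drop a trailing 'YYYY-' or 'YYYY-YYYY' qualifier from a heading."""
    return _TRAILING_DATES.sub("", heading).strip()


# Hypothetical sample data, not a real record.
field_100 = [("a", "Doe, Jane,"), ("d", "1900-1980")]

print(name_from_subfields(field_100))              # Doe, Jane
print(strip_date_qualifier("Doe, Jane, 1900-"))    # Doe, Jane
print(strip_date_qualifier("Doe, Jane"))           # Doe, Jane
```

The subfield-based variant would also cover occupation qualifiers, since those sit in their own subfield; the regular expression only handles dates and is meant as a fallback when just the flat heading string is available.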

[image] (taken from this person's SUDOC record within VIAF, http://viaf.org/processed/SUDOC%7C096134925)

Thanks a lot!

Christiane