MaRDI4NFDI / docker-importer

Import data from external data sources into the portal
https://mardi4nfdi.github.io/docker-importer

Author disambiguation #77

Open eloiferrer opened 1 year ago

eloiferrer commented 1 year ago

Issue description: The current importers (CRAN, zbMath, polyDB) create author entities identified by an ORCID, a zbMath ID, or no identifier at all. Where an identifier exists, the same author might have been created more than once by different importers. Duplicate authors should be identified, merged, and completed with information from Wikidata. The dataset mentioned in https://github.com/MaRDI4NFDI/portal-compose/issues/344 can be useful for this task.
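To make the duplicate-detection step concrete, here is a minimal sketch that groups author entities by shared identifier. The record layout (`entity_id`, `orcid`, `zbmath_id`) and all sample values are hypothetical placeholders, not the importers' actual data model.

```python
from collections import defaultdict


def find_duplicate_authors(authors):
    """Return identifiers that appear on more than one author entity.

    `authors` is a list of dicts with the hypothetical keys
    `entity_id`, `orcid` and `zbmath_id` (either identifier may be None).
    """
    by_identifier = defaultdict(list)
    for author in authors:
        for key in ("orcid", "zbmath_id"):
            value = author.get(key)
            if value:  # authors without this identifier cannot be matched on it
                by_identifier[(key, value)].append(author["entity_id"])
    # Keep only identifiers shared by at least two entities.
    return {ident: ids for ident, ids in by_identifier.items() if len(ids) > 1}


# Placeholder records: Q1/Q2 share an ORCID, Q2/Q3 share a zbMath ID.
authors = [
    {"entity_id": "Q1", "orcid": "0000-0001-0000-0001", "zbmath_id": None},
    {"entity_id": "Q2", "orcid": "0000-0001-0000-0001", "zbmath_id": "doe.jane"},
    {"entity_id": "Q3", "orcid": None, "zbmath_id": "doe.jane"},
    {"entity_id": "Q4", "orcid": None, "zbmath_id": None},
]
duplicates = find_duplicate_authors(authors)
```

Authors with no identifier (like `Q4` here) are never grouped by this approach and would need a different strategy, such as name matching.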

eloiferrer commented 5 months ago

The ORCIDs for all the zbMath authors in https://zenodo.org/records/7378860 have been inserted.

Current statistics in the KG:

Next step: get Wikidata QID for as many humans as possible:

eloiferrer commented 5 months ago

Using the zbMath ID, I have matched the authors to items available in Wikidata. Only ~5% of the zbMath authors exist in Wikidata (with the zbMath identifier). For those where an ORCID was present, it has also been imported.
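The issue does not say how the matching was run. One plausible way is a SPARQL query against the Wikidata Query Service, sketched below. It assumes the Wikidata properties P1556 (zbMATH author ID) and P496 (ORCID iD); the function name and the sample IDs are hypothetical, and the IDs are assumed to contain no double quotes.

```python
def build_zbmath_lookup_query(zbmath_ids):
    """Build a SPARQL query mapping zbMath author IDs to Wikidata items.

    Assumes P1556 is the zbMATH author ID property and P496 the ORCID iD
    property. IDs are inserted verbatim, so they must not contain quotes.
    """
    values = " ".join(f'"{zbmath_id}"' for zbmath_id in zbmath_ids)
    return (
        "SELECT ?item ?zbmathId ?orcid WHERE {\n"
        f"  VALUES ?zbmathId {{ {values} }}\n"
        "  ?item wdt:P1556 ?zbmathId .\n"
        "  OPTIONAL { ?item wdt:P496 ?orcid . }\n"
        "}"
    )


# Placeholder IDs, used only to show the shape of the generated query.
query = build_zbmath_lookup_query(["doe.jane", "roe.richard"])
print(query)
```

Batching IDs through a `VALUES` clause keeps the number of requests small. The `OPTIONAL` block returns the ORCID only when the Wikidata item has one, which matches the "where an ORCID was present" case above.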

Current statistics:

eloiferrer commented 5 months ago

I've imported further Wikidata QIDs based on the ORCIDs currently in the KG. I've also merged several authors that had the same ORCID.
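As an illustration of the merge step, the sketch below folds records with the same ORCID into the first one seen and records which entities were merged away. It works on plain dictionaries with hypothetical keys and placeholder values. It is not the code used for the KG, where the merge would go through the wiki's own merge mechanism.

```python
def merge_by_orcid(authors):
    """Merge author records that share an ORCID.

    The first record seen for an ORCID survives; later records fill in any
    fields the survivor is missing. Returns the merged records and a map
    from each merged-away entity ID to the surviving entity ID.
    """
    merged = {}      # ORCID -> surviving record
    untouched = []   # records without an ORCID are passed through
    redirects = {}   # merged-away entity ID -> surviving entity ID
    for author in authors:
        orcid = author.get("orcid")
        if not orcid:
            untouched.append(dict(author))
            continue
        target = merged.get(orcid)
        if target is None:
            merged[orcid] = dict(author)
            continue
        redirects[author["entity_id"]] = target["entity_id"]
        for key, value in author.items():
            if target.get(key) is None:  # only fill gaps, never overwrite
                target[key] = value
    return list(merged.values()) + untouched, redirects


# Placeholder records: Q1 and Q2 share an ORCID, Q3 has none.
records, redirects = merge_by_orcid([
    {"entity_id": "Q1", "orcid": "0000-0001-0000-0001", "zbmath_id": None},
    {"entity_id": "Q2", "orcid": "0000-0001-0000-0001", "zbmath_id": "doe.jane"},
    {"entity_id": "Q3", "orcid": None, "zbmath_id": "roe.richard"},
])
```

Filling only missing fields means conflicting values (for example, two different zbMath IDs under one ORCID) are silently resolved in favour of the survivor. A real merge would probably want to flag such conflicts for review instead.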

Current statistics:

eloiferrer commented 5 months ago

Wikidata has author items that contain two zbMath IDs. In most cases this is wrong, and it leads to our knowledge graph assigning the same Wikidata QID to two different zbMath authors (see the cases at http://tinyurl.com/27d65qov). This would require some manual disambiguation.
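A list of candidates for manual review could be produced along these lines: collect the zbMath IDs attached to each QID and report every QID that has more than one. This is a sketch only; the `(zbmath_id, qid)` pair format and the sample values are placeholders, not data from the KG.

```python
from collections import defaultdict


def find_shared_qids(links):
    """Return QIDs that are attached to more than one zbMath author.

    `links` is an iterable of (zbmath_id, wikidata_qid) pairs.
    """
    by_qid = defaultdict(set)
    for zbmath_id, qid in links:
        by_qid[qid].add(zbmath_id)
    # Sort the IDs so the report is stable between runs.
    return {qid: sorted(ids) for qid, ids in by_qid.items() if len(ids) > 1}


# Placeholder links: two zbMath authors point at the same (made-up) QID.
conflicts = find_shared_qids([
    ("doe.jane", "Q900001"),
    ("doe.jane-b", "Q900001"),
    ("roe.richard", "Q900002"),
])
```

Each entry in `conflicts` is one case where someone has to decide which zbMath author, if any, the Wikidata item really describes.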