covidgraph / motherlode

Pipeline for running all dataloader scripts for covidgraph in a controlled manner.
https://covidgraph.org
MIT License

Author Disambiguation #25

Open keith-ingentium opened 4 years ago

keith-ingentium commented 4 years ago

The authors on references are duplicated for each reference node, and should be unified across all references. With a single node per author, that node should link to every reference the person is an author on.
covidgraph/documentation#1

motey commented 4 years ago

Atm every :Author node has the property _hash_id, which is an md5 hash of all other properties of the :Author node. The :Author nodes are merged based on _hash_id. As a result, when two :Author entries have the same properties, there will be no duplicate.
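To illustrate the merge behaviour described above, here is a minimal sketch of a property-based hash. The function name `hash_id`, the JSON serialization, and the property names `first` and `last` are assumptions for illustration; the actual dataloader may build `_hash_id` differently.

```python
import hashlib
import json


def hash_id(properties: dict) -> str:
    """Return an md5 hex digest over all properties (hypothetical serialization)."""
    # Sort keys so the same set of properties always serializes identically.
    canonical = json.dumps(properties, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


a = {"first": "Tom", "last": "Miller"}
b = {"last": "Miller", "first": "Tom"}  # same properties, different key order
c = {"first": "T.", "last": "Miller"}   # same person, different representation

merged = hash_id(a) == hash_id(b)        # identical properties: one node
duplicated = hash_id(a) != hash_id(c)    # name variant: a second node
```

With a scheme like this, any difference in the stored properties, however small, produces a different hash and therefore a separate node.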

Every "duplicate" is caused by poor source data. This is usually the result of authors using different representations of their names (or being referenced with a different representation of their names).

On the other side of the spectrum, this results in the problem that distinct authors with the same name (e.g. a common name like Tom Miller) are merged into one :Author atm.

To bypass these problems, authors can attach an ORCID iD to their papers. More and more authors do this nowadays, but unfortunately ORCID iDs are missing in the CORD19 dataset.
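If ORCID iDs were available, the merge key could prefer them over the property hash. The following is only a sketch under that assumption: `author_key` is a hypothetical helper, and the iDs shown are made-up placeholders, not real identifiers.

```python
import hashlib
import json
from typing import Optional


def author_key(properties: dict, orcid: Optional[str] = None) -> str:
    """Build a merge key: the ORCID iD if present, else a hash of the properties."""
    if orcid:
        # An ORCID iD identifies a person regardless of how the name is written.
        return "orcid:" + orcid.strip().upper()
    canonical = json.dumps(properties, sort_keys=True, ensure_ascii=False)
    return "hash:" + hashlib.md5(canonical.encode("utf-8")).hexdigest()


tom = {"first": "Tom", "last": "Miller"}

# Two different people with the same name stay separate (placeholder iDs).
key_1 = author_key(tom, orcid="0000-0000-0000-0001")
key_2 = author_key(tom, orcid="0000-0000-0000-0002")

# The same person under two name representations is unified by the iD.
key_3 = author_key({"first": "T.", "last": "Miller"}, orcid="0000-0000-0000-0001")

# Without an iD, the key falls back to the property hash.
key_4 = author_key(tom)
```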

One could improve the current situation by creating a new data source script which matches papers against PubMed data and tries to obtain more detailed author data from there.

As the author name representations in the references in the CORD19 data are very poor, this data will be dropped with the next datamodel release anyway.

mpreusse commented 4 years ago

Some additional ideas from the Matrix chat:

The disambiguation problem is a big one for any graph project. The CORD19 dataset didn't even include links to PubMed, which exacerbates the problem. I think what is called for is the ability to preprocess papers to disambiguate authors against a standard database. If you look through the Wikipedia entry on author disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation) you will see two efforts at building this reference database: AMiner and CiteSeer. For this limited dataset, I think we could build a disambiguated database and process all the literature references through the pipeline to disambiguate the authors. It would be an interesting project to work on, and it would provide real value to CovidGraph.

amalic commented 4 years ago

I used Springer's SciGraph in the past, which contains links between persons and organisations. Don't forget that a person can switch organisations over time.

see: SciGraph Ontology

more: scigraph/docs/jsonld/examples/person.jsonld

keith-ingentium commented 4 years ago

Just took a quick look at the data they make available for download. Not sure how useful it is. We may need to develop a database of our own that is specific to the COVID authors and relies on the information on institutions, co-authors, etc. in the COVID-19 dataset.
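One way such a dataset-specific approach could look is sketched below. It is a rough heuristic, not a proposed implementation: two author mentions are treated as the same person when their normalized names match and they share an institution or at least one co-author. All names here (`Mention`, `normalize`, `same_person`, `cluster`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One occurrence of an author name on one paper."""
    paper_id: str
    name: str
    institution: str = ""
    coauthors: frozenset = frozenset()  # normalized co-author names


def normalize(name: str) -> str:
    """Reduce a name to 'first initial + last name', lower-cased."""
    parts = name.replace(".", " ").lower().split()
    if not parts:
        return ""
    return f"{parts[0][0]} {parts[-1]}"


def same_person(a: Mention, b: Mention) -> bool:
    """Same normalized name plus a shared institution or co-author."""
    if normalize(a.name) != normalize(b.name):
        return False
    shared_institution = bool(a.institution) and a.institution == b.institution
    shared_coauthor = bool(a.coauthors & b.coauthors)
    return shared_institution or shared_coauthor


def cluster(mentions: list) -> list:
    """Group mentions into persons with a small union-find."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if same_person(mentions[i], mentions[j]):
            parent[find(i)] = find(j)

    groups: dict = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


mentions = [
    Mention("p1", "Tom Miller", "Uni A", frozenset({"j doe"})),
    Mention("p2", "T. Miller", "", frozenset({"j doe", "a roe"})),
    Mention("p3", "Tom Miller", "Uni B", frozenset({"x yu"})),
]
clusters = cluster(mentions)
```

In this made-up example, the mentions on p1 and p2 end up in one cluster because they share a co-author, while the Tom Miller on p3 stays separate. As noted earlier in the thread, people switch organisations over time, so an institution mismatch alone is not treated as evidence against a match here.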