Open tholzheim opened 2 years ago
The question is how we translate the editor who submitted the proceeding. We could additionally add this editor as editor-in-chief (P5769)
dblp has 2410 volumes with editors:
PREFIX dblp: <https://dblp.org/rdf/schema#>
SELECT ?proceeding ?volume (group_concat(?editor;separator=";") as ?editors)
WHERE {
?proceeding dblp:publishedIn "CEUR Workshop Proceedings";
dblp:publishedInSeriesVolume ?volume;
dblp:editedBy ?editor.
}
GROUP BY ?proceeding ?volume
Across all volumes there are 4696 distinct editors.
Checking wikidata shows that for 1331 editors a wikidata item that is linked to dblp already exists.
For the remaining editors I need to check whether a missing dblp author ID already indicates that, in most cases, the wikidata item itself does not exist.
To check this I intend to use the disambiguation feature of OpenRefine and then decide how to approach the creation/linking of the missing editors.
1359 editors are already linked to wikidata. 49 dblp editors have two wikidata ids:
- 9 wikidata items could be merged as they represented the same person
- 39 wikidata items were already merged
- 1 dblp editor id actually represented two persons and thus the wikidata items were correct (3601 → Q56459864 and Q51932541)
PREFIX dblp: <https://dblp.org/rdf/schema#>
SELECT DISTINCT ?editor ?editorName ?editor_wikidata
WHERE {
?proceeding dblp:publishedIn "CEUR Workshop Proceedings";
dblp:publishedInSeriesVolume ?volume;
dblp:editedBy ?editor.
?editor dblp:primaryCreatorName ?editorName.
?editor dblp:wikidata ?editor_wikidata .
}
The link between dblp and wikidata is not always bidirectional: an editor with a wikidata id in dblp does not always mean that the wikidata item has the dblp author id defined, and vice versa.
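A minimal sketch of how the one-directional links could be listed, assuming both link sets have already been fetched into dictionaries (dblp author id to wikidata ids from dblp, and wikidata id to dblp author ids from wikidata, e.g. via DBLP author ID, P2456). The function name and the example ids are hypothetical.

```python
def link_mismatches(dblp_to_wikidata, wikidata_to_dblp):
    """Return the dblp/wikidata links that exist in only one direction.

    dblp_to_wikidata: dict of dblp author id -> set of wikidata ids (as stated by dblp)
    wikidata_to_dblp: dict of wikidata id -> set of dblp author ids (as stated by wikidata)
    """
    # Normalise both directions to (dblp id, wikidata id) pairs so they can be compared.
    forward = {(d, q) for d, qids in dblp_to_wikidata.items() for q in qids}
    backward = {(d, q) for q, dblp_ids in wikidata_to_dblp.items() for d in dblp_ids}
    return {
        "only_in_dblp": sorted(forward - backward),
        "only_in_wikidata": sorted(backward - forward),
    }


# Hypothetical example data, not taken from dblp or wikidata.
example = link_mismatches(
    {"a/Alice": {"Q1"}, "b/Bob": {"Q2"}},
    {"Q1": {"a/Alice"}, "Q3": {"c/Carol"}},
)
```

Pairs listed under `only_in_dblp` would be candidates for adding the dblp author id to the wikidata item; pairs under `only_in_wikidata` are links that dblp does not state.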
To disambiguate the editors we can use all available identifiers:
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX dblp: <https://dblp.org/rdf/schema#>
PREFIX litre: <http://purl.org/spar/literal/>
SELECT DISTINCT ?editor ?editorName ?editor_wikidata ?affiliation ?primaryhomepage ?identifier ?acm ?dblp ?gepris ?github ?gnd ?googleScholar ?ieee ?isni ?lattes ?linkedin ?loc ?mathGenealogy ?orcid ?researchGate ?scigraph ?twitter ?viaf ?wikidata ?zbmath
WHERE {
?proceeding dblp:publishedIn "CEUR Workshop Proceedings";
dblp:publishedInSeriesVolume ?volume;
dblp:editedBy ?editor.
?editor dblp:primaryCreatorName ?editorName.
OPTIONAL{?editor dblp:wikidata ?editor_wikidata .}
# orcid
OPTIONAL {
?editor datacite:hasIdentifier ?orcid_blank.
?orcid_blank datacite:usesIdentifierScheme datacite:orcid;
litre:hasLiteralValue ?orcid.
}
# google_scholar
OPTIONAL {
?editor datacite:hasIdentifier ?google_scholar_blank.
?google_scholar_blank datacite:usesIdentifierScheme datacite:google-scholar;
litre:hasLiteralValue ?googleScholar.
}
# acm
OPTIONAL {
?editor datacite:hasIdentifier ?acm_blank.
?acm_blank datacite:usesIdentifierScheme datacite:acm;
litre:hasLiteralValue ?acm.
}
# twitter
OPTIONAL {
?editor datacite:hasIdentifier ?twitter_blank.
?twitter_blank datacite:usesIdentifierScheme datacite:twitter;
litre:hasLiteralValue ?twitter.
}
# github
OPTIONAL {
?editor datacite:hasIdentifier ?github_blank.
?github_blank datacite:usesIdentifierScheme datacite:github;
litre:hasLiteralValue ?github.
}
# viaf
OPTIONAL {
?editor datacite:hasIdentifier ?viaf_blank.
?viaf_blank datacite:usesIdentifierScheme datacite:viaf;
litre:hasLiteralValue ?viaf.
}
# scigraph
OPTIONAL {
?editor datacite:hasIdentifier ?scigraph_blank.
?scigraph_blank datacite:usesIdentifierScheme datacite:scigraph;
litre:hasLiteralValue ?scigraph.
}
# zbmath
OPTIONAL {
?editor datacite:hasIdentifier ?zbmath_blank.
?zbmath_blank datacite:usesIdentifierScheme datacite:zbmath;
litre:hasLiteralValue ?zbmath.
}
# researchGate
OPTIONAL {
?editor datacite:hasIdentifier ?researchGate_blank.
?researchGate_blank datacite:usesIdentifierScheme datacite:research-gate;
litre:hasLiteralValue ?researchGate.
}
# mathGenealogy
OPTIONAL {
?editor datacite:hasIdentifier ?mathGenealogy_blank.
?mathGenealogy_blank datacite:usesIdentifierScheme datacite:math-genealogy;
litre:hasLiteralValue ?mathGenealogy.
}
# loc
OPTIONAL {
?editor datacite:hasIdentifier ?loc_blank.
?loc_blank datacite:usesIdentifierScheme datacite:loc;
litre:hasLiteralValue ?loc.
}
# linkedin
OPTIONAL {
?editor datacite:hasIdentifier ?linkedin_blank.
?linkedin_blank datacite:usesIdentifierScheme datacite:linkedin;
litre:hasLiteralValue ?linkedin.
}
# lattes
OPTIONAL {
?editor datacite:hasIdentifier ?lattes_blank.
?lattes_blank datacite:usesIdentifierScheme datacite:lattes;
litre:hasLiteralValue ?lattes.
}
# isni
OPTIONAL {
?editor datacite:hasIdentifier ?isni_blank.
?isni_blank datacite:usesIdentifierScheme datacite:isni;
litre:hasLiteralValue ?isni.
}
# ieee
OPTIONAL {
?editor datacite:hasIdentifier ?ieee_blank.
?ieee_blank datacite:usesIdentifierScheme datacite:ieee;
litre:hasLiteralValue ?ieee.
}
# gnd
OPTIONAL {
?editor datacite:hasIdentifier ?gnd_blank.
?gnd_blank datacite:usesIdentifierScheme datacite:gnd;
litre:hasLiteralValue ?gnd.
}
#gepris
OPTIONAL {
?editor datacite:hasIdentifier ?gepris_blank.
?gepris_blank datacite:usesIdentifierScheme datacite:gepris;
litre:hasLiteralValue ?gepris.
}
#homepage
OPTIONAL {?editor dblp:primaryHomepage ?primaryhomepage .}
OPTIONAL{?editor dblp:primaryAffiliation ?affiliation .}
}
@tholzheim this looks great. The disambiguation query suffers from the non truly tabular effect.
Unfortunately, none of the identifiers seems to help in this case.
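Assuming the "non truly tabular" effect means that an editor with several values for one identifier shows up in several result rows, a sketch like the following could collapse the rows into one record per editor. The function name and the row layout (a list of dicts keyed by variable name) are assumptions, not part of the existing tooling.

```python
from collections import defaultdict


def collapse_rows(rows, key="editor"):
    """Merge result rows that share the same key into one record per key.

    Every other column becomes a sorted list of its distinct non-empty values.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for row in rows:
        record = merged[row[key]]  # create the record even if all other columns are empty
        for column, value in row.items():
            if column != key and value:
                record[column].add(value)
    return {
        k: {column: sorted(values) for column, values in columns.items()}
        for k, columns in merged.items()
    }
```

With this shape every editor is one record, so the number of records equals the number of distinct editors regardless of how many identifier values each one has.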
- identified: only one match for all provided ids
- conflict: multiple matches for all provided ids
- unknown: no match for all provided ids

Note: dblp id is excluded since it is present for all authors
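The three categories can be sketched as a small function. This assumes that, per editor, the wikidata items matched through each identifier are already available as sets, and it reads "one match for all provided ids" as "exactly one distinct item across all ids"; both the function name and that reading are assumptions.

```python
def classify(matches_per_identifier):
    """Classify one editor by the wikidata items matched through its identifiers.

    matches_per_identifier: dict of identifier name -> set of matched wikidata ids.
    The dblp id is expected to be left out by the caller.
    """
    # Union of all items matched by any of the provided identifiers.
    candidates = set().union(*matches_per_identifier.values())
    if not candidates:
        return "unknown"
    if len(candidates) == 1:
        return "identified"
    return "conflict"
```

For example, `classify({"orcid": {"Q1"}, "gnd": {"Q2"}})` yields `"conflict"` because the two identifiers point to different items.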
From this plot we can conclude that if at least one author id is in wikidata, the link to dblp has already been made (since the number of identified editors and the number of already linked dblp author ids are nearly identical). dblp seems to sync its author data with wikidata, since we cannot gain any additional author information from the dblp side.
The conflict cases need manual curation. For the unknown cases I will look into the email, affiliation and name to see how these properties can be used to make a correct match and how error-prone this matching would be.
To add the >3000 unidentified editors to wikidata with high confidence that their item currently does not exist, we need additional information.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT
(COUNT(DISTINCT ?mail) as ?numberOfEmailAddresses)
(COUNT(DISTINCT ?homepage) as ?numberOfHomepages)
(COUNT(?employer) as ?numberOfEmployers)
(COUNT(?affiliation ) as ?numberOfAffiliations)
(COUNT(?residence ) as ?numberOfResidence)
WHERE{
?author wdt:P106 wd:Q1650915.
OPTIONAL{?author wdt:P968 ?mail}
OPTIONAL{?author wdt:P856 ?homepage}
OPTIONAL{?author wdt:P108 ?employer}
OPTIONAL{?author wdt:P1416 ?affiliation}
OPTIONAL{?author wdt:P551 ?residence}
}
Email addresses | Homepages | Employer entries | Affiliation entries | Residence entries
---|---|---|---|---
642 | 59828 | 935757 | 17597 | 2063
The result suggests that an editor's homepage and affiliation (basically the same as employer) are useful for verifying whether an item belongs to an editor/author when multiple matches or an unclear match exist.
For this reason I added a parser for the editor, editor homepage, affiliation and affiliation homepage to the VolumeParser.
See ef3114dca529be8e04188a14326df5aa205945de
A first test showed that for 97% of the volumes 600-3225 this information could be extracted correctly.
In a next step it needs to be analyzed whether this information helps to identify more editors or to ensure that no entry currently exists.
For the volumes 600-3225 the parser extracted:
- 9891 editors
- 7089 affiliations (with link to website if present)
- 6201 editor homepages

Duplicates are included since the records are not yet disambiguated.
Reminder: dblp has 1333 editors linked to wikidata and 3395 not linked to wikidata. In ceur-ws we have 5294 distinct editor names (different spellings of a name can refer to the same editor) and 9912 raw editor records.
Of the 3439 distinct affiliation websites, 515 can be exactly matched against official website (P856).
By generating alternative website URLs the number of matched affiliations can be increased to 1180.
For the url http://www.uni-kiel.de/ the following variants are generated and also queried:
- http://www.uni-kiel.de/
- http://www.uni-kiel.de
- https://www.uni-kiel.de/
- https://www.uni-kiel.de
- http://uni-kiel.de/
- http://uni-kiel.de
- https://uni-kiel.de/
- https://uni-kiel.de
A match is then found for http://www.uni-kiel.de. This method has proven to be faster than using string contains or other string pattern functions.
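The variant generation described above can be sketched as follows. This is an illustration rather than the code used in the project: the function names are hypothetical, and query strings and fragments are simply dropped.

```python
from urllib.parse import urlsplit


def url_variants(url):
    """Generate http/https, www/non-www and trailing-slash variants of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    path = parts.path.rstrip("/")
    variants = []
    for variant_host in ("www." + bare, bare):
        for scheme in ("http", "https"):
            for slash in ("/", ""):
                variants.append(f"{scheme}://{variant_host}{path}{slash}")
    return variants


def homepage_values(urls):
    """Render all variants as IRIs for a SPARQL VALUES clause (P856 values are IRIs)."""
    unique = sorted({variant for url in urls for variant in url_variants(url)})
    return " ".join(f"<{variant}>" for variant in unique)
```

Because every variant is an exact IRI, the query can bind them with `VALUES` and rely on exact matching instead of string functions.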
Query
SELECT ?affiliation ?affiliationLabel ?homepage
WHERE {
VALUES ?homepage {}
?affiliation wdt:P856 ?homepage.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Label
The affiliation label often contains the country or other location information, such as
University of Bremen, Cartesium, 28359 Bremen, Germany
Note: currently only the first part of the label is used. For the example this means we only search for University of Bremen. Matching the country, region, city, and/or address could increase and validate the matches but leads to a more complex query and procedure, and is thus omitted for the time being.
Query
SELECT DISTINCT ?affiliation ?affiliationLabel ?inputLabel
WHERE {
VALUES ?inputLabel {%s}
?affiliation rdfs:label|skos:altLabel ?inputLabel;
^wdt:P108 ?editor. # item has employees
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
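A sketch of how the `%s` placeholder could be filled from the raw affiliation strings, keeping only the part before the first comma. The helper names are hypothetical, and the `@en` language tag is an assumption: wikidata labels are language-tagged literals, so an untagged literal would not match them.

```python
def primary_label(affiliation):
    """Return the first comma-separated part of an affiliation string."""
    return affiliation.split(",", 1)[0].strip()


def label_values(affiliations, lang="en"):
    """Render the distinct primary labels as language-tagged SPARQL literals."""

    def escape(text):
        # Escape backslashes and double quotes so the literal stays well-formed.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    parts = (primary_label(affiliation) for affiliation in affiliations)
    unique = sorted({part for part in parts if part})
    return " ".join(f'"{escape(part)}"@{lang}' for part in unique)
```

The result can then be substituted into the query template, for example with `query % label_values(affiliations)`.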
## Result
```mermaid
pie showData
title Matching of affiliations against Wikidata items
"by label" : 492
"by website" : 503
"by label and website" : 677
"no match" : 1771
```

```mermaid
pie showData
title Matched affiliations of editor records
"editor record with matched affiliation" : 5099
"editor record without matched affiliation" : 4813
```
Note: here all editor records are considered and not the distinct editor names since the affiliations can differ and thus link to different persons.
Editors
From the 4225 distinct editor websites:
- 164 can be matched against official website (P856)
- 50 are orcid links

For each editor that is not already linked to wikidata, query for the name and try to narrow down the results by including the affiliation homepage.
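The orcid links among the editor homepages could be detected with a sketch like the one below. The function names are hypothetical; the check digit follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re

# Matches the 16-character identifier after "orcid.org/" in a homepage URL.
ORCID_PATTERN = re.compile(r"orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])", re.IGNORECASE)


def orcid_check_digit(base_digits):
    """Compute the ORCID check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def extract_orcid(url):
    """Return the ORCID iD contained in a URL, or None if absent or invalid."""
    match = ORCID_PATTERN.search(url)
    if not match:
        return None
    orcid = match.group(1).upper()
    digits = orcid.replace("-", "")
    return orcid if orcid_check_digit(digits[:15]) == digits[15] else None
```

An extracted ORCID iD is a much stronger key for finding an existing wikidata item than a name, so these editors can be looked up by identifier first.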
Extraction of the editor names and dblp author id lookup. The editors need to be added to the proceedings as editor (P98). Additionally, the order of the editors needs to be preserved with the qualifier series ordinal (P1545).
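As an illustration of preserving the editor order, the following sketch emits one statement per editor with a series ordinal qualifier. The QuickStatements V1 tab-separated output is only an assumed export format, and the function name and the ids in the usage note are hypothetical.

```python
def editor_statements(proceedings_qid, editor_qids):
    """Return one QuickStatements V1 line per editor, keeping the given order.

    Each line adds editor (P98) with the qualifier series ordinal (P1545),
    whose value is a string and is therefore quoted.
    """
    lines = []
    for ordinal, editor_qid in enumerate(editor_qids, start=1):
        fields = [proceedings_qid, "P98", editor_qid, "P1545", f'"{ordinal}"']
        lines.append("\t".join(fields))
    return lines
```

Calling `editor_statements("Q1", ["Q10", "Q20"])` with placeholder ids gives two lines whose ordinals are `"1"` and `"2"`, matching the order in which the editors appear on the volume page.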