WolfgangFahl / pyCEURmake

CEUR make python implementation
Apache License 2.0

Extraction of Editors #35

Open tholzheim opened 2 years ago

tholzheim commented 2 years ago

Extract the editor names and look up their dblp author IDs. The editors need to be added to the proceedings item as editors. Additionally, the order of the editors needs to be preserved with the qualifier series ordinal.

as example see https://www.wikidata.org/wiki/Q113531126
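A minimal sketch of how the editor order could be carried into statements, assuming the editor property is P98 and the series ordinal qualifier is P1545. The function name, the dict layout, and the QIDs are hypothetical placeholders, not part of pyCEURmake:

```python
EDITOR = "P98"            # Wikidata property "editor"
SERIES_ORDINAL = "P1545"  # Wikidata qualifier "series ordinal"


def editor_statements(editor_qids):
    """Return one editor statement per QID, numbered in input order.

    The dict layout is illustrative only; a real upload would go through
    whatever Wikidata client the project uses.
    """
    return [
        {
            "property": EDITOR,
            "value": qid,
            # series ordinal values are strings, starting at "1"
            "qualifiers": {SERIES_ORDINAL: str(position)},
        }
        for position, qid in enumerate(editor_qids, start=1)
    ]


# Placeholder identifiers, listed in the order given on the volume page
statements = editor_statements(["Q_FIRST", "Q_SECOND", "Q_THIRD"])
for statement in statements:
    print(statement["value"], statement["qualifiers"][SERIES_ORDINAL])
```

Numbering from the list position keeps the order stable even if the statements are later returned in a different sequence.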

tholzheim commented 2 years ago

The question is how we represent the editor who submitted the proceedings. We could additionally add this editor as editor-in-chief (P5769).

tholzheim commented 2 years ago

dblp has 2410 volumes with editors

PREFIX dblp: <https://dblp.org/rdf/schema#>
SELECT ?proceeding ?volume (group_concat(?editor;separator=";") as ?editors)
WHERE {
  ?proceeding dblp:publishedIn "CEUR Workshop Proceedings";
              dblp:publishedInSeriesVolume ?volume;
              dblp:editedBy ?editor.
}
GROUP BY ?proceeding ?volume 


Over all volumes there are 4696 distinct editors.

Checking Wikidata shows that for 1331 editors an item that is linked to dblp already exists. For the remaining editors, I need to check whether a missing dblp author ID already indicates, in most cases, that the Wikidata item itself does not exist.

To check this, I intend to use the disambiguation feature of OpenRefine to decide how to approach the creation/linking of the missing editors.

tholzheim commented 2 years ago

The link between dblp and Wikidata is not always bidirectional: an editor having a Wikidata ID in dblp does not always mean that the Wikidata item has the dblp author ID defined, and vice versa.
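One way to make the one-sided links visible is a plain set comparison. This is only a sketch: the two input mappings are assumed to come from separate queries against dblp and Wikidata, and the function name and sample IDs are made up for illustration:

```python
def link_gaps(dblp_to_wd, wd_to_dblp):
    """Split links into those stated on both sides and one-sided ones.

    dblp_to_wd: dict of dblp author ID -> Wikidata QID, as stated by dblp
    wd_to_dblp: dict of Wikidata QID -> dblp author ID, as stated by Wikidata
    """
    from_dblp = set(dblp_to_wd.items())
    # flip the Wikidata pairs so both sets hold (dblp ID, QID) tuples
    from_wd = {(dblp_id, qid) for qid, dblp_id in wd_to_dblp.items()}
    return {
        "both": from_dblp & from_wd,
        "only_dblp": from_dblp - from_wd,
        "only_wikidata": from_wd - from_dblp,
    }


# Hypothetical sample data
gaps = link_gaps(
    {"a/1": "Q_A", "b/2": "Q_B"},
    {"Q_A": "a/1", "Q_C": "c/3"},
)
print(gaps)
```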

tholzheim commented 2 years ago

To disambiguate the editors, we can use all available identifiers:

PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX dblp: <https://dblp.org/rdf/schema#>
PREFIX litre: <http://purl.org/spar/literal/>
SELECT DISTINCT ?editor ?editorName ?editor_wikidata ?affiliation ?primaryhomepage ?identifier ?acm ?dblp ?gepris ?github ?gnd ?googleScholar ?ieee ?isni ?lattes ?linkedin ?loc ?mathGenealogy ?orcid ?researchGate ?scigraph ?twitter ?viaf ?wikidata ?zbmath 
WHERE {
    ?proceeding dblp:publishedIn "CEUR Workshop Proceedings";
        dblp:publishedInSeriesVolume ?volume;
        dblp:editedBy ?editor.
    ?editor dblp:primaryCreatorName ?editorName.
    OPTIONAL{?editor dblp:wikidata ?editor_wikidata .}
    # orcid
    OPTIONAL {
            ?editor datacite:hasIdentifier ?orcid_blank.
        ?orcid_blank datacite:usesIdentifierScheme datacite:orcid;
            litre:hasLiteralValue ?orcid.
    }
    # google_scholar
    OPTIONAL {
            ?editor datacite:hasIdentifier ?google_scholar_blank.
        ?google_scholar_blank datacite:usesIdentifierScheme datacite:google-scholar;
            litre:hasLiteralValue ?googleScholar.
    }
    # acm
    OPTIONAL {
            ?editor datacite:hasIdentifier ?acm_blank.
        ?acm_blank datacite:usesIdentifierScheme datacite:acm;
            litre:hasLiteralValue ?acm.
    }
    # twitter
    OPTIONAL {
            ?editor datacite:hasIdentifier ?twitter_blank.
        ?twitter_blank datacite:usesIdentifierScheme datacite:twitter;
            litre:hasLiteralValue ?twitter.
    }
    # github
    OPTIONAL {
            ?editor datacite:hasIdentifier ?github_blank.
        ?github_blank datacite:usesIdentifierScheme datacite:github;
            litre:hasLiteralValue ?github.
    }
    # viaf
    OPTIONAL {
            ?editor datacite:hasIdentifier ?viaf_blank.
        ?viaf_blank datacite:usesIdentifierScheme datacite:viaf;
            litre:hasLiteralValue ?viaf.
    }
    # scigraph
    OPTIONAL {
            ?editor datacite:hasIdentifier ?scigraph_blank.
        ?scigraph_blank datacite:usesIdentifierScheme datacite:scigraph;
            litre:hasLiteralValue ?scigraph.
    }
    # zbmath
    OPTIONAL {
            ?editor datacite:hasIdentifier ?zbmath_blank.
        ?zbmath_blank datacite:usesIdentifierScheme datacite:zbmath;
            litre:hasLiteralValue ?zbmath.
    }
    # researchGate
    OPTIONAL {
            ?editor datacite:hasIdentifier ?researchGate_blank.
        ?researchGate_blank datacite:usesIdentifierScheme datacite:research-gate;
            litre:hasLiteralValue ?researchGate.
    }
    # mathGenealogy
    OPTIONAL {
            ?editor datacite:hasIdentifier ?mathGenealogy_blank.
        ?mathGenealogy_blank datacite:usesIdentifierScheme datacite:math-genealogy;
            litre:hasLiteralValue ?mathGenealogy.
    }
    # loc
    OPTIONAL {
            ?editor datacite:hasIdentifier ?loc_blank.
        ?loc_blank datacite:usesIdentifierScheme datacite:loc;
            litre:hasLiteralValue ?loc.
    }
    # linkedin
    OPTIONAL {
            ?editor datacite:hasIdentifier ?linkedin_blank.
        ?linkedin_blank datacite:usesIdentifierScheme datacite:linkedin;
            litre:hasLiteralValue ?linkedin.
    }
    # lattes
    OPTIONAL {
            ?editor datacite:hasIdentifier ?lattes_blank.
        ?lattes_blank datacite:usesIdentifierScheme datacite:lattes;
            litre:hasLiteralValue ?lattes.
    }
    # isni
    OPTIONAL {
            ?editor datacite:hasIdentifier ?isni_blank.
        ?isni_blank datacite:usesIdentifierScheme datacite:isni;
            litre:hasLiteralValue ?isni.
    }
    # ieee 
    OPTIONAL {
            ?editor datacite:hasIdentifier ?ieee_blank.
        ?ieee_blank datacite:usesIdentifierScheme datacite:ieee;
            litre:hasLiteralValue ?ieee.
    }
    # gnd
    OPTIONAL {
            ?editor datacite:hasIdentifier ?gnd_blank.
        ?gnd_blank datacite:usesIdentifierScheme datacite:gnd;
            litre:hasLiteralValue ?gnd.
    }
    #gepris
    OPTIONAL {
            ?editor datacite:hasIdentifier ?gepris_blank.
        ?gepris_blank datacite:usesIdentifierScheme datacite:gepris;
            litre:hasLiteralValue ?gepris.
    }
    #homepage
    OPTIONAL {?editor dblp:primaryHomepage ?primaryhomepage .}
    OPTIONAL{?editor dblp:primaryAffiliation ?affiliation .}
}


WolfgangFahl commented 2 years ago

@tholzheim this looks great. The disambiguation query suffers from the "not truly tabular" effect.

tholzheim commented 2 years ago

Unfortunately, none of the identifiers seems to help in this case.

[image]

- identified: only one match for all provided ids
- conflict: multiple matches for all provided ids
- unknown: no match for all provided ids

Note: the dblp id is excluded since it is present for all authors.
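The three categories can be sketched as a small function. This assumes that "match" means the set of distinct Wikidata items found across all identifiers of one editor; the function name and the input layout are hypothetical:

```python
def classify(matches_per_id):
    """Classify one editor by the Wikidata items its identifiers point to.

    matches_per_id: dict of identifier name -> set of matching Wikidata QIDs
    (the dblp id is assumed to be left out, as in the note above).
    """
    # union over all identifiers; set().union() with no arguments is empty
    items = set().union(*matches_per_id.values())
    if not items:
        return "unknown"
    if len(items) == 1:
        return "identified"
    return "conflict"


print(classify({"orcid": {"Q_A"}, "gnd": {"Q_A"}}))
print(classify({"orcid": {"Q_A"}, "gnd": {"Q_B"}}))
print(classify({"orcid": set()}))
```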

From this plot we can conclude that if at least one author ID is in Wikidata, the link to dblp has already been made (the number of identified editors and the number of already linked dblp author IDs are nearly identical). dblp seems to sync its data with Wikidata to complete its author data, so we cannot gain additional information from that direction (in terms of authors).

The conflict cases need manual curation. For the unknown cases I will look into the email, affiliation and name to see how these properties can be used to make a correct match and how error-prone this matching would be.

tholzheim commented 2 years ago

To add the more than 3000 unidentified editors to Wikidata with high confidence that their items do not yet exist, we need additional information.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT 
  (COUNT(DISTINCT ?mail) as ?numberOfEmailAddresses)
  (COUNT(DISTINCT ?homepage) as ?numberOfHomepages)
  (COUNT(?employer) as ?numberOfEmployers)
  (COUNT(?affiliation ) as ?numberOfAffiliations)
  (COUNT(?residence ) as ?numberOfResidence)
WHERE{
  ?author wdt:P106 wd:Q1650915.
  OPTIONAL{?author wdt:P968 ?mail}
  OPTIONAL{?author wdt:P856 ?homepage}
  OPTIONAL{?author wdt:P108 ?employer}
  OPTIONAL{?author wdt:P1416 ?affiliation}
  OPTIONAL{?author wdt:P551 ?residence}
}


| Email addresses | Homepages | Employer entries | Affiliation entries | Residence entries |
| --- | --- | --- | --- | --- |
| 642 | 59828 | 935757 | 17597 | 2063 |

The result suggests that an editor's homepage and affiliation (basically the same as employer) are useful for verifying whether an item belongs to an editor/author when multiple matches or an unclear match exist.

For this reason I added a parser for the editor, editor homepage, affiliation and affiliation homepage to the VolumeParser; see ef3114dca529be8e04188a14326df5aa205945de. A first test showed that for 97% of the volumes 600-3225 this information could be extracted correctly. As a next step, it needs to be analyzed whether this information helps to identify more editors or to ensure that no entry currently exists.

For the volumes 600-3225 the parser extracted:

duplicates are included since the records are not yet disambiguated

tholzheim commented 2 years ago

Reminder: dblp has 1333 editors linked to Wikidata and 3395 not linked to Wikidata. In ceur-ws we have 5294 distinct editor names (different spellings of a name can refer to the same editor) and 9912 raw editor records.

Affiliations

Official Website

Of the 3439 distinct affiliation websites, 515 can be matched exactly against official website (P856). By generating alternative website URLs, the number of matched affiliations can be increased to 1180.

For the url http://www.uni-kiel.de/ the following variants are generated and also queried

A match is then found for http://www.uni-kiel.de. This method has proven to be faster than using string contains or other string pattern functions.
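A rough sketch of such a variant generator follows. The exact variants used in the project are not shown here, so the combinations below (http/https, with and without `www.`, with and without trailing slash) are an assumption, as is the function name:

```python
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return alternative spellings of a homepage URL, sorted.

    Assumed variants: both schemes, host with and without "www.",
    path with and without a trailing slash.
    """
    parts = urlsplit(url)
    host = parts.netloc
    bare_host = host[4:] if host.startswith("www.") else host
    path = parts.path.rstrip("/")
    variants = set()
    for scheme in ("http", "https"):
        for candidate_host in (bare_host, "www." + bare_host):
            for candidate_path in (path, path + "/"):
                variants.add(
                    urlunsplit((scheme, candidate_host, candidate_path, parts.query, ""))
                )
    return sorted(variants)


variants = url_variants("http://www.uni-kiel.de/")
for variant in variants:
    print(variant)
```

All variants can then be passed to the query in one `VALUES` clause, which keeps the lookup an exact match instead of a string-pattern filter.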

Query

SELECT ?affiliation ?affiliationLabel ?homepage
WHERE 
{
VALUES ?homepage {}
?affiliation wdt:P856 ?homepage
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
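The `VALUES` clause above is empty because it is filled in per request. A minimal sketch of how it might be filled from Python, assuming the homepages are passed as IRIs in angle brackets (P856 values are IRIs in the Wikidata RDF export); the template constant and function name are hypothetical:

```python
QUERY_TEMPLATE = """SELECT ?affiliation ?affiliationLabel ?homepage
WHERE
{
  VALUES ?homepage { %s }
  ?affiliation wdt:P856 ?homepage .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

# characters that are not allowed inside a SPARQL IRI reference
FORBIDDEN = set('<>"{}|^`\\ ')


def homepage_query(urls):
    """Build the homepage lookup query for a batch of candidate URLs."""
    safe_urls = [url for url in urls if not FORBIDDEN & set(url)]
    values = " ".join(f"<{url}>" for url in safe_urls)
    return QUERY_TEMPLATE % values


query = homepage_query(["http://www.uni-kiel.de", "http://www.uni-kiel.de/"])
print(query)
```

URLs containing characters that would break the IRI syntax are skipped here rather than escaped; that is a simplification.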

Label

The affiliation label often contains the country or other location information, for example University of Bremen, Cartesium, 28359 Bremen, Germany. Note: currently only the first part of the label is used; for the example this means we only search for University of Bremen. Matching the country, region, city, and/or address could increase and validate the matches, but it leads to a more complex query and procedure and is therefore omitted for the time being.
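Taking the first part of the label could look like the following, assuming the parts are separated by commas as in the example; the function name is hypothetical:

```python
def affiliation_name(label):
    """Return the part of an affiliation label before the first comma."""
    return label.split(",", 1)[0].strip()


name = affiliation_name("University of Bremen, Cartesium, 28359 Bremen, Germany")
print(name)
```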

Query


SELECT DISTINCT ?affiliation ?affiliationLabel ?inputLabel
WHERE 
{
VALUES ?inputLabel {%s}

?affiliation rdfs:label|skos:altLabel ?inputLabel;
             ^wdt:P108 ?editor. # item has employees
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

## Result
```mermaid
pie showData
    title Matching of affiliations against Wikidata items
    "by label" : 492
    "by website" : 503
    "by label and website" : 677
    "no match" :  1771
```

```mermaid
pie showData
    title Matched affiliations of editor records
    "editor record with matched affiliation" : 5099
    "editor record without matched affiliation" : 4813
```

Note: here all editor records are considered and not the distinct editor names since the affiliations can differ and thus link to different persons.

Editors

From the 4225 distinct editor websites

For each editor who is not already linked to Wikidata, query for the name and try to narrow down the results by including the affiliation homepage.