code-openness / Data

The pre-processing and formatting of the data to set up the Wikidata instance

Re-design the data model #19

Closed AbdBarho closed 5 years ago

AbdBarho commented 5 years ago

depends on #18 and #17

hannahtro commented 5 years ago

Data model based on wikidata model:

[image: wikidata]

Note: arrows without a label signify 'subclass of'

problems:

hannahtro commented 5 years ago

[image: wikidata]

AbdBarho commented 5 years ago

Research the Wikidata data model to find properties that correspond to our column names and publications.

hannahtro commented 5 years ago

| Frequency | Column in PIK data set | Wikidata property | Comment |
| --- | --- | --- | --- |
| 8261 | title | title P1476 | |
| 8258 | keywords | | |
| 8235 | year | publication date P577 | |
| 8080 | authors | author P50 | |
| 7796 | publisher | publisher P123 | |
| 6299 | startpage | number of pages P1104 | merge startpage and endpage, map to number of pages |
| 6034 | endpage | number of pages P1104 | |
| 4493 | journal | academic journal Q737498 | only paperr, papern, newspaper and inbook have entry for journal; link article to journal |
| 4468 | x4 ( = DOI / Identifier) | DOI P356 | |
| 3879 | vol | volume P478 | |
| 3462 | issue | issue P433 | |
| 2922 | place | place of publication P291 | |
| 1656 | editors | editor P98 | |
| 1516 | booktitle | | only inbook, inreport, confpaper, proceedings, epup have entry for booktitle |
| 1340 | relation (= Serie) | part of the Series P179 | |
| 974 | link | | |
| 921 | comment | | |
| 385 | conference | | |

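
The mapping in the table could be applied to a single record roughly as follows. This is only a sketch, not the project's actual code: the dictionary keys assume simplified column names (for example `x4` for the DOI column), `map_row` and `number_of_pages` are hypothetical helpers, and the page count assumes an inclusive range (endpage - startpage + 1). The journal column is left out because Q737498 is an item, not a property.

```python
# Hypothetical mapping from PIK column names to Wikidata property IDs,
# copied from the table. Columns without a matching property are omitted.
PIK_TO_WIKIDATA = {
    "title": "P1476",      # title
    "year": "P577",        # publication date
    "authors": "P50",      # author
    "publisher": "P123",   # publisher
    "x4": "P356",          # DOI (assumed short name for the x4 column)
    "vol": "P478",         # volume
    "issue": "P433",       # issue
    "place": "P291",       # place of publication
    "editors": "P98",      # editor
    "relation": "P179",    # part of the series
}


def number_of_pages(startpage, endpage):
    """Merge startpage and endpage into a page count for P1104.

    Returns None when either value is missing or not a plain integer,
    or when the range is reversed. Assumes an inclusive page range.
    """
    try:
        start, end = int(startpage), int(endpage)
    except (TypeError, ValueError):
        return None
    if end < start:
        return None
    return end - start + 1


def map_row(row):
    """Translate one PIK record (a dict) into {property ID: value}."""
    claims = {}
    for column, prop in PIK_TO_WIKIDATA.items():
        value = row.get(column)
        if value not in (None, ""):
            claims[prop] = value
    pages = number_of_pages(row.get("startpage"), row.get("endpage"))
    if pages is not None:
        claims["P1104"] = pages
    return claims
```

Rows with empty or non-numeric page values simply get no P1104 claim, which matches the lower frequency of startpage and endpage in the table.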
hannahtro commented 5 years ago

Questions: Where do we add P1433 venue (published in (not place))? Where do we add P921 topic (main subject)? What do we do with missing PIK properties?

AbdBarho commented 5 years ago

P1433: There are some inconsistencies between how the data looks and how Scholia requests it. For example, in Scholia's author.html we see the following request:

?work wdt:P1433 ?venue .

whereas the official page of P1433 describes the property as: larger work that a given work was published in, like a book, journal or music album. A venue, on the other hand, is the physical place where something was published, but here it is used in the sense of 'part of'. Maybe it is just a naming problem.
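
To see which items Scholia would treat as venues for a given author, a query combining P50 and P1433 can be generated as in the sketch below. The function name `works_with_venue_query` is hypothetical and the query is not copied from Scholia. It only reuses the two triple patterns quoted in this thread and assumes the `wd:` and `wdt:` prefixes are predefined by the endpoint.

```python
import re


def works_with_venue_query(author_qid):
    """Build a SPARQL query listing an author's works and their P1433 value.

    P1433 is 'published in', so ?venue will bind to a journal or book item,
    not to a physical place.
    """
    # Accept only well-formed item IDs such as Q24244119.
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    return (
        "SELECT ?work ?venue WHERE {\n"
        f"  ?work wdt:P50 wd:{author_qid} .\n"
        "  ?work wdt:P1433 ?venue .\n"
        "}"
    )


print(works_with_venue_query("Q24244119"))
```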

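P921: we have the column keywordsAndPeerReview (also named x1 ( =Feld ""Keyword""; u.a. belegt mit Info zu peer-review, wenn kein ISI-Journal), roughly: the "Keyword" field, which among other things holds peer-review information when the journal is not an ISI journal), which might be a good candidate.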

For the most part we can ignore the remaining PIK properties in order to deliver the first prototype on time. Additional input from PIK is needed on how important this information is and how it ties in with the other values we have.

AbdBarho commented 5 years ago

Author inconsistencies: take the following example query for the author Didier Musso:

select ?work where {
  ?work wdt:P50 wd:Q24244119 .
}

We can see that he has published some work, and Scholia would recognize him as an author. However, on his Wikidata page we find no mention of the class author anywhere. The only link is through the occupation property, which has the value researcher, which is itself a subclass of creator (author is also a subclass of creator).

This further leads to the assumption that the whole data model is built on relationships between the different items rather than on class hierarchies.
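
The difference between the two views can be illustrated with a toy triple set. This is a minimal sketch: the triples, the `work-1` identifier and the plain-text values `researcher` and `author` are made up for illustration and stand in for real Wikidata items. Only P50 (author), P31 (instance of) and P106 (occupation) are real property IDs.

```python
# Toy (subject, property, object) triples; values are illustrative only.
TRIPLES = [
    ("work-1", "P50", "Q24244119"),       # work-1 has author Q24244119
    ("Q24244119", "P106", "researcher"),  # occupation: researcher
]


def is_author_by_relationship(person, triples):
    """True if any work points to the person via P50 (author)."""
    return any(p == "P50" and o == person for _, p, o in triples)


def is_author_by_class(person, triples):
    """True only if the person is declared an instance of 'author' via P31."""
    return any(
        s == person and p == "P31" and o == "author" for s, p, o in triples
    )


print(is_author_by_relationship("Q24244119", TRIPLES))
print(is_author_by_class("Q24244119", TRIPLES))
```

Under these assumptions the relationship check succeeds while the class check fails, which mirrors what the example query and the Wikidata page show for this author.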

AbdBarho commented 5 years ago

Initial draft of the new data model

Publications

a publication is an instance of (P31) one of the following items:

Authors & Editors

Others