The files from Shenmeng (email 2022-02-18 filed in vandycite) are in /projects/wikidata/publications/ir/ as etd_metadata_list.xlsx and etd_metadata.csv
Notes from Chris Benda 2022-02-21 emails:
I was looking at some of the thesis data in the Excel file Shenmeng sent out, and I notice some date complications:
Earlier theses have the date of the thesis in dc.date.issued[] (and sometimes also in dc.date.available[], though these two dates can be different), while later theses are using other DC fields. This seems to be an issue of mapping.
Some of the dates for the earlier theses don’t match the dates on the title pages of the items themselves. See, e.g., row 407 in the Excel file (id 4f829aaf-6f7c-484a-9bdc-6754ce96cdd1): the three dates are dc.date.accessioned (2020-08-21T20:55:36Z), dc.date.available[] (1/18/2010), and dc.date.issued[] (also 1/18/2010). The date of “publication” according to the title page is May 2010 (see http://hdl.handle.net/1803/10424). This date is correctly reflected in the WorldCat record for this thesis (OCLC #500908449). (Many if not most of the earlier theses are on WorldCat, but I understand a decision was made to no longer add thesis records to WorldCat.) This is an issue of data.
I thought I’d mention this here rather than getting into the weeds in our upcoming meeting. The first point may be a Wikidata-related issue, but the second is more a question of data accuracy.
FWIW: here are the Scholia pages for VU Religious Studies faculty in Wikidata (view the whole list of RS faculty to see who’s missing):
Tony K. Stewart (emeritus)
Alexis Wells-Oghoghomeh (now at Stanford)
Perhaps not half, but I’m not counting Rachel Heath (a Vandy PhD student), Ira Helderman (visiting professor; his thesis data is probably in the IR), Teresa Smallwood (she now works for an institution in Philly, I think), and Juan Floyd Thomas (part of the Div School; Div School faculty generally haven’t been added to Wikidata).
A SPARQL query for instance of (P31) dissertation Q1385450 produced 234 results. The same query for instance of thesis Q1266946 produced 16770 results. https://w.wiki/4wyv
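For reference, a minimal sketch of how counts like these could be requested. The exact query behind the short link above is not reproduced in these notes, so the query text and the helper name build_count_query are assumptions for illustration only.

```python
def build_count_query(class_qid: str) -> str:
    """Build a SPARQL query that counts items whose P31 (instance of) is class_qid.

    Hypothetical helper: the real query behind the short link may differ.
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Q ID: {class_qid!r}")
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        "}"
    )


# The two classes compared in the notes.
dissertation_query = build_count_query("Q1385450")
thesis_query = build_count_query("Q1266946")
print(thesis_query)
```

The resulting string can be pasted into the Wikidata Query Service. A COUNT query returns a single row, which tends to be lighter than downloading every item.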
Tried to run count_entities.py with this query, but it timed out. Instead, downloaded the result and ran the script using the first 200 Q IDs. Results:
Note that "thesis" Q1266946 is much more widely used for P31 than "dissertation". The thesis subclasses (doctoral thesis, master's thesis, etc.) are also used as second values. Other P31 values used are those typically used for academic books (such as "book" or "version, edition, or translation").
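The workaround of running the script on only the first 200 Q IDs from the downloaded result could be sketched roughly as follows. This assumes the download is a CSV whose column is named item and holds full entity URIs; the column name and the helper first_qids are hypothetical, and count_entities.py itself is not shown here.

```python
import csv
import io


def first_qids(csv_text: str, column: str = "item", limit: int = 200) -> list[str]:
    """Return up to `limit` distinct Q IDs from a downloaded query result.

    Assumes each cell holds an entity URI such as
    http://www.wikidata.org/entity/Q42 (the column name is an assumption).
    """
    qids: list[str] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if len(qids) >= limit:
            break
        # Keep only the part after the last slash: .../entity/Q42 -> Q42
        qid = row[column].rsplit("/", 1)[-1]
        if qid not in qids:
            qids.append(qid)
    return qids


# Tiny in-memory sample using the two items mentioned in these notes.
sample = (
    "item\n"
    "http://www.wikidata.org/entity/Q111043885\n"
    "http://www.wikidata.org/entity/Q90725428\n"
)
print(first_qids(sample))
```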
Ran a similar query https://w.wiki/4w$M that just looked for doctoral theses and masters theses. It got 39 000 hits.
Ran exhaustive query https://w.wiki/4w$P to look for all kinds of theses and dissertations and got 56 000 hits.
Ran a modified query https://w.wiki/4w$S for authors educated at Vanderbilt University and got 11 hits. Nine were actually submitted to Vanderbilt. Greg created a lot of the Quarterman student theses and Jeff did some others.
The Handle property is "Handle ID" P1184 and it's used on the thesis https://www.wikidata.org/wiki/Q111043885
Note that this example also uses "thesis committee member" P9161 as a property of the thesis itself. Used this query https://w.wiki/4w$A to discover the types of items on which this property is used. They are nearly always doctoral or master's theses.
https://www.wikidata.org/wiki/Q90725428 uses "student of" as a qualifier of the author to link the author to the advisor.
Created date vs. issued date: the earlier one is the publication date; the later one is the archived date qualifier for the handle. Chris noted that Dublin Core has specific terms with definitions, so we could go by those.
Chris suggests just using the publication year to avoid the problem of date inconsistencies.
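A minimal sketch of the year-only approach, assuming the date strings look like the ones quoted in Chris's email (an ISO timestamp and an M/D/YYYY date). The list of formats and the function names are assumptions, not a description of the actual export.

```python
from datetime import datetime

# Formats seen in the notes, plus a few plausible ISO variants (assumed).
FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%m/%d/%Y", "%Y-%m-%d", "%Y-%m", "%Y")


def publication_year(value):
    """Return the year of a date string, or None if no format matches."""
    text = value.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).year
        except ValueError:
            continue
    return None


def earliest_year(values):
    """Return the earliest parseable year among several date fields."""
    years = [y for y in map(publication_year, values) if y is not None]
    return min(years) if years else None


# The three dates from row 407: accessioned, available, issued.
row_407 = ["2020-08-21T20:55:36Z", "1/18/2010", "1/18/2010"]
print(earliest_year(row_407))
```

For row 407 the issued date (1/18/2010) and the title-page date (May 2010) fall in the same year, so reducing to the year hides that mismatch. It would not help where the two dates fall in different years.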
Steve will look into the author disambiguation process and demo the mapping at the next meeting.
If there is a local.embargo.lift date and it is in the future, don't use the property "full text available". If it is in the past, use "full text available".
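The embargo rule could be sketched as below. It assumes local.embargo.lift is stored as an ISO date (YYYY-MM-DD), that a missing value means no embargo, and that an embargo lifting today counts as lifted. None of these three points was settled in the notes, and the function name is hypothetical.

```python
from datetime import date


def use_full_text_available(embargo_lift, today=None):
    """Decide whether to add the "full text available" statement.

    embargo_lift: value of local.embargo.lift as 'YYYY-MM-DD', or None/''.
    today: injectable for testing; defaults to the current date.
    """
    if today is None:
        today = date.today()
    if not embargo_lift:
        # Assumption: no embargo date recorded means the text is available.
        return True
    # Past (or today, by assumption) -> use the property; future -> skip it.
    return date.fromisoformat(embargo_lift.strip()) <= today


meeting_day = date(2022, 2, 21)
print(use_full_text_available("2021-06-01", today=meeting_day))
print(use_full_text_available("2030-01-01", today=meeting_day))
```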
Investigate the following: