HeardLibrary / vandycite


Research for next VandyCite meeting March 14 #68

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

Investigate the following:

baskaufs commented 2 years ago

The files from Shenmeng (email 2022-02-18 filed in vandycite) are in /projects/wikidata/publications/ir/ as etd_metadata_list.xlsx and etd_metadata.csv

baskaufs commented 2 years ago

Notes from Chris Benda 2022-02-21 emails:

I was looking at some of the thesis data in the Excel file Shenmeng sent out, and I notice some date complications:

Earlier theses have the date of the thesis in dc.date.issued[] (and sometimes also in dc.date.available[], though these two dates can be different), while later theses are using other DC fields. This seems to be an issue of mapping.

Some of the dates for the earlier theses don’t match the dates on the title pages of the items themselves. See, e.g., row 407 in the Excel file (id 4f829aaf-6f7c-484a-9bdc-6754ce96cdd1): the three dates are dc.date.accessioned (2020-08-21T20:55:36Z), dc.date.available[] (1/18/2010), and dc.date.issued[] (also 1/18/2010). The date of “publication” according to the title page is May 2010 (see http://hdl.handle.net/1803/10424). This date is correctly reflected in the WorldCat record for this thesis (OCLC #500908449). (Many if not most of the earlier theses are on WorldCat, but I understand a decision was made to no longer add thesis records to WorldCat.) This is an issue of data.

I thought I’d mention this here rather than getting into the weeds in our upcoming meeting. The first point may be a Wikidata-related issue, but the second is more a question of data accuracy.


FWIW: here are the Scholia pages for VU Religious Studies faculty in Wikidata (view the whole list of RS faculty to see who’s missing):

Richard McGregor

Adeana McNicholl

Laurel Schneider

Anand Vivek Taneja

Tony K. Stewart (emeritus)

Alexis Wells-Oghoghomeh (now at Stanford)

Perhaps not half, but I’m not counting Rachel Heath (a Vandy PhD student), Ira Helderman (visiting professor; his thesis data is probably in the IR), Teresa Smallwood (she now works for an institution in Philly, I think), and Juan Floyd Thomas (part of the Div School; Div School faculty generally haven’t been added to Wikidata).

baskaufs commented 2 years ago

A SPARQL query for instance of dissertation (Q1385450) produced 234 results; the same query for instance of thesis (Q1266946) produced 16770 results. https://w.wiki/4wyv

Tried to run count_entities.py with this query, but it timed out. Instead, downloaded the results and ran the script using the first 200 Q IDs. Results: [image]

Note that "thesis" Q1266946 is much more widely used for P31 than dissertation. The thesis subclasses are also used as second values (doctoral thesis, masters thesis, etc.). Other values used are those typically used for academic books (such as "book" or "version, edition or translation").
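
For the timed-out count above, a minimal sketch of the batching workaround: split the downloaded Q IDs into groups of 200 and build one small query per group. This is not how count_entities.py works internally (that isn't recorded here); `chunk_qids` and `property_count_query` are hypothetical helper names, and the query shape is an assumption.

```python
from typing import Iterator, List


def chunk_qids(qids: List[str], size: int = 200) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` Q IDs."""
    for start in range(0, len(qids), size):
        yield qids[start:start + size]


def property_count_query(qids: List[str]) -> str:
    """Build a SPARQL query counting how many of the given items use each property.

    Hypothetical query shape, restricted to the listed items via VALUES so the
    query service only has to look at one batch at a time.
    """
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?property (COUNT(DISTINCT ?item) AS ?items) WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item ?p ?value .\n"
        "  ?property wikibase:directClaim ?p .\n"
        "}\n"
        "GROUP BY ?property\n"
        "ORDER BY DESC(?items)"
    )
```

Each batch's query string could then be sent to the query service separately and the per-property counts summed afterwards.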

Ran a similar query https://w.wiki/4w$M that just looked for doctoral theses and masters theses. It got 39 000 hits.

Ran exhaustive query https://w.wiki/4w$P to look for all kinds of theses and dissertations and got 56 000 hits.

Ran a modified query https://w.wiki/4w$S for authors educated at Vanderbilt University and got 11 hits. Nine were actually submitted to Vanderbilt. Greg created a lot of the Quarterman student theses and Jeff did some others.

The Handle property is "Handle ID" P1184 and it's used on the thesis https://www.wikidata.org/wiki/Q111043885

Note that this example also uses the property "thesis committee member" P9161 as a property of the thesis itself. Used this query https://w.wiki/4w$A to discover the types of things with which this property is used. They are nearly always a doctoral or masters thesis.

https://www.wikidata.org/wiki/Q90725428 uses "student of" as a qualifier of the author to link the author to the advisor.

baskaufs commented 2 years ago

Created date vs. issued date: the earlier one is the publication date; the later one is the archived date qualifier for the handle. Chris noted that Dublin Core has specific terms with definitions, so we could go by those.

baskaufs commented 2 years ago

Chris suggests just using the publication year to avoid the problem of date inconsistencies.
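
A rough sketch of the year-only approach, assuming the date columns contain either ISO timestamps (like 2020-08-21T20:55:36Z) or US-style dates (like 1/18/2010), as in the row Chris cited. The list of formats and the function names are assumptions for illustration; the real spreadsheet may contain other formats.

```python
from datetime import datetime
from typing import Iterable, Optional

# Assumed formats, based only on the examples quoted in this thread.
FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%m/%d/%Y", "%Y-%m", "%Y")


def parse_date(text: str) -> Optional[datetime]:
    """Try each assumed format in turn; return None if nothing matches."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def publication_year(dates: Iterable[str]) -> Optional[int]:
    """Return the year of the earliest parseable date, or None if none parse."""
    parsed = [d for d in (parse_date(t) for t in dates) if d is not None]
    return min(parsed).year if parsed else None
```

For the row 407 example, the issued date (1/18/2010) and the title-page date (May 2010) both reduce to 2010, so the mismatch disappears at year precision.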

baskaufs commented 2 years ago

Steve will look into the author disambiguation process and demo the mapping at the next meeting.

baskaufs commented 2 years ago

If local.embargo.lift is in the future, don't use the property "full text available". If it's in the past, then we use full text available.
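
The embargo rule could be sketched as below. Assumptions: local.embargo.lift holds an ISO date (YYYY-MM-DD, possibly followed by a time), a blank value means no embargo, and a lift date equal to today counts as lifted. None of these is confirmed by the metadata file; `include_full_text` is a hypothetical name.

```python
from datetime import date
from typing import Optional


def include_full_text(embargo_lift: Optional[str], today: Optional[date] = None) -> bool:
    """Decide whether to write the "full text available" statement."""
    today = today or date.today()
    if not embargo_lift or not embargo_lift.strip():
        # Assumption: a blank embargo field means the thesis is not embargoed.
        return True
    # Assumption: the value starts with an ISO date; any time part is ignored.
    lift = date.fromisoformat(embargo_lift.strip()[:10])
    # Future lift date: still embargoed, so skip the statement.
    return lift <= today
```

Passing `today` explicitly keeps the check reproducible when rerunning the upload script on a later date.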