HeardLibrary / vandycite


prepare to upload ETDs #75

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

Disambiguation and cleanup prior to writing all of the electronic theses and dissertations (ETDs) to Wikidata.

baskaufs commented 2 years ago

Ran this query to look for theses and dissertations by students at Vanderbilt: https://w.wiki/54fp. Most results were pre-2000, and only one of the post-2000 ones was in the dataset we are working with.

Searched the Wikidata Query Service (WDQS) for Handles and only got the one discovered above and our test write. See the "Search for duplicates by querying for Handles" section of process_etd_data.ipynb.
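A minimal sketch of how a Handle lookup query like the one described above could be assembled. This is not the code from process_etd_data.ipynb; the function name `build_handle_query` is hypothetical, the Handle values in the usage line are made-up placeholders, and P1184 is assumed to be the Wikidata "Handle ID" property (check before relying on it).

```python
def build_handle_query(handles):
    """Return a SPARQL query that finds Wikidata items having any of the given Handles."""
    # Escape backslashes and double quotes so each Handle is a valid SPARQL string literal.
    values = " ".join(
        '"' + h.replace("\\", "\\\\").replace('"', '\\"') + '"' for h in handles
    )
    # ASSUMPTION: P1184 is the "Handle ID" property on Wikidata.
    return (
        "SELECT DISTINCT ?item ?handle WHERE {\n"
        f"  VALUES ?handle {{ {values} }}\n"
        "  ?item wdt:P1184 ?handle .\n"
        "}"
    )

# Placeholder Handles, for illustration only.
query = build_handle_query(["1803/00001", "1803/00002"])
print(query)
```

The resulting string can be pasted into the query service or sent to its SPARQL endpoint; any item returned is a candidate duplicate that should be reconciled before writing.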

Monkey-wrenched the identifiers for Q111043885, which had already been written, to prevent duplication of its claims. (Q111581666 already had UUID and hash identifiers from the write.)

Did find and replace to change the dates to 2022-02-18 (Shenmeng's download date).

baskaufs commented 2 years ago

Removed four theses because their type couldn't be determined. In at least one case, it was a capstone, not a thesis. Saved in cleanup/removed_undetermined_type.csv
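A rough sketch of this kind of split, assuming the records are dictionaries with a column named `type` that is blank when the type could not be determined. The column name and both helper functions are hypothetical, not taken from the notebook.

```python
import csv
import io

def split_by_type(rows, type_field="type"):
    """Separate rows with a determined type from rows whose type is blank or missing."""
    kept, removed = [], []
    for row in rows:
        if (row.get(type_field) or "").strip():
            kept.append(row)
        else:
            removed.append(row)
    return kept, removed

def rows_to_csv(rows, fieldnames):
    """Serialize rows as CSV text so the removed records can be saved for later review."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the output of `rows_to_csv(removed, fieldnames)` to a file such as cleanup/removed_undetermined_type.csv keeps the dropped records available in case their type is worked out later.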

baskaufs commented 2 years ago

Also, had previously replaced regular double quotes in labels with single quotes (titles were left unchanged) to prevent VanderBot from crashing when querying for duplicates. "Smart quotes" are OK and were not changed.
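The replacement described above amounts to something like the following sketch. The function name `sanitize_label` is hypothetical. One plausible reason for the crash, not confirmed in this thread, is that a straight double quote ends the string literal early when the label is embedded in a query.

```python
def sanitize_label(title):
    """Derive a label from a title by swapping straight double quotes for single quotes.

    Curly ("smart") quotes are different characters and are left untouched.
    """
    return title.replace('"', "'")

print(sanitize_label('A Study of "Place" in Fiction'))
print(sanitize_label("A Study of \u201cPlace\u201d in Fiction"))
```

Only the label value is passed through this function; the title statement keeps the original string.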

baskaufs commented 2 years ago

Other problems discovered during upload:

baskaufs commented 2 years ago

Decided to eliminate series ordinal for committee members, since the first-listed committee member in the data is not always the first listed on the title page, and it is not guaranteed that even the first-listed member is the chair. Figuring out the correct order will basically have to be some kind of student project.