Closed by baskaufs 2 years ago
Ran this query (https://w.wiki/54fp) to look for theses and dissertations by students at Vanderbilt. Most results were pre-2000, and only one of the post-2000 results was in the dataset we are working with.
Searched the Wikidata Query Service for Handles and found only the one discovered above and our test write. See the "Search for duplicates by querying for Handles" section of process_etd_data.ipynb.
Monkey-wrenched the identifiers for Q111043885, which had already been written, to prevent duplication of its claims. (Q111581666 already had UUID and hash identifiers from the write.)
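For reference, a Handle lookup of this kind could look roughly like the sketch below. It is not the code from process_etd_data.ipynb. It assumes that Handles are stored under the Wikidata property P1184 (Handle ID), and the function names and User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
HANDLE_PROPERTY = "P1184"  # assumption: Wikidata "Handle ID" property


def build_handle_query(handles):
    """Build a SPARQL query for items whose Handle is in `handles`."""
    # json.dumps gives a double-quoted, escaped literal for each Handle
    values = " ".join(json.dumps(h) for h in handles)
    return (
        "SELECT ?item ?handle WHERE {\n"
        f"  VALUES ?handle {{ {values} }}\n"
        f"  ?item wdt:{HANDLE_PROPERTY} ?handle .\n"
        "}"
    )


def find_existing_handles(handles, user_agent="etd-duplicate-check/0.1 (example)"):
    """Send the query to the Query Service; return {handle: item IRI}."""
    params = urllib.parse.urlencode(
        {"query": build_handle_query(handles), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params, headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return {
        b["handle"]["value"]: b["item"]["value"]
        for b in data["results"]["bindings"]
    }
```

Any Handle that comes back in the result already has an item, so that row should be treated as an existing item rather than written as new.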
Did find and replace to change the dates to 2022-02-18 (Shenmeng's download date).
Removed four theses because their type couldn't be determined. In at least one case, it was a capstone, not a thesis. Saved in cleanup/removed_undetermined_type.csv
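A removal step like this could be sketched as follows. The column name `type` and both function names are hypothetical, and the actual removal may well have been done by hand.

```python
import csv


def split_by_type(rows, type_column="type"):
    """Separate rows with a determined type from rows without one."""
    kept, removed = [], []
    for row in rows:
        # Treat a missing or blank value as "type undetermined"
        if (row.get(type_column) or "").strip():
            kept.append(row)
        else:
            removed.append(row)
    return kept, removed


def save_rows(rows, path):
    """Write the given rows to a CSV file so they are not lost."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `save_rows(removed, "cleanup/removed_undetermined_type.csv")` would keep a record of what was dropped.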
Had also previously replaced regular double quotes in labels with single quotes (titles were left unchanged) to prevent VanderBot from crashing when querying for duplicates. "Smart quotes" are OK and were not changed.
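The label cleanup amounts to something like the sketch below. The function name is made up for illustration, and the replacement may have been done with a spreadsheet or editor rather than code.

```python
def clean_label(title):
    """Derive a label from a title, replacing straight double quotes only."""
    # Smart quotes (U+201C, U+201D) are left as they are
    return title.replace('"', "'")


title = 'A study of "free" markets'
label = clean_label(title)  # the title itself keeps its double quotes
```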
Other problems discovered during upload:
Decided to eliminate series ordinal for committee members, since the first-listed committee member in the data is not always the first listed on the title page, and even the first-listed member is not guaranteed to be the chair. Working out the correct order and the chair will probably have to be a student project.
Disambiguation and cleanup prior to writing all of the electronic theses and dissertations