Adding books to the dataset involves the following steps:
1. Identify a set of philosophy of science books, with an identifier for each (ISBN, LCCN, etc.).
2. Retrieve metadata for these books using this identifier.
3. (If possible) retrieve some text (blurb/abstract, document-term matrix) for these books.
Challenges
Step 1. I looked into this in Jan-Feb 2018, after attending a HathiTrust workshop. IIRC there is no Library of Congress Classification (LCC) class for philosophy of science, so we can't simply pull all of the books in this category.
In addition, I seem to recall that the Library of Congress web catalog offered no way to bulk-export thousands of records. So even if we decide to pull in one LCC class as a subset of philosophy of science, I couldn't find a way to export all of the records in that class.
See, however, the Library of Congress SRU API. AFAICT there's no R package that interfaces with this API.
Step 2. In theory the Library of Congress catalog has all the metadata we would like. (Though maybe not the names of contributors to anthologies?) Possibly the SRU API would solve both step 1 and step 2, but this can't be determined without delving into the API.
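As a starting point for delving into the API, here is a minimal sketch of how paged SRU searchRetrieve requests could be constructed. The parameter names (operation, version, query, startRecord, maximumRecords, recordSchema) come from the SRU 1.1 standard. Everything else is an assumption for illustration: the base URL is a placeholder, the CQL index name and record schema name depend on what the server supports, and the ISBN is a dummy value. Python is used only for illustration; the resulting URLs could be fetched from R with any HTTP client.

```python
from urllib.parse import urlencode


def sru_search_urls(base_url, cql_query, total_records,
                    page_size=100, record_schema="marcxml"):
    """Yield SRU searchRetrieve URLs that page through `total_records` hits.

    SRU numbers records from 1, so pages start at 1, 1 + page_size, ...
    `record_schema` is an assumed schema name; check the server's
    explain response for the names it actually accepts.
    """
    for start in range(1, total_records + 1, page_size):
        params = {
            "operation": "searchRetrieve",
            "version": "1.1",
            "query": cql_query,
            "startRecord": start,
            # The last page may be shorter than page_size.
            "maximumRecords": min(page_size, total_records - start + 1),
            "recordSchema": record_schema,
        }
        yield f"{base_url}?{urlencode(params)}"


# Hypothetical usage: placeholder endpoint, dummy ISBN, assumed index name.
example_urls = list(
    sru_search_urls("https://example.org/sru", "bath.isbn=0000000000", 250)
)
```

The total number of matching records is not known in advance; in practice it would be read from the numberOfRecords element of the first response and then used to drive the paging.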
Step 3. The HathiTrust Extracted Features (HTEF) dataset contains page-level document-term matrices for a subset of the Google Books dataset. AFAIK this is the largest public dataset of its kind. However, when I checked in Jan-Feb 2018, it did not contain some notable philosophy of science works, such as Helen Longino's Science as Social Knowledge. This may mean that the HTEF dataset would not be useful for our project.
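If the HTEF dataset does turn out to cover enough of our books, the page-level counts would need to be collapsed into one row per book. The sketch below shows one way to do that, assuming each volume is a JSON object with page entries under features.pages and per-token part-of-speech counts under body.tokenPosCount. That layout should be checked against the current HTEF documentation, and the sample volume is made-up data, not a real record.

```python
from collections import Counter


def volume_term_counts(ef_volume):
    """Sum page-level body token counts into one Counter per volume.

    Assumes the layout features -> pages -> body -> tokenPosCount, where
    tokenPosCount maps token -> {part-of-speech tag: count}.
    Tokens are lowercased and part-of-speech tags are ignored.
    """
    totals = Counter()
    for page in ef_volume["features"]["pages"]:
        body = page.get("body") or {}  # some pages may have no body text
        for token, pos_counts in (body.get("tokenPosCount") or {}).items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals


# Made-up two-page volume in the assumed layout, for illustration only.
sample_volume = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"Science": {"NN": 2},
                                        "social": {"JJ": 1}}}},
            {"body": {"tokenPosCount": {"science": {"NN": 1},
                                        "knowledge": {"NN": 4}}}},
        ]
    }
}

sample_counts = volume_term_counts(sample_volume)
```

Each resulting Counter is one row of a book-level document-term matrix; stacking the rows for all volumes gives the full matrix.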