Bookworm-project / Hathitrust-Bookworm

A full text Bookworm on Public Domain Hathitrust works

Replace HTMetadata with Bookworm-MARC? #3

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

It's been my intention to replace the existing HTMetadata module here with the new Bookworm_MARC one I wrote in the spring. The goal is to pull the information we can directly from the MARC files, rather than intermediating through the Solr index.

I recommend this (in part because I think the date parsing is significantly better, and it captures some fields I think matter a lot like contributing library), but it's possible this is not the best way to integrate all the existing work at HTRC. My original hope was that Bookworm-MARC would bundle some HTRC code, but that ship has sort of sailed.

Another possibility is to use the MARC fields by default, but create a second supplemental table from Solr and load those in using bookworm add_metadata. Or vice-versa; Solr primarily, and MARC for supplemental information.
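As a rough illustration of the precedence idea only (not of how `bookworm add_metadata` itself works), here is a minimal sketch in which one source is primary and the other fills gaps. The function name `merge_metadata` and the field names are hypothetical; swapping the arguments gives the "Solr primarily, MARC for supplemental" variant.

```python
def merge_metadata(primary, supplemental):
    """Combine two metadata dicts for one volume.

    Values from `primary` win whenever they are non-empty; `supplemental`
    only fills in fields that the primary source lacks or leaves blank.
    Both dict layouts are assumptions made for this sketch.
    """
    merged = dict(supplemental)
    # Overwrite with primary values, skipping empty placeholders.
    merged.update({k: v for k, v in primary.items() if v not in (None, "", [])})
    return merged


# Hypothetical records: MARC as primary, Solr as supplemental.
marc_record = {"date": 1887, "publisher": ""}
solr_record = {"publisher": "Harper", "genre": "fiction"}
combined = merge_metadata(marc_record, solr_record)
```

Here the blank MARC `publisher` does not overwrite the Solr value, while the MARC `date` is kept.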

There's also the question of whether we should use first_publisher (as I do in MARC) or any_publisher.
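To make the two options concrete, here is a small sketch assuming the publisher statements for a record have already been extracted as a list of strings. The function name `publisher_fields` and the punctuation cleanup are illustrative assumptions, not the Bookworm-MARC implementation.

```python
def publisher_fields(publishers):
    """Derive both candidate fields from a record's publisher strings.

    `first_publisher` is a single value (the first non-empty entry);
    `any_publisher` keeps every entry, so it would be a multi-valued field.
    Trailing cataloguing punctuation is stripped as a simplifying assumption.
    """
    cleaned = [p.strip(" ,:;") for p in publishers if p and p.strip(" ,:;")]
    return {
        "first_publisher": cleaned[0] if cleaned else None,
        "any_publisher": cleaned,
    }


fields = publisher_fields(["Harper & Brothers,", " Sampson Low ;"])
```

The trade-off is that `first_publisher` stays a simple one-to-one column, while `any_publisher` needs a one-to-many field but does not drop co-publishers.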

organisciak commented 8 years ago

Yes, definitely. I was intending to deprecate the old HTMetadata. Feeding from a single source is sensible.

Do you have an example of which fields you hope to index? Is this example still current: https://github.com/Bookworm-project/Bookworm-MARC/issues/5? @tcole3 intended to put together metadata for BW following from the JSON files we crunched for the new full collection EF, but I think going completely with your codebase would be sensible. Tim was going to match up the proper date fields (for serial vs. non-serial), but I believe you've already done that?

@tcole3 might also have a thought about first_publisher vs any_publisher.

organisciak commented 8 years ago

I don't think publisher is a hugely enlightening snippet of information, so first_publisher seems sufficient, for what it's worth.

bmschmidt commented 8 years ago

Tim may have better ideas than I do on exactly which date fields are best. My strategy has been that (I think) the special Hathi field (974) is best, but I honestly don't remember how I prioritize after that; whether the MARC publication field (260c, maybe?) or the first date in field 008 wins when they conflict is completely unclear to me.

If there is some strategy that varies with serial/nonserial for what field to look at, that would be great. My impression was that any date in field 974 tends to be better than the record-level information, since, as I found, so many serials seem to be listed as monographs and vice versa. http://rpubs.com/benmschmidt/189321

That link is pretty close to accurate, but I believe there may be some unpushed changes in the codebase. I will take a look after my class tomorrow.
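The fallback described above can be sketched as a simple priority list. This is a minimal illustration, not the Bookworm-MARC code: it assumes the raw date strings have already been pulled out of the record into a dict keyed by field label, and the ordering of 008 ahead of 260c is an arbitrary placeholder, since that ordering is exactly what is unresolved here. The names `pick_date` and `first_year` are hypothetical.

```python
import re

# A plausible four-digit year (1000-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def first_year(text):
    """Return the first plausible year in `text`, or None if there is none."""
    if not text:
        return None
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


def pick_date(candidates, priority=("974", "008", "260c")):
    """Return (year, source_field) from the highest-priority usable field.

    `candidates` maps a field label to its raw date string. The default
    priority puts the item-level 974 date first; the order of the other
    two is an assumption and can be overridden per call.
    """
    for field in priority:
        year = first_year(candidates.get(field))
        if year is not None:
            return year, field
    return None, None


# Item-level date present: it wins over the record-level dates.
item_level = pick_date({"974": "1887", "008": "1850", "260c": "c1850."})
# No 974 date and an unusable 008 date ("18uu"): falls through to 260c.
fallback = pick_date({"008": "18uu", "260c": "[1902?]"})
```

Returning the source field alongside the year makes it possible to audit later how often each field was actually used, and a serial-specific strategy could be expressed by passing a different `priority` tuple.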

bmschmidt commented 8 years ago

Publisher is one of those things that certain book historians might care deeply about. But it probably requires extensive standardization to be useful; I started but did not complete that work.

organisciak commented 8 years ago

My limited understanding is that series need the enumeration/chronology information to get the correct date rather than the first-published date, but that field can be incorrect for republished books. Again, deferring to the experts.