Bookworm-project / Bookworm-MARC

Parsing MARC records for Bookworm ingest
MIT License

What's going on in the Hathi-specific MARC field 974? #1

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

There's lots of good information in the Hathi-specific MARC field 974, but I don't totally understand it.

  1. 974b and 974c both seem to be libraries. They're often, but not always the same. What's the difference?
  2. Is there a lookup dictionary anywhere that expands the codes to a full string? (E.g., HVD -> "Harvard University Libraries", or whatever it is?)
  3. What role does 974t play compared to the record title field? Is it safe to smash them together?
  4. What role does 974y play compared to the record publication date field? Is it safe to always overwrite the record date (which may be a serial beginning date) with 974y (which generally gives the year for the specific serial volume)?
  5. If there are two 974 fields for a single bibliographic record, but they do not have different title identifiers, what are the chances that they are actually the same volume?

jjett commented 8 years ago

Hi Ben,

The 974 field is not a standard MARC 21 field (i.e., not something supported by OCLC). This seems like it might be metadata defined by HathiTrust or possibly by the digital object's owning library. I'll be examining some HT MARC later this week and early next week. I should be able to provide some more definitive answers to you then.

Regarding organization codes: There are many sources. The following website lists many of them: https://www.loc.gov/marc/holdings/echdorg.html

jjett commented 8 years ago

An update on this issue. 974 has been scrutinized in the past. Our metadata librarian here at Illinois provided the following information:

"The 974 field is added by Zephir, the CDL system that processes HathiTrust metadata. In particular, subfields b, c, and s are added by Zephir and not used in the HT catalog. For subfields u, r, and d: Subfield d is the date of the last rights change. Subfield r is a copyright code. You can find a list of copyright codes used in HathiTrust here: https://www.hathitrust.org/rights_database

Subfield u works as the HathiTrust item id, which is composed of the namespace assigned to it and the identifier provided by the institution. There's also a subfield z which may occur in the 974 field; that is where the enum/chron is stored."

With regard to subfield t, I wouldn't mash the titles together -- they are different accounts of what the title is. I haven't verified it yet, but the relationship between 974y and 008 may be the same.

The important factor is that everything in the 9XX fields is from a different person than the rest of the record. So you'll need to note it as a source of data bias if you use them. You might be able to do some great comparative analyses on the assertions being made in 974 and the rest of the record, but I wouldn't mix them together.

bmschmidt commented 8 years ago

OK, that's very helpful.

It would be helpful to know what Zephir intends by the b and c subfields; originating library is an important fact to have in the Bookworm. I suppose we could also parse it out of subfield u.
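For reference, parsing the namespace out of subfield u could look something like the sketch below. It assumes the item id has the form namespace.identifier, with the namespace being everything before the first dot; the function name and the example ids are made up for illustration.

```python
from typing import Tuple


def split_htid(htid: str) -> Tuple[str, str]:
    """Split a HathiTrust item id (974u) into (namespace, local identifier)."""
    # Assumption: the namespace ends at the FIRST dot. The identifier supplied
    # by the institution may itself contain dots, so we split only once.
    namespace, sep, local_id = htid.partition(".")
    if not sep or not namespace or not local_id:
        raise ValueError(f"not a namespace.identifier id: {htid!r}")
    return namespace, local_id


# Hypothetical ids, for illustration only.
print(split_htid("mdp.39015000000001"))
print(split_htid("uc2.ark:/13960/t0000x"))
```

The namespace recovered this way still needs to be mapped to an institution name before it is useful as an "originating library" field.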

974y is not always the same as 008; it seems to differ in maybe 5% of cases. See, for example, the xml for the "Congressional Record". That's a serial publication with a single Hathi record. The record stores a hundred or so items; each item has its own 974 field, and the 974y field appears there to give the publication date for the volume rather than the series.

If that is the universal practice, it will be extremely useful for Bookworm; serial misdating is a major issue right now, so any solution to that is (I think--perhaps this needs to be discussed) worth the risks of using hybrid data sources for the records. The online Hathi catalog uses that field for display, so someone in Michigan must have an answer to this.
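A rough sketch of what that per-item dating might look like, using only the standard library on a MARCXML record. It assumes 974y holds a four-digit year when present and falls back to Date 1 of the 008 (character positions 07-10) otherwise. The sample record, the ids, and the function name are invented for illustration, and the source of each year is returned so the hybrid provenance stays visible.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up, heavily trimmed serial record: one 008 plus two 974 item fields.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="008">750727c18739999dcu</controlfield>
  <datafield tag="974" ind1=" " ind2=" ">
    <subfield code="u">mdp.39015000000001</subfield>
    <subfield code="y">1921</subfield>
  </datafield>
  <datafield tag="974" ind1=" " ind2=" ">
    <subfield code="u">mdp.39015000000002</subfield>
  </datafield>
</record>"""


def item_years(record_xml):
    """Yield (item id, year, source) for every 974 field in one record."""
    record = ET.fromstring(record_xml)
    field_008 = record.findtext(
        "marc:controlfield[@tag='008']", default="", namespaces=MARC_NS
    )
    # Date 1 of the 008 occupies character positions 07-10.
    date1 = field_008[7:11]
    record_year = int(date1) if date1.isdigit() else None

    for field in record.findall("marc:datafield[@tag='974']", MARC_NS):
        subs = {
            sf.get("code"): (sf.text or "").strip()
            for sf in field.findall("marc:subfield", MARC_NS)
        }
        item_year = subs.get("y", "")
        if item_year.isdigit():
            # Prefer the volume-level year when 974y is usable.
            yield subs.get("u"), int(item_year), "974y"
        else:
            # Otherwise fall back to the record-level date.
            yield subs.get("u"), record_year, "008"


for htid, year, source in item_years(SAMPLE):
    print(htid, year, source)
```

Keeping the source label alongside the year makes it possible to filter or compare the two kinds of dates later rather than silently mixing them.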

I had not seen 974z. It seems to be populated possibly a little less often than 974y and 974t but to contain some of the same information. Hmm.

jjett commented 8 years ago

The following mapping document for organizations may also be helpful:

https://docs.google.com/document/d/1ILpVfk5y3auLpBicflpGIJbf9bnNZ9qeJwkW48CZ1Xo/edit

Looking over 974z, it seems roughly equivalent to the strings that appear in the "enum" attribute of the json (in the parts beyond the marc-xml blob). I'll see if I can contact someone at the California Digital Library on Monday. Zephir seems to be their brainchild. It is odd that 974 doesn't appear in the HathiTrust's documentation but a different field, 955, does. It seems possible that these two fields are doing something similar.

jjett commented 8 years ago

Regarding 974 b and c -- this seems to indicate the source institution. For instance, UC--UCLA refers to the University of California, Los Angeles whereas UC--NRLF refers to the Northern Regional Library Facility, which is a part of the University of California, Berkeley Libraries.

bmschmidt commented 8 years ago

OK, thanks so much for looking into this. It's great to know where this is coming from.

bmschmidt commented 8 years ago

After today's conference call I wanted to bump this for @jjett:

Here are the remaining questions I have about 974, restated from above with new information.

  1. Is there a published definition of the 974 field anywhere?
  2. 974b and 974c both seem to be libraries. They're often, but not always the same. What's the difference? If we wanted a single field called 'originating library' in the Bookworm, should we use 974b, 974c, or parse the namespace from the record identifier in 974u?
  3. What role does 974t play compared to the record title field? (It appears to contain subrecord information like "vol. 1" or "1894".) For displaying titles to users in cases when they may be seeing many results from the same record, can we simply separate the record title from 974t with a double dash or something?
  4. What role does 974y play compared to the record publication date field? Is it safe to overwrite the record date (which may be a serial beginning date) with 974y (which generally gives the year for the specific serial volume)? Our goal is to identify the year of publication for each item, and not have the record start year propagate through to each journal entry.
  5. If there are two 974 fields for a single bibliographic record, but they do not have different title identifiers, what are the chances that they are actually the same volume scanned from two different sources?

jjett commented 8 years ago

Have cc'd you and Peter on an email with your questions to Jonathan Rothman (@U_Mich) who is on the Zephir team that oversees the HTDL's MARC metadata.

jjett commented 8 years ago

Copying Bill's answers here for the project's records.

Hey Jacob (et al.) The short answers are:

  1. There isn’t a published definition of the 974; we treat it as an internal implementation detail, so we never bothered.
  2. The HT has gone through a few different ways of denoting the originating institution. You’re going to want to use the ‡c, which we call the collection code, and map it to the institution. I’ve included the current mapping below (which is many-to-one, since some places have sent multiple collections they want to keep administratively distinct).
  3. Do you mean the 974‡z? That contains the enumeration/chronology and is not at all controlled. For serials or multi-volume sets, this is how the contributing institution coded the individual volume information. It might include specific issue information (“vol. 1, no.3”), just the year of issues that were bound together (“1988”, or “1994”), or really anything that made sense at the time. Things like “5:no.1-5:no.11 1984:Oct.-1985:Aug.” and “V 11-13,14b/d no 11ab - 14 Jul 93 + abs 1992/93 c-f not e index” are not as uncommon as you’d hope.
  4. The publication date for the catalog record is our best guess of the date the thing described at the record level was published, while the ‡y is our best guess at when a particular volume was published. For a book they should be the same, but for a journal the bib record date will be the date of first publication, while the date in the ‡y is the year pulled out of the enum/chron (where possible). For example, the catalog date for Scientific American is 1845, but the individual volume dates (as seen in the 974‡y in the MARC record) go up to 1987.
  5. All we can say about that situation is that we got two volumes with the same OCLC number in their metadata. Sometimes they’re duplicate scans; other times it’s just shoddy or incomplete cataloging.

I’d love it if you’d drop me a note and let me know what you’re up to with these data!

-Bill-

HathiTrust Collection Code mapping

"mdp" => "University of Michigan",
"miua" => "University of Michigan",
"miun" => "University of Michigan",
"wu" => "University of Wisconsin",
"inu" => "Indiana University",
"uc1" => "University of California",
"uc2" => "University of California",
"pst" => "Penn State University",
"umn" => "University of Minnesota",
"nnc1" => "Columbia University",
"nnc2" => "Columbia University",
"nyp" => "New York Public Library",
"uiuo" => "University of Illinois",
"njp" => "Princeton University",
"yale" => "Yale University",
"chi" => "University of Chicago",
"coo" => "Cornell University",
"ucm" => "Universidad Complutense de Madrid",
"loc" => "Library of Congress",
"ien" => "Northwestern University",
"hvd" => "Harvard University",
"uva" => "University of Virginia",
"dul1" => "Duke University",
"ncs1" => "North Carolina State University",
"nc01" => "University of North Carolina",
"pur1" => "Purdue University",
"pur2" => "Purdue University",
"mdl" => "Minnesota Digital Library",
"usu" => "Utah State University Press",
"gri" => "Getty Research Institute",
"uiug" => "University of Illinois",
"psia" => "Penn State University",
"bc" => "Boston College",
"ufl1" => "University of Florida",
"ufl2" => "University of Florida",
"txa" => "Texas A&M University",
"keio" => "Keio University",
"osu" => "The Ohio State University",
"uma" => "University of Massachusetts",
"udel" => "University of Delaware",
"caia" => "Clark Art Institute Library"
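
For the Bookworm side, a minimal sketch of turning a 974‡c value into an institution name might look like this. Only a few entries from the mapping above are included, and the names COLLECTION_TO_INSTITUTION and originating_institution are made up here. Lowercasing the code before lookup is an assumption, since the thread does not say what case the ‡c values actually use.

```python
# Excerpt of the collection-code mapping above; extend with the full list.
COLLECTION_TO_INSTITUTION = {
    "mdp": "University of Michigan",
    "miua": "University of Michigan",
    "miun": "University of Michigan",
    "uc1": "University of California",
    "uc2": "University of California",
    "hvd": "Harvard University",
    "uiuo": "University of Illinois",
    "uiug": "University of Illinois",
}


def originating_institution(collection_code, default=None):
    """Map a 974c collection code to an institution name (many-to-one)."""
    # Assumption: codes may arrive in any case or with stray whitespace,
    # so normalize before looking them up.
    key = collection_code.strip().lower()
    return COLLECTION_TO_INSTITUTION.get(key, default)


print(originating_institution("HVD"))
print(originating_institution("miun"))
print(originating_institution("zzz", default="Unknown"))
```

Returning a default for unknown codes keeps ingest from failing when a new collection code shows up that the mapping does not cover yet.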

billdueber commented 8 years ago

Following up: if you folks just throw an @billdueber in any issues where you have questions I might be able to answer, I'll get the ping and see what I can do. Don't guess when you can just ask :-)