Bookworm-project / Bookworm-MARC

Parsing MARC records for Bookworm ingest
MIT License

Field `008` corrupted in Hathi-DPLA dumps #2

Closed bmschmidt closed 8 years ago

bmschmidt commented 8 years ago

According to the LOC, the 3-character publication place information in field 008 should be in bytes 15 to 17. But in many of the records from DPLA, it's instead in bytes 12-14. Why?

https://github.com/Bookworm-project/Bookworm-MARC/blob/master/bookwormMARC/bookwormMARC.py#L59-L72
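For reference, the positional lookup described above can be sketched as follows. This is a minimal illustration and not the code at the link: the function name `parse_008` is hypothetical, and the offsets are the 0-indexed character positions from the LOC layout for 008 (place of publication at 15-17, language at 35-37). The sample value is built with explicit space counts so that the padding cannot be lost in transit.

```python
def parse_008(field):
    """Slice the place and language codes out of an 008 field by position.

    Assumes `field` is an intact, 40-character 008 value.
    """
    return {
        "place": field[15:18].strip(),     # positions 15-17
        "language": field[35:38].strip(),  # positions 35-37
    }


# An intact 008 value, with the blank padding written out explicitly.
intact = "000502s1908" + " " * 4 + "enkae" + " " * 2 + "|o|||| 001 0 eng d"

print(parse_008(intact))  # {'place': 'enk', 'language': 'eng'}
```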

bmschmidt commented 8 years ago

OK, I have figured this out.

It looks to me like in the dumps they send DPLA, someone is replacing multispace sequences with a single space. So, for example, the original HathiTrust record number 100308172 has the following field 008:

000502s1908    enkae  |o|||| 001 0 eng d

But in the dumps, it looks like this:

000502s1908 enkae |o|||| 001 0 eng d

Since 008 is a fixed-width field, this is bad. Some piece of code, either at Hathi or inside DPLA, is over-aggressively cleaning whitespace. Without the original spaces, publication place and language data can't be reliably extracted by position.
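The effect can be reproduced with a short sketch. The actual cleaning step is unknown; the regular expression below is only an assumption that happens to produce the same output as the dump for this record.

```python
import re

# Intact value for HathiTrust record 100308172, padding written out explicitly.
original = "000502s1908" + " " * 4 + "enkae" + " " * 2 + "|o|||| 001 0 eng d"

# Assumed cleaning step: collapse every run of two or more spaces to one.
collapsed = re.sub(r" {2,}", " ", original)

print(len(original), len(collapsed))  # 40 36
print(repr(original[15:18]))          # 'enk'  (place, where the spec puts it)
print(repr(collapsed[15:18]))         # 'ae '  (wrong bytes after the collapse)
print(repr(collapsed[12:15]))         # 'enk'  (place has shifted left by 3)
print(repr(collapsed[35:38]))         # 'd'    (language slice now runs off the end)
```

Four bytes are lost in total (three from the blank date field and one after `ae`), so every position after the first collapsed run is shifted by a different amount depending on the record.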

bmschmidt commented 8 years ago

I've written DPLA about this to see if they know why it's happening. @jjett or @organisciak, do you have a contact at Hathi we could ask about this?

jjett commented 8 years ago

Actually, since 008 is a fixed-width field, couldn't you just use a regular expression to cherry-pick the publication place and language?

bmschmidt commented 8 years ago

Yeah, we should be able to extract the bytes by position, but the issue is that some bytes are being thrown away unpredictably, so we can't be sure the MARC positions still line up.

We could use the internal Hathi records, which aren't mangled like the DPLA versions, but there are access issues.
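Short of getting clean records, affected fields can at least be flagged before any positional extraction. The sketch below relies on the fact that an intact 008 is 40 characters long (positions 00-39 in the LOC layout); `looks_intact_008` is a hypothetical helper name, not something in this repository. A passing check does not prove the field is valid, and a failing one does not say which bytes were dropped.

```python
EXPECTED_008_LENGTH = 40  # positions 00-39 in the LOC layout


def looks_intact_008(field):
    """Return True if the 008 value still has its full fixed width."""
    return len(field) == EXPECTED_008_LENGTH


intact = "000502s1908" + " " * 4 + "enkae" + " " * 2 + "|o|||| 001 0 eng d"
from_dump = "000502s1908 enkae |o|||| 001 0 eng d"

print(looks_intact_008(intact))     # True
print(looks_intact_008(from_dump))  # False
```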

bmschmidt commented 8 years ago

Mark Matienzo writes back from DPLA that it's their fault, but that they're scaling back from offering full metadata anyway. I have found the original XML files and will work with those instead.