Bookworm-project / Bookworm-MARC

Parsing MARC records for Bookworm ingest
MIT License

Non-MARC metadata about scans #8

Closed bmschmidt closed 7 years ago

bmschmidt commented 7 years ago

OK, while paging through some of my OCR outlier scripts, I just noticed something interesting: there is useful language metadata about scans that is not included in the Hathi MARC fields.

Randomly tagging @billdueber and @organisciak, because this is one of those questions that I have no idea who in Hathi knows the answer to.

Google OCR occasionally seems to mistakenly read things as being in Arabic because, I guess, it's a squiggly language in wide use. Most items on this tsv, for example, are English-language texts that, if you click through to the OCR on Hathi, are clearly mistakenly being read in the OCR as Arabic.

What's interesting to me is that Hathi seems to know this. If you click on the following English-language books (and your Hathi browser is set to display 2-up), you'll see that they start on the last (i.e., rightmost) page, the one that would be appropriate if they were written in Arabic. That is, the language metadata has them as being in English, but something else has tagged them as Arabic (or at least, generically, as being in an RTL script). Examples; I suspect I can supply several dozen more if needed:

So what's going on that makes these display RTL? Is it the sequencing information? (These books resolve to URLs like https://babel.hathitrust.org/cgi/pt?id=hvd.32044107205387;view=2up;seq=1.) That would indicate that Hathi treats these books as Arabic, despite the English language metadata, either because

  1. Google sent them to the libraries with sequencing data that regarded them as Arabic, based on their OCR, or
  2. Hathi itself determines sequence order by looking at the character set to tell it what direction to go in.

Either one of those would be useful information to have in the Bookworm--at a minimum, it would be nice to be able to drop books that are actually in English but whose OCR is just nonsense in the Arabic alphabet. To have a Google determination of language to compare against the MARC determination (if it is language, and not character set, that's operative here) would be extremely useful.
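As a rough illustration of the "drop books that are actually in English but whose OCR is nonsense in the Arabic alphabet" idea, here is a minimal Python sketch. It is not code from Bookworm-MARC or Hathi: the function names, the 0.5 threshold, and the choice of Unicode ranges are all assumptions made for the example, and it presumes you already have the MARC language code (e.g. `eng`) and the OCR text for a volume in hand.

```python
# Hypothetical helper, not part of Bookworm-MARC.
# Code point ranges for the main Arabic-script Unicode blocks
# (Arabic, Arabic Supplement, Arabic Presentation Forms-A and -B).
ARABIC_RANGES = (
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFF),
)


def is_arabic_char(ch):
    """Return True if the character falls in one of the Arabic-script blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def arabic_fraction(text):
    """Fraction of alphabetic characters in `text` that are Arabic-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_char(ch) for ch in letters) / len(letters)


def looks_misread_as_arabic(marc_language, ocr_text, threshold=0.5):
    """Flag volumes cataloged as English whose OCR is mostly Arabic script.

    `threshold` is an arbitrary cutoff chosen for illustration only.
    """
    return marc_language == "eng" and arabic_fraction(ocr_text) >= threshold
```

Something like `looks_misread_as_arabic("eng", page_text)` could then be used to filter volumes before ingest. A script-level check like this only catches the MARC/OCR mismatch; it says nothing about where Hathi stores the display direction.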

But I'm also very curious where this data is being stored, because it definitely indicates some useful information about the scans that is not in the MARC records about the items. Maybe that data is limited to just one bit (whether to display a book online as LTR or RTL), but might it also contain some other information about character set, OCR provenance, or the like?

organisciak commented 7 years ago

That's very odd. The guess around here is that it is in the METS data, but I'll forward this to Tom Burton-West, who should know.

bmschmidt commented 7 years ago

Closing. I believe, for random future googlers landing here, that the answer was METS.