Bookworm-project / Bookworm-MARC

Parsing MARC records for Bookworm ingest
MIT License

Missing Volumes in the Bookworm? #6

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

Integrating the MARC records with the Bookworm, I've noticed that there seem to be just under a million books that have MARC records but don't exist in the Bookworm (i.e., the Bookworm has about 4.7 million volumes, while there are about 5.5 million volumes in the MARC records).

They are not evenly distributed. The losses include, most notably, every single Internet Archive-scanned book. Where have they gone? Maybe the entire open-open corpus is missing?

Here's a list of the scanners by number of volumes in the MARC files (from field 974$s):

bschmidt@sibelius:/raid/hathipd$ jq '.scanner' jsoncatalognew.txt | sort | uniq -c
    156 "bc"
   1169 "borndigital"
      2 "brooklynmuseum"
    292 "clark"
     89 "clements-umich"
      3 "cornell"
  68344 "cornell-ms"
   4772 "getty"
    977 "geu"
4880745 "google"
 483568 "ia"
  54847 "lit-dlps-dc"
   1062 "mcgill"
  10717 "mdl"
  10501 "mhs"
     11 "mou"
    191 "nnc"
    374 "northwestern"
      1 "private"
   1109 "tamu"
    346 "ucm"
     68 "udel"
   4192 "uiuc"
     57 "umd"
      7 "umn"
    875 "ump"
     17 "wau"
  22948 "yale"
    420 "yale2"

Here, on the other hand, are the sources inside the Bookworm (i.e., the MARC records that also exist inside the Bookworm).

Every IA-scanned book is gone; 68,000 cornell-ms books are gone; and about 700,000 Google-scanned books are missing.

bschmidt@sibelius:/raid/hathipd$ mysql -e "SELECT scanner,COUNT(*) from contributing_library_serial_killer_guess GROUP BY scanner" hathipd
+----------------+----------+
| scanner        | COUNT(*) |
+----------------+----------+
| clements-umich |       32 |
| cornell        |        2 |
| google         |  4102853 |
| lit-dlps-dc    |    46365 |
| northwestern   |        1 |
| ucm            |       18 |
| yale           |    22947 |
+----------------+----------+
bmschmidt commented 8 years ago

It looks like this may be an upstream problem with the feature counts. There are only 4.8m volumes there. I spot-checked several at random (not enough to be confident, though) and all were Google-scanned. @organisciak or someone else: are the features supposed to exclude IA-scanned books? Do they? Can we get them into the Bookworm?

organisciak commented 8 years ago

The EF files didn't exclude anything; it's just that the PD collection has grown since we crunched EF version 0.2 in February 2015. We're currently working on non-PD data; we'll update the PD Extracted Features later.

bmschmidt commented 8 years ago

My bad. It turns out this had to do with volume ids: the IA-scanned books are also the ones whose volume ids contain colons and slashes, and for whatever reason those characters are replaced with + and = in the volume identifiers in the Bookworm database. So the linkage was not happening on my end.

Oops. Should have listened to myself when I said I didn't check enough to be confident.
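For the record, a minimal sketch of the mismatch described above. This assumes the standard HathiTrust "clean" id convention (':' mapped to '+' and '/' mapped to '=' so ids are filename-safe); the function names and the join logic are illustrative, not code from this repo.

```python
# Sketch of the id mismatch: IA-scanned HathiTrust volumes use ARK-style
# ids containing ':' and '/', which the Bookworm database stores in the
# filename-safe "clean" form ('+' and '='). Google-scanned ids contain
# neither character, so only the IA volumes fail a naive string join.

def clean_htid(htid: str) -> str:
    """Convert a raw HathiTrust volume id to its 'clean' form."""
    return htid.replace(":", "+").replace("/", "=")

def raw_htid(clean: str) -> str:
    """Invert clean_htid."""
    return clean.replace("+", ":").replace("=", "/")

# An ARK-style (IA-scanned) id changes form; a Google-scanned id does not:
print(clean_htid("loc.ark:/13960/t4bp1rm7b"))  # loc.ark+=13960=t4bp1rm7b
print(clean_htid("uc1.b000nvmsx4"))            # uc1.b000nvmsx4
```

Normalizing both sides through one of these functions before joining the MARC-derived JSON against the Bookworm tables makes the IA volumes link up.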