Bookworm-project / Bookworm-MARC

Parsing MARC records for Bookworm ingest
MIT License

Status check #9

Open bmschmidt opened 7 years ago

bmschmidt commented 7 years ago

I want to do a status check. After the penultimate conference call, I think I was supposed to check in with @organisciak on the status of this.

I've built out the jsoncatalog.txt for the bookworm; it can be found as the file "jsoncatalog_hathi.txt.gz" hosted on my personal web domain. (I'm not posting the URL because the file is huge and I don't want robots to try to download it.)

There are still a few issues. The "contributing library" is sometimes a code, because it's looked up in a sub-optimal place.

There's not a full field_descriptions.txt, because it's not yet determined exactly which fields we want in the bookworm. (I plan to do some visualizations with the MARC field item, for instance; but I doubt anyone else cares about that.)

But I think we need to get the code into a state where the metadata can be regenerated anyway (which mostly works at present), so it should be safe to build the bookworm with some subset of the data here.

Here are some random entries from the top of the first million or so items in the file; they are all from the non-PD portion of the collection, where we've looked less in the past.

{"cataloging_source":" ","scanner":"google","lc0":"J","lc1":"J","date":1956,"item_date":1956,"rights_changed_date":"2014-05-22","lc2":"500","literary_form":"Unknown","serial_killer_guess":"book","title":"Lok Sabha debates.--- 1956 pt.2 v.6:9-15","filename":"uc1.b3890417","contributing_library":"nrlf","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=uc1.b3890417><em>Lok Sabha debates.--- 1956 pt.2 v.6:9-15</em> (1956)","target_audience":"Unknown or not specified","cntry":"ii ","first_place":"New Delhi,","lc_class_from_lc":true,"first_publisher":"Lok Sabha Secretariat.","permalink":"https://babel.hathitrust.org/cgi/pt?id=uc1.b3890417","language":"eng","government_document":"f","subject_places":["a-ii---"],"record_date":null,"marc_record_created":"1984-02-11","resource_type":"serial"}
{"literary_form":"Unknown","contributing_library":"nrlf","permalink":"https://babel.hathitrust.org/cgi/pt?id=uc1.b5175500","serial_killer_guess":"book","cataloging_source":"d","scanner":"google","language":"rus","title":"Novoe i zabytoe /--- v.1","government_document":" ","target_audience":"Unknown or not specified","filename":"uc1.b5175500","cntry":"ru ","rights_changed_date":"2013-08-03","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=uc1.b5175500><em>Novoe i zabytoe /--- v.1</em> (1966)","first_place":"Moskva :","date":1966,"first_publisher":"Nauka,","marc_record_created":"1984-02-04","resource_type":"serial","record_date":1966}
{"lc2":"1174","scanner":"lit-dlps-dc","lc0":"H","lc1":"HG","date":1928,"item_date":1928,"rights_changed_date":"2013-08-19","cataloging_source":"d","literary_form":"Not fiction","serial_killer_guess":"book","title":"Valuta i valutna politika; nauchna anketa za prichiniti͡e na stopanskata kriza v Bŭlgarii͡a.","filename":"mdp.39015057135074","first_author_name":"Toshev, Gospodin P.","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015057135074><em>Valuta i valutna politika; nauchna anketa za prichiniti͡e na stopanskata kriza v Bŭlgarii͡a.</em> (1928)","target_audience":"Unknown or not specified","cntry":"bu ","first_place":"Sofii͡a,","lc_class_from_lc":true,"first_publisher":"Kooperativna pechatnit͡sa \"Franklin\",","permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015057135074","language":"bul","government_document":" ","record_date":1928,"marc_record_created":"1988-07-18","resource_type":"book"}
{"lc2":"2342.2.","scanner":"google","lc0":"L","lc1":"LB","date":1986,"item_date":1986,"rights_changed_date":"2013-11-23","cataloging_source":" ","literary_form":"Not fiction","serial_killer_guess":"book","title":"China : management and finance of higher education.","filename":"mdp.39015038055250","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015038055250><em>China : management and finance of higher education.</em> (1986)","target_audience":"Unknown or not specified","cntry":"dcu","first_place":"Washington, D.C., U.S.A. :","lc_class_from_lc":true,"first_publisher":"World Bank,","permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015038055250","language":"eng","government_document":"i","subject_places":["a-cc---"],"record_date":1986,"marc_record_created":"1988-07-18","resource_type":"book"}
{"lc2":"1","scanner":"google","lc0":"G","lc1":"GN","date":1996,"item_date":1996,"rights_changed_date":"2013-10-17","cataloging_source":" ","literary_form":"Unknown","serial_killer_guess":"serial","title":"Bulletin of the National Science Museum.--- v.22 1996","filename":"mdp.39015073103726","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015073103726><em>Bulletin of the National Science Museum.--- v.22 1996</em> (1996)","target_audience":"Unknown or not specified","cntry":"ja ","first_place":"Tokyo,","lc_class_from_lc":true,"first_publisher":"National Science Museum.","permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015073103726","language":"eng","government_document":"f","record_date":null,"marc_record_created":"1988-07-18","resource_type":"serial"}
{"cataloging_source":" ","scanner":"google","lc0":"D","lc1":"D","date":1982,"item_date":1982,"rights_changed_date":"2015-04-03","lc2":"1","literary_form":"Unknown","serial_killer_guess":"serial","title":"The Historian : a journal of history.--- v.44 1981/1982","filename":"mdp.39015068987661","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015068987661><em>The Historian : a journal of history.--- v.44 1981/1982</em> (1982)","target_audience":"Unknown or not specified","cntry":"riu","first_place":"[Kingston, R.I., etc.] :","lc_class_from_lc":true,"first_publisher":"Phi Alpha Theta,","permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015068987661","language":"eng","government_document":" ","record_date":1938,"marc_record_created":"1988-07-18","resource_type":"serial"}
{"cataloging_source":" ","scanner":"google","lc0":"H","lc1":"HC","date":1969,"item_date":1969,"rights_changed_date":"2013-08-08","lc2":"10","literary_form":"Unknown","serial_killer_guess":"book","title":"Mirovai͡a ėkonomika i mezhdunarodnye otnoshenii͡a.--- 1969:7-12","filename":"uc1.b3230826","contributing_library":"nrlf","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=uc1.b3230826><em>Mirovai͡a ėkonomika i mezhdunarodnye otnoshenii͡a.--- 1969:7-12</em> (1969)","target_audience":"Unknown or not specified","cntry":"ru ","first_place":"Moskva :","lc_class_from_lc":true,"first_publisher":"Pravda.","permalink":"https://babel.hathitrust.org/cgi/pt?id=uc1.b3230826","language":"rus","government_document":"o","record_date":null,"marc_record_created":"1988-07-18","resource_type":"serial"}
{"cataloging_source":"d","scanner":"google","first_publisher":"Deutsche Verlags-Anstalt","item_date":2005,"rights_changed_date":"2013-08-09","literary_form":"Unknown","serial_killer_guess":"book","title":"Osteuropa--- v.55:8 2005","filename":"uc1.32106020346950","contributing_library":"ucsc","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=uc1.32106020346950><em>Osteuropa--- v.55:8 2005</em> (2005)","target_audience":"Unknown or not specified","cntry":"gw ","first_place":"Stuttgart :","date":2005,"permalink":"https://babel.hathitrust.org/cgi/pt?id=uc1.32106020346950","language":"ger","government_document":" ","subject_places":["ee-----"],"record_date":null,"marc_record_created":"1975-09-01","resource_type":"serial"}
{"cataloging_source":"d","scanner":"google","first_publisher":"Badan Usaha Jaya Press Jajasan Jaya Raya],","item_date":1988,"rights_changed_date":"2015-09-03","literary_form":"Unknown","serial_killer_guess":"book","title":"Tempo.--- 1988 Index","filename":"mdp.39015066449201","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015066449201><em>Tempo.--- 1988 Index</em> (1988)","target_audience":"Unknown or not specified","cntry":"io ","first_place":"[Djakarta,","date":1988,"permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015066449201","language":"ind","government_document":" ","subject_places":["a-io---"],"record_date":1971,"marc_record_created":"1988-07-18","resource_type":"serial"}
{"cataloging_source":"d","scanner":"google","date":1974,"item_date":1974,"rights_changed_date":"2013-08-04","literary_form":"Not fiction","serial_killer_guess":"book","title":"Solar energy / c[Vlastimir A. Stevovich, Informatics, Inc.] ; csponsored by Advanced Research Project Agency.","filename":"mdp.39015002048653","first_author_name":"Stevovich, Vlastimir A.","contributing_library":"University of Michigan","searchstring":"<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015002048653><em>Solar energy / c[Vlastimir A. Stevovich, Informatics, Inc.] ; csponsored by Advanced Research Project Agency.</em> (1974)","target_audience":"Unknown or not specified","cntry":"vau","first_place":"Rockville, Md. :","first_publisher":"Informatics Inc.,","permalink":"https://babel.hathitrust.org/cgi/pt?id=mdp.39015002048653","language":"eng","government_document":" ","record_date":1974,"marc_record_created":"1988-07-18","resource_type":"book"}

organisciak commented 7 years ago

Could you summarize all the return values for contributing_library? E.g., cat test.json | jq '.contributing_library' | sort | uniq. I know NRLF is "University of California Northern Regional Library Facility"; Eleanor should have info for the others.
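
For anyone without jq handy, the same summary can be sketched in Python. This is only an illustration, assuming the catalog is line-delimited JSON as in the sample records above; the function names tally_field and tally_file are made up for this example and are not part of the Bookworm-MARC code. Unlike plain sort | uniq, it also reports how often each value occurs.

```python
import gzip
import json
from collections import Counter


def tally_field(lines, field="contributing_library"):
    """Count how often each value of `field` occurs in line-delimited JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Records that lack the field are counted under None.
        counts[record.get(field)] += 1
    return counts


def tally_file(path, field="contributing_library"):
    """Tally a catalog file, reading through gzip when the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return tally_field(handle, field)


if __name__ == "__main__":
    # Abbreviated records based on the sample entries in this thread.
    sample = [
        '{"filename": "uc1.b3890417", "contributing_library": "nrlf"}',
        '{"filename": "mdp.39015057135074", "contributing_library": "University of Michigan"}',
        '{"filename": "uc1.32106020346950", "contributing_library": "ucsc"}',
        '{"filename": "uc1.b3230826", "contributing_library": "nrlf"}',
    ]
    for value, count in tally_field(sample).most_common():
        print(count, value)
```

Calling tally_file on a local copy of the gzipped catalog would give the full list of codes that still need a readable name.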

Also, send me the URL for the imperfect current version that you have hosted, and I'll try to build it with the unigrams.

bmschmidt commented 7 years ago

I've fixed up the contributing libraries more recently by using the first few characters of the identifier instead of the contributing library code. That seems to do it.
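
A rough sketch of what that lookup might look like, not the actual code in this repo: it assumes "the first few characters" means the part of the identifier before the first period, and the table and function names are hypothetical. The "mdp" entry matches the sample records above; the label for "uc1" is a placeholder assumption.

```python
# Hypothetical lookup table keyed on the identifier prefix.
NAMESPACE_TO_LIBRARY = {
    "mdp": "University of Michigan",    # agrees with the sample records above
    "uc1": "University of California",  # assumed label, not confirmed in this thread
}


def library_from_identifier(identifier, fallback=None):
    """Guess the contributing library from the prefix of an identifier.

    The prefix is taken to be everything before the first period, e.g.
    "mdp" in "mdp.39015057135074". Identifiers without a period, or with
    an unknown prefix, return `fallback`.
    """
    namespace, separator, _ = identifier.partition(".")
    if not separator:
        return fallback
    return NAMESPACE_TO_LIBRARY.get(namespace, fallback)


if __name__ == "__main__":
    for ident in ("mdp.39015057135074", "uc1.b3890417", "xyz.123"):
        print(ident, "->", library_from_identifier(ident, fallback="unknown"))
```

Passing the old contributing library code as the fallback would keep the existing value whenever a prefix is missing from the table.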

I will look around for the data. I've noticed a few changes to the bookwormDB repo that need to be made. A few are already on my hosted version and I will push them this afternoon.

I'm also noting here, even though it's the wrong place, a few additional changes that need to happen to make the memory tables work better.

1. SET optimizer_search_depth=0;
2. Increase Memory Table limit by about 50%.
3. Ensure silencing of errors is working.