CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

Recover MARC records for UCSC ETDs #483

Open elopatin-uc3 opened 4 years ago

elopatin-uc3 commented 4 years ago

Summary

Sarah Lindsey, the head of metadata services at the UCSC Library, contacted us about the .mrc MARC record files that were once regularly delivered to them. These have not come through for some time. The library apparently went through a system migration, and she now has time to work with the records we (used to) send. https://cdl.freshdesk.com/a/tickets/79034

It's important to note that in the case of UCSC, we receive records in the form of .unx files directly from ProQuest on a roughly quarterly basis (sometimes more frequently). These are manually uploaded to the ETDs server and processed when the createmarc.py script runs. I've executed these uploads regularly and have noted that the PQ .unx files are deleted after the script runs. If we need to re-process these, I can probably dig up a series of them as email attachments from PQ.

Tasks

elopatin-uc3 commented 4 years ago

The createmarc.py script ran, which resulted in the following errors related to the two .unx files I uploaded to the server:

2020-10-01 13:31:31,493 ERROR: UNX file UC Santa Cruz MARC Q1 2020.unx not converted; missing ERROR     9798662497047
ERROR   9798662496941
ERROR   9798662481428
ERROR   9798662553729
ERROR   9798662481190
ERROR   9798662481695
ERROR   9798662481732

2020-10-01 13:31:35,071 ERROR: UNX file UC Irvine MARC Aug 2020.unx not converted; missing ERROR        9798662428157
ERROR   9798662428133
ERROR   9798662427846
ERROR   9798662427839
ERROR   9798662428775
ERROR   9798662428027

Related code in createmarc.py


# test if all ISBNs are available
test_str = xml_saxon_transform(namespace_xmlstr, constants.TEST_XSLT)
# convert using campus customizations using XSLT
if "ERROR" not in test_str:
    if campuscode is not None:
        campus_stylesheet = os.path.join(app_configs[hostenv]['xsl_dir'],
                                         campus_configs[campuscode]['pqmarcxslt'])
        campus_xml_str = xml_saxon_transform(namespace_xmlstr, campus_stylesheet)
        outfilename = campuscode+time.strftime("%Y%m%d")+'PQ-orig.xml'
        outfullpath = os.path.join(app_configs[hostenv]['marc_dir'],
                                   outfilename)
        campus_xml_file = codecs.open(outfullpath, 'wb')
        campus_xml_file.write(campus_xml_str)
        campus_xml_file.close()
    else:
        logging.error("ERROR: campus code not found %s", marcfilename)
else:
    logging.error("ERROR: UNX file %s not converted; missing %s",
                  marcfilename, test_str)

cpwillett commented 4 years ago

Hi Eric, I still get notifications from this repo, which I mostly ignore. I saw this one, though, and thought I'd provide some context.

This error is generated when an ETD for which we've gotten a MARC record isn't available in ProQuest yet. This means that if you click on the URL in the MARC record they've provided, you'd get an error message. This happens sometimes, although it looks like there are quite a few in this latest group. (The ID in the error message is an ISBN.) Generally these are cleared up quickly; there can be a short lag in their processing. I asked them once to explain their workflow so I could understand this better, but they didn't. If the error persists, you can contact them and they'll pull a chain somewhere and the missing ones show up.

Hope this is helpful. Let me know if you have questions about any of this; I might be able to explain (if I can still remember). Best wishes, Perry
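Since each ID in the error message is an ISBN, the list of affected ETDs can be pulled out of the log before contacting ProQuest. The following is only a rough sketch: the function name `missing_isbns_by_file` and both regular expressions are hypothetical, written against the two log entries quoted earlier in this thread, and they assume every ISBN in the log is a 13-digit value starting with 978 or 979.

```python
import re
from collections import defaultdict

# Assumed shape of the header line, taken from the log entries quoted in this issue.
HEADER = re.compile(r"ERROR: UNX file (?P<name>.+?\.unx) not converted; missing")
# Assumption: ISBNs appear as bare 13-digit numbers with a 978/979 prefix.
ISBN = re.compile(r"\b97[89]\d{10}\b")


def missing_isbns_by_file(log_text):
    """Group the ISBNs reported as missing under the UNX file they belong to."""
    result = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            current = header.group("name")
        # The first ISBN sits on the header line itself; the rest follow on
        # their own "ERROR<whitespace>ISBN" lines.
        if current is not None:
            result[current].extend(ISBN.findall(line))
    return dict(result)


if __name__ == "__main__":
    sample = (
        "2020-10-01 13:31:31,493 ERROR: UNX file UC Santa Cruz MARC Q1 2020.unx "
        "not converted; missing ERROR\t9798662497047\n"
        "ERROR\t9798662496941\n"
    )
    for name, isbns in missing_isbns_by_file(sample).items():
        print(name, isbns)
```

The per-file grouping is meant to make it easy to hand ProQuest one list per delivery if the errors persist.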

elopatin-uc3 commented 4 years ago

@cpwillett Hi Perry, It's good to hear from you. I was referring to the ETD operations doc you'd put together, and along with it, this note provides more context – thanks! I hadn't realized the string in each error message is an ISBN (but should have). I've been in touch with several folks at ProQuest over the past half year or so, so may reach out to one or two for an explanation about the delay you mention. Perhaps they'll have more to share (or not). Either way, I'll re-run these two UNX files next week.

Since we're here, you may be able to confirm another bit of information. The campus.yml settings for UCSC show:

create_marc: False
delivery_marc: True

I assume the create_marc is set to False because we receive records from PQ. But confirmation on this would be great – especially since we receive records from PQ for Merced as well, and settings for Merced differ:

create_marc: True
delivery_marc: True

Thanks again. Hope you're doing well. David and Mark say "hello." Best, Eric

ps. Daniella, Maria and Marisa say hi too (John's out today). Maria notes, "all the non-UC EZID DOI users are finally transferred to DataCite!" And from Mark: Go Cubs! Oops, I mean Go Cardinals! And Brian, Scott and John ("Piscotty was a great trade!") say hello too.

cpwillett commented 4 years ago

Hi Eric, There are two different kinds of MARC records. The first kind is "created" from scratch after the ETD is first published in eScholarship. The second kind is an XSLT transformation of a MARC record "delivered" by ProQuest several weeks after the ETD is received. The terminology is a little confusing--I had to work a little to remember the difference myself. Sorry about that. I also remembered something else about my previous message--here's how this check works. The ISBN is extracted from the PQ MARC record. The test then checks whether that ISBN is in the database (or the XML serialization of it). If not, it generates the error message. It could be that the ETD isn't in the ProQuest database (puzzling and annoying), or it could be that it's there but hasn't been matched by the pqgateway.py script and needs some manual intervention for some reason.
Give my best to everyone. It's autumn in south-central Michigan. Tell Mark I'm watching the Cubs game!
Perry
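The check described above can be mimicked in plain Python. This is a sketch only: in createmarc.py the test runs as an XSLT transform (`constants.TEST_XSLT`), not as Python, and the helper names `find_missing_isbns` and `format_test_output` are hypothetical. The "ERROR, tab, ISBN" output format is inferred from the log lines earlier in this thread.

```python
def find_missing_isbns(record_isbns, known_isbns):
    """Return ISBNs from the PQ MARC records that have no match, in input order."""
    known = set(known_isbns)
    return [isbn for isbn in record_isbns if isbn not in known]


def format_test_output(missing):
    """Build one 'ERROR<tab>ISBN' line per unmatched ISBN (assumed format).

    An empty string means every ISBN was matched.
    """
    return "\n".join("ERROR\t{}".format(isbn) for isbn in missing)


if __name__ == "__main__":
    # Hypothetical inputs: ISBNs read from a delivered file, and ISBNs already
    # present in the database (or its XML serialization).
    from_records = ["9798662497047", "9798662496941"]
    in_database = ["9798662496941"]

    test_str = format_test_output(find_missing_isbns(from_records, in_database))
    # Same gate as in the createmarc.py snippet: convert only if nothing is missing.
    if "ERROR" not in test_str:
        print("all ISBNs matched; file can be converted")
    else:
        print("not converted; missing " + test_str)
```

Note that a single unmatched ISBN blocks conversion of the whole file, which is consistent with neither of the two .unx files being converted.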

elopatin-uc3 commented 4 years ago

Latest update – the .unx files being delivered by ProQuest seem to be at least part of the problem. PQ sent me .mrc files and I am comparing the two. At this point, two abbreviated .mrc files have processed without the errors for the aforementioned ETDs.