Closed by elopatin-uc3 1 year ago
Terry and I spent three hours looking at the title and author cleaning methods in pqgateway.py on stage. We're testing a small number of improvements, but still seeing a significant number of PQ search queries return 0 results. On stage, the latest stats are:
No hits:
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log
593
ETD found:
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log
2194
Null result (search parameter error):
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log
6
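For anyone who'd rather get all three counts in one pass, here is a minimal Python sketch that mirrors the egrep patterns above. The function name `tally_pq_log` and the bucket labels are made up for illustration; nothing here is taken from pqgateway.py, and it only assumes the log lines start the way the egrep patterns expect.

```python
import re

# Patterns mirror the egrep expressions used above.
PATTERNS = {
    "no_hits": re.compile(r"^\['0"),
    "etd_found": re.compile(r"^\['1"),
    "null_result": re.compile(r"^\[\]"),
}


def tally_pq_log(lines):
    """Count log lines per outcome; lines matching no pattern are ignored."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.match(line):
                counts[name] += 1
                break
    return counts


# Usage against the real log (path as in the shell sessions above):
#   with open("temp_pq_log.log", encoding="utf-8") as fh:
#       print(tally_pq_log(fh))
```

Having the three numbers side by side makes it easier to compare runs before and after a cleaning change.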
Improvements attempted:
Debugging:
After adding a space on line 100 (to replace Greek chars with space), worse results:
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log
573
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log
2183
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log
6
Note that the change to lines 95 and 96 accounts for the drop in "0 hits" from 593 to 573.
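To illustrate why the line 100 change (replacing Greek chars with a space rather than removing them) can alter the query string, here is a rough sketch. It is not the actual code in pqgateway.py; the function name `replace_greek`, the Unicode ranges chosen, and the whitespace collapsing are all assumptions.

```python
import re

# Assumed ranges: Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF).
GREEK_CHARS = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")


def replace_greek(text, replacement=" "):
    """Swap runs of Greek characters for `replacement`, then collapse whitespace."""
    cleaned = GREEK_CHARS.sub(replacement, text)
    return re.sub(r"\s+", " ", cleaned).strip()


# Made-up title fragment, for illustration only.
title = "TGFβ1 signaling"
with_space = replace_greek(title, " ")   # Greek char becomes a word break
without_space = replace_greek(title, "")  # Greek char is simply dropped
```

With a space, the surrounding text is split into separate search terms; without one, it is fused into a single token. Either could plausibly change what the PQ search returns.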
Current stats (all but line 100 changes applied):
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log
6
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log
2183
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log
573
I've run all UCI UNX files on stage. Dozens of missing ISBNs surface in the logs. I'm starting to keep track of these here: https://docs.google.com/spreadsheets/d/1JvMULxf9XjMf9v5PgXnhTwGsvhsxVBIa3Hycol5I66w/edit?usp=sharing
A few of the smaller UNX files were processed, resulting in a handful of records in two .mrc files. The CSV report generated successfully (though I haven't compared the number of entries to the number of ETDs in Merritt).
Another observation: There are times where we have an entry in the pq_gateway table identified via ISBN, but this does not show up in pq-merritt-match.xml. e.g. 9798557021555
I'm starting to wonder if the process that updates pq-merritt-match.xml is silently failing somehow. pq-merritt-match.xml has over 23K entries. We have thousands more ETDs in Merritt however.
Also note that we have approximately 31K entries in the pq_gateway table in the ETDs database.
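A quick way to spot-check this would be to take a list of ISBNs exported from the pq_gateway table and see which ones never appear in pq-merritt-match.xml. The sketch below assumes nothing about the XML structure (I haven't confirmed which element holds the ISBN), so it just does a plain substring search over the file contents. The function name is hypothetical.

```python
def isbns_missing_from_match_file(isbns, match_xml_text):
    """Return the ISBNs that do not appear anywhere in the match file text."""
    return sorted(isbn for isbn in set(isbns) if isbn not in match_xml_text)


# Usage against the real file (ISBN list exported separately from the database):
#   with open("pq-merritt-match.xml", encoding="utf-8") as fh:
#       missing = isbns_missing_from_match_file(["9798557021555"], fh.read())
```

A substring search can produce false "found" results if an ISBN happens to appear inside a longer number, so treat the output as a starting point rather than a definitive list.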
I attempted to process all UCI UNX files on Stage and hundreds came back as errors in the log (missing ISBN, even when we do have them in the pq_gateway table). It's also noteworthy that, per the log, entries in the July 2022 UNX file were missing 856 fields.
In the majority of error cases in the log, the pq_gateway table seems to have the ISBN in question, which means the cause of the error is something other than originally suspected (I originally thought ISBNs were missing from the db).
Investigate line 497 and use of constant TEST_XSLT, which references PQ-test.xsl.
Error thrown from line 513.
Terry and I discovered that the upd_pq_merritt_match SELECT statement is failing to find matching entries via:
pq_gateway.title = merritt_ingest.obj_title
I'm proceeding to correct database entries in these tables for a series of ETDs that are noted as having "missing" ISBN numbers in a corresponding UNX file. Really what's happening is that although the ISBN numbers in the UNX file are in the ETDs database, the above SELECT can't find the data it needs because the match between the pq_gateway and merritt_ingest tables doesn't occur, due to inconsistent title data.
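To find the affected rows in bulk rather than one at a time, something like the following sketch might help. It lists pq_gateway entries whose title has no exact match in merritt_ingest but does match once case and whitespace are normalized. It uses an in-memory SQLite database as a stand-in for the ETDs database; the `isbn` column name, the normalization rule, and the demo rows are assumptions, and only `pq_gateway.title` and `merritt_ingest.obj_title` come from the SELECT above.

```python
import sqlite3


def normalize_title(title):
    """Lowercase and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(title.lower().split()) if title else ""


def find_title_mismatches(conn):
    """Return (isbn, pq title, merritt title) where only the normalized titles match."""
    pq_rows = conn.execute("SELECT isbn, title FROM pq_gateway").fetchall()
    merritt_titles = [row[0] for row in conn.execute("SELECT obj_title FROM merritt_ingest")]
    exact = set(merritt_titles)
    by_normalized = {normalize_title(t): t for t in merritt_titles}
    mismatches = []
    for isbn, title in pq_rows:
        key = normalize_title(title)
        if title in exact or not key:
            continue  # exact join already works, or there is no title to compare
        candidate = by_normalized.get(key)
        if candidate is not None:
            mismatches.append((isbn, title, candidate))
    return mismatches


def build_demo_db():
    """Build a tiny in-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pq_gateway (isbn TEXT, title TEXT)")
    conn.execute("CREATE TABLE merritt_ingest (obj_title TEXT)")
    conn.executemany(
        "INSERT INTO pq_gateway VALUES (?, ?)",
        [("0000000000001", "A Study of  Title Matching"),
         ("0000000000002", "Exact Match Title")],
    )
    conn.executemany(
        "INSERT INTO merritt_ingest VALUES (?)",
        [("a study of title matching",), ("Exact Match Title",)],
    )
    return conn
```

Running `find_title_mismatches(build_demo_db())` reports only the first demo row, since the second already joins on the exact title. Against the real database the connection and column names would need to be adjusted.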
Steps to rectify:
Marking as Done, as we've now documented steps needed to work through processing files.
The methods in our pqgateway.py script that clean the strings we send as query parameters to ProQuest's federated search gateway need refinement. This is likely the root cause of not having ISBNs in the ETDs database, which in turn prevents the generation of MARC records when a PQ .unx file is being used as a source for said records.
Examples: