CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository
8 stars 4 forks source link

Investigate pqgateway query string clean-up methods #1250

Closed elopatin-uc3 closed 1 year ago

elopatin-uc3 commented 1 year ago

Our pqgateway.py script's methods that clean the strings we send as query parameters to ProQuest's federated search gateway need refinement. This is likely the root cause of not having ISBNs in the ETDs database, which in turn prevents the generation of MARC records when a PQ .unx file is being used as a source for said records.

Examples:

ERROR:
http://fedsearch.proquest.com/search/sru/pqdt?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22Effects+hyperthermia+Sphase+%5Bi.e.%22%20AND%20author%3D%22Mackey%2C+Michael%22

Removed "+%5Bi.e." and the "S" in front of "Sphase" and this works:
(if you don't remove the "S" it retrieves 0 records)
http://fedsearch.proquest.com/search/sru/pqdt?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22Effects+hyperthermia+phase%22%20AND%20author%3D%22Mackey%2C+Michael%22

Diagnostic reply noting 0 records found:
http://fedsearch.proquest.com/search/sru/pqdt?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22Conversion+dietary+glycoside+rutin%22%20AND%20author%3D%22Abdel-Salam%2C+Maha%22

Looks OK (ISBN AND RECORD RETRIEVED):
http://fedsearch.proquest.com/search/sru/pqdt?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22Power+spectral+analysis+inspiratory%22%20AND%20author%3D%22Richardson%2C+Charles%22
elopatin-uc3 commented 1 year ago

Terry and I spent three hours looking at the title and author cleaning methods in pqgateway.py on stage. We're testing a small number of improvements, but still seeing a significant number or PQ search queries return 0 results. On stage, the latest stats are:

No hits:
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log 
593
ETD found:
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log 
2194
Null result (search parameter error):
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log 
6

Improvements attempted:

Debugging:

elopatin-uc3 commented 1 year ago

After adding a space on line 100 (to replace Greek chars with space), worse results:

etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log 
573
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log 
2183
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log 
6

Note that the line 95, 96 change results in the drop in "0 hits" from 593 to 573.

elopatin-uc3 commented 1 year ago

Current stats (all but line 100 changes applied):

etds@:~/apps/uc3-etds/scripts$ egrep -c "^\[\]" temp_pq_log.log 
6
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['1" temp_pq_log.log 
2183
etds@:~/apps/uc3-etds/scripts$ egrep -c "^\['0" temp_pq_log.log 
573
elopatin-uc3 commented 1 year ago

I've run all UCI UNX files on stage. Dozens of missing ISBNs surface in the logs. I'm starting to keep track of these here: https://docs.google.com/spreadsheets/d/1JvMULxf9XjMf9v5PgXnhTwGsvhsxVBIa3Hycol5I66w/edit?usp=sharing

A few of the smaller UNX files processed, resulting in a handful of records in two .mrc files. The CSV report generated successfully (though I haven't compared the number of entries to the number of ETDs in Merritt).

elopatin-uc3 commented 1 year ago

Another observation: There are times where we have an entry in the pq_gateway table identified via ISBN, but this does not show up in pq-merritt-match.xml. e.g. 9798557021555

I'm starting to wonder if the process that updates pq-merritt-match.xml is silently failing somehow. pq-merritt-match.xml has over 23K entries. We have thousands more ETDs in Merritt however.

Also note that we have approximately 31K entries in the pq_gateway table in the ETDs database.

elopatin-uc3 commented 1 year ago

I attempted to process all UCI UNX files on Stage and hundreds came back as errors (missing ISBN, even if we do have them in the pq_gateway table) in the log. It's noteworthy that entries the July 2022 UNX file were missing 856 fields, per the log as well.

elopatin-uc3 commented 1 year ago

In the majority of error cases in the log, the pq_gateway table seems to have the ISBN in question. Which means the cause of the error is something other than originally suspected (I originally though ISBNs were missing from the db).

Investigate line 497 and use of constant TEST_XSLT, which references PQ-test.xsl.

Error thrown from line 513.

elopatin-uc3 commented 1 year ago

Terry and I discovered that the upd_pq_merritt_match SELECT statement is failing to find matching entries via:

pq_gateway.title = merritt_ingest.obj_title
elopatin-uc3 commented 1 year ago

I'm proceeding to correct database entries in these tables for a series of ETDs that are noted as having "missing" ISBN numbers in a corresponding UNX file. Really what's happening is that although the ISBN numbers in the UNX file are in the ETDs database, the above SELECT can't find the data it needs because the match doesn't occur between the pq_gateway and merritt_ingest tables per inconsistent title data.

Steps to rectify:

  1. Get ISBN from etd.log
  2. Look up ISBN at PQ search: https://www.proquest.com/?accountid=14519
  3. Copy title (or portion of title) from PQ search result
  4. Check for title in pqgateway script log (temp_pq_log_EOD1026.log) and find out if a match was found in PQ search. Note this requires print statements that we added for debugging. If the ETD was not found, an entirely new row will need to be added to the pq_gateway table.
  5. Search for title in pq_gateway table
  6. Search for title in merritt_ingest table
  7. Correct discrepancies – make sure to commit database entries
  8. Note trends
elopatin-uc3 commented 1 year ago

Marking as Done, as we've now documented steps needed to work through processing files.