eXtensibleCatalog / Metadata-Services-Toolkit

Tools for processing and aggregating metadata

Saudi issues #588

Open patrickzurek opened 8 years ago

patrickzurek commented 8 years ago

JIRA issue created by: banderson Originally opened: 2011-08-03 05:28 PM

Issue body: (nt)

patrickzurek commented 8 years ago

JIRA Comment by user: banderson JIRA Timestamp: 2011-08-03 05:28 PM

Comment body:

email from stephen:


When I try using either of the OAI URLs in the MST listed at

   http://www.hathitrust.org/data


I get the following error when I try to validate it:

     oaiErrCode: noRecordsMatch The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.


I have tried testing the URL in the Open Archives Initiative Repository Explorer at

     http://re.cs.uct.ac.za/


and I get a large number of "Illegal verbs" squawks.


I have tried both with ":pdus" and without, and end up getting the same results.  However, I am able to see all of the records in well-formed XML when I click on the links on the HathiTrust page.


Any thoughts?  Are there parameters that could allow us to do a one-time dump (without the missing parameters)?  Maybe download and load from a file, somehow?


I have set up OAI repository software here for our oral history database.  Maybe I could try downloading records and using that?


Thanks for any help that you can provide.


Stephen


patrickzurek commented 8 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2011-08-03 07:12 PM

Comment body:

I think we need to do some exploration to assess whether or not there is a problem in the MST (are the correct URLs being used, etc.). If we determine that Hathi is not responding as a valid OAI repository, then perhaps we do nothing with this now. Before we can decide what to do, we need to know more.

patrickzurek commented 8 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2011-08-03 07:28 PM

Comment body:

Question: The URLs on the Hathi site are ListRecords requests that include metadata prefixes and sets. Is that what is needed for the Add Repos page? I thought we needed some URL a level above this, from which we generated the OAI verbs we wanted. So it doesn't seem like that is the URL we would use in our "Add a repos."

patrickzurek commented 8 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2011-08-03 08:35 PM

Comment body:

Based on the test I just did, I was able to add the Hathi repos to the Demo MST. The URL being used by Stephen was the wrong one.

The URLs listed at http://www.hathitrust.org/data are full harvest requests:

http://quod.lib.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=marc21&set=hathitrust
http://quod.lib.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=hathitrust

The correct link would be:

http://quod.lib.umich.edu/cgi/o/oai/oai?

Based on this, I don't know that there is any issue here. Please close unless someone feels there is a reason to leave this open.
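The distinction above (a full harvest request vs. the repository base URL) can be sketched in plain Python. This is illustrative code, not MST code, and the helper names are mine: a harvester is configured with the base URL and generates the OAI verb requests itself.

```python
from urllib.parse import urlencode, urlparse

# Base URL of the repository: what the "Add a repos" page expects.
BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"

def build_request(base_url, verb, **args):
    """Build an OAI-PMH request URL from a base URL plus verb arguments."""
    params = {"verb": verb, **args}
    return base_url + "?" + urlencode(params)

def base_of(full_request_url):
    """Recover the base URL from a full harvest request, such as the
    ListRecords links on the HathiTrust data page."""
    p = urlparse(full_request_url)
    return p.scheme + "://" + p.netloc + p.path

# The URL from the HathiTrust page is a complete ListRecords request:
wrong = ("http://quod.lib.umich.edu/cgi/o/oai/oai"
         "?verb=ListRecords&metadataPrefix=marc21&set=hathitrust")

print(base_of(wrong))                       # the URL to configure the harvester with
print(build_request(BASE_URL, "Identify"))  # the request a validator would send
print(build_request(BASE_URL, "ListRecords",
                    metadataPrefix="marc21", set="hathitrust"))
```

Feeding the full ListRecords URL in as a "base URL" would also explain the "Illegal verbs" errors seen earlier: the tool appends its own `verb=` parameter, producing a request with two conflicting verbs.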

patrickzurek commented 8 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2011-08-05 12:52 PM

Comment body:

There were several obstacles/false starts to Stephen's testing. These included using a full harvest request URL to set up the repos instead of the correct one, trying to run DC records through MARC Norm, and still not setting up the repos correctly (he excluded the ?). Now he has resorted to copying table and field structures from his working ILS MST to his 2nd MST install and trying to muddle through. I have asked him to stop. It is unclear at this point whether there is an MST problem or not.

We need to test. I am willing to try to help, but I do not have a 0.3.5 server to test on.

We need to:

  1. Test Hathi harvest of MARC records through Marc Norm and Transform.
  2. Test Hathi harvest of DC records through DC Transform.

I propose we only do this with a small record set, say 5000 records.

Once we know more we can better prioritize.


Below is copy of last email from Stephen to me.

Randall,

Thanks for the stripped down list AND for the response.

First of all, the URL that shows up for HathiTrust had a question mark on the end. That blew up when I tried it in the OAI Repository Explorer (but I am not sure I tried it in the MST). However, ending with the question mark worked correctly in the MST today, and it validated.

Secondly, the point at which only oai_dc showed up was in creating the Processing Rules (and the same was true today in the same spot). However, it is true that when I was setting up the harvesting, THEN it showed all three (MARC21, oai_dc, and MODS).

Thirdly, somehow it did create the databases that were needed, but not any fields in those. I therefore did a dump of the table structure of xc_marcnormalization, xc_marctoxctransformation, and xc_uncc_atkins_library tables from the successful harvesting of catalog metadata and loaded them into the xc_marcnormalization, xc_marctoxctransformation, and xc_hathi_trust tables respectively in the server where I'm trying to harvest the Hathi Trust records.

Some progress! The harvesting seems to have started working - at least somewhat. I was able to download approximately 30,000 records. However, once they are loaded, I'm not clear as to whether to do marcnormalization or marctoxctransformation. In trying either one, I get "status unknown" in the green box where the progress indicator is located.

In looking at the MST_General_log.txt, it gave the following message:

04 Aug 2011 17:04:09,429 xc.mst.scheduling.Scheduler:391 ERROR [Thread-3] - \ EXCEPTION occured while trying to process next jobtostart ! java.lang.OutOfMemoryError: Java heap space

In the MST_timing_log.txt, it had the text behind its running out of memory (see below):

I looked through the log file and, while I saw where the creation of the other three databases (xc*) was done, I saw nothing about the xc_hathi_trust database until the logs squawked about the lack of a certain field that was supposed to be in there.

I have completely cleaned out all of the files, restored the empty ones, and tried restarting the harvesting. It found a lot more than the 30,000 I had been seeing, but quit at a little over 800 records retrieved and now has "status unknown" on the admin page. I've looked at the logs and don't see anything - certainly no memory problems. It may just have been a glitch or a problem in Ann Arbor.

I guess it's time to go home.

I'll let you know what happens tomorrow. If not you, please let me know who - if anybody - I should work with.

Stephen

04 Aug 2011 16:54:28,412 DEBUG [Thread-9] - Free memory: 7 MB.
04 Aug 2011 16:54:28,412 DEBUG [Thread-9] - Used memory: 56 MB.
04 Aug 2011 16:54:28,412 DEBUG [Thread-9] - Increased by: 0 MB.
04 Aug 2011 16:54:28,412 DEBUG [Thread-9] - Total memory: 63 MB.
04 Aug 2011 16:54:28,413 DEBUG [Thread-9] - Max'm memory: 63 MB.
04 Aug 2011 17:00:50,878 DEBUG [Thread-3] - reset()
04 Aug 2011 17:00:50,879 DEBUG [Thread-3] - timeSinceLastReset: 382476
04 Aug 2011 17:00:51,480 DEBUG [Thread-3] - namedTimers.size(): 19
04 Aug 2011 17:00:51,481 DEBUG [Thread-3] - includeDefault: true
04 Aug 2011 17:00:52,079 DEBUG [Thread-3] - TimingLogger! total: 18828 avg: 18828.00 longest: 18828 num: 1 commit to db
04 Aug 2011 17:00:52,750 DEBUG [Thread-3] - TimingLogger! total: 2224 avg: 2224.00 longest: 2224 num: 1 RECORDS_TABLE.insert
04 Aug 2011 17:00:52,751 DEBUG [Thread-3] - TimingLogger! total: 1716 avg: 1716.00 longest: 1716 num: 1 RECORDS_XML_TABLE.insert
04 Aug 2011 17:00:53,684 DEBUG [Thread-3] - TimingLogger! total: 0 avg: 0.00 longest: 0 num: 5000 RECORDS_XML_LENGTH
04 Aug 2011 17:00:54,371 DEBUG [Thread-3] - TimingLogger! total: 9518 avg: 9518.00 longest: 9518 num: 1 RECORDS_SETS_TABLE.insert
04 Aug 2011 17:00:54,372 DEBUG [Thread-3] - TimingLogger! total: 4 avg: 4.00 longest: 4 num: 1 RECORD_PREDECESSORS_TABLE.insert
04 Aug 2011 17:00:55,180 DEBUG [Thread-3] - TimingLogger! total: 2088 avg: 2088.00 longest: 2088 num: 1 RECORD_OAI_IDS.insert
04 Aug 2011 17:00:55,788 DEBUG [Thread-3] - TimingLogger! total: 3273 avg: 3273.00 longest: 3273 num: 1 RECORD_UPDATES_TABLE.insert
04 Aug 2011 17:00:56,393 DEBUG [Thread-3] - TimingLogger! total: 2 avg: 2.00 longest: 2 num: 1 record_links.insert
04 Aug 2011 17:00:56,985 DEBUG [Thread-3] - TimingLogger! total: 9560 avg: 2390.00 longest: 3505 num: 4 sendRequest
04 Aug 2011 17:00:58,827 DEBUG [Thread-3] - TimingLogger! total: 12157 avg: 2431.40 longest: 3513 num: 5 http
04 Aug 2011 17:00:59,443 DEBUG [Thread-3] - TimingLogger! total: 0 avg: 0.00 longest: 0 num: 5 getSaxBuilder
04 Aug 2011 17:00:59,444 DEBUG [Thread-3] - TimingLogger! total: 911 avg: 227.75 longest: 468 num: 4 sax
04 Aug 2011 17:01:00,041 DEBUG [Thread-3] - TimingLogger! total: 5253 avg: 1313.25 longest: 1768 num: 4 parseRecords
04 Aug 2011 17:01:00,644 DEBUG [Thread-3] - TimingLogger! total: 226 avg: 0.11 longest: 42 num: 2000 getRecordService().parse(recordEl)
04 Aug 2011 17:01:01,235 DEBUG [Thread-3] - TimingLogger! total: 0 avg: 0.00 longest: 0 num: 2000 dyncache.string
04 Aug 2011 17:01:01,236 DEBUG [Thread-3] - TimingLogger! total: 244194 avg: 244194.00 longest: 244194 num: 1 RECORDS_TABLE.insert.create_infile
04 Aug 2011 17:01:01,896 DEBUG [Thread-3] - TimingLogger! total: 0 avg: n/a longest: 0 num: 0 RECORDS_TABLE.insert.load_infile
04 Aug 2011 17:01:02,504 DEBUG [Thread-3] - TimingLogger! total: 627 avg: 627.00 longest: 627 num: 1 System.gc
04 Aug 2011 17:01:03,171 DEBUG [Thread-3] - Free memory: 0 MB.
04 Aug 2011 17:01:03,795 DEBUG [Thread-3] - Used memory: 63 MB.
04 Aug 2011 17:01:03,795 DEBUG [Thread-3] - Increased by: 7 MB.
04 Aug 2011 17:01:04,402 DEBUG [Thread-3] - Total memory: 63 MB.
04 Aug 2011 17:01:05,109 DEBUG [Thread-3] - Max'm memory: 63 MB.
04 Aug 2011 17:01:05,110 DEBUG [Thread-3] - *****

I'm not sure what may have caused it to run out of memory.

Stephen
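The OutOfMemoryError in the log above lines up with the memory figures at its end ("Max'm memory: 63 MB"), which is a small default JVM heap rather than anything MST-specific. A possible remedy, assuming the MST is deployed in Tomcat; the file path and sizes here are illustrative, not taken from this thread:

```shell
# In $CATALINA_HOME/bin/setenv.sh (create it if absent): raise the JVM
# heap before starting Tomcat. Values are illustrative; tune to the machine.
export CATALINA_OPTS="$CATALINA_OPTS -Xms256m -Xmx1024m"
```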

patrickzurek commented 8 years ago

JIRA Comment by user: banderson JIRA Timestamp: 2011-08-16 05:51 PM

Comment body:

we've got an ongoing list of 3 thus far:

seem to be good:

http://quod.lib.umich.edu/cgi/o/oai/oai
http://dl.acs.org.au/index.php/index/oai
http://jsnc.library.caltech.edu/perl/oai2

these ones seem bad:

http://www.asdlib.org/oai/oai.php

    $ curl -s 'http://www.asdlib.org/oai/oai.php?verb=ListRecords&metadataPrefix=oai_dc&from=1960-02-01&until=2011-01-01'
    <?xml version="1.0" encoding="UTF-8"?>
    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    2011-07-27T09:44:32Z
    http://www.asdlib.org/oai/oai.php
    The argument from is an illegal year

http://dspace.maktabat-online.com:8080/oai/request

    $ curl -s 'http://dspace.maktabat-online.com:8080/oai/request?verb=ListRecords&metadataPrefix=oai_dc'

returns an error - java.text.ParseException: Unparseable date: "0001-01-01T00:00:00Z"
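Both failures involve OAI-PMH datestamps: asdlib rejects a well-formed `from` value as "an illegal year", and the DSpace instance emits a datestamp its own Java date parsing cannot handle. As a minimal sketch of what "well-formed" means here (assuming the two granularities the OAI-PMH 2.0 spec allows; this is illustrative code, not from the MST), a harvester can pre-check datestamp arguments before issuing a request:

```python
import re
from datetime import datetime

# OAI-PMH 2.0 permits two datestamp granularities:
# YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DATETIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def valid_datestamp(value):
    """Return True if value is a well-formed, representable OAI datestamp."""
    if DATE_RE.match(value):
        fmt = "%Y-%m-%d"
    elif DATETIME_RE.match(value):
        fmt = "%Y-%m-%dT%H:%M:%SZ"
    else:
        return False
    try:
        datetime.strptime(value, fmt)  # rejects impossible dates like month 13
        return True
    except ValueError:
        return False

print(valid_datestamp("1960-02-01"))            # well-formed; asdlib's rejection is server-side
print(valid_datestamp("0001-01-01T00:00:00Z"))  # well-formed, even if DSpace's parser chokes
print(valid_datestamp("2011-13-01"))            # malformed: no month 13
```

By this check, both repositories above are at fault on the server side: one rejects a valid argument, the other emits a datestamp it later cannot parse itself.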

patrickzurek commented 8 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2011-08-16 06:02 PM

Comment body:

The "good ones": are those the ones from the Saudi folks? Did you actually harvest records that were then visible in the UI Browse Records screen?

patrickzurek commented 8 years ago

JIRA Comment by user: banderson JIRA Timestamp: 2011-08-17 01:28 PM

Comment body:

http://jsnc.library.caltech.edu/perl/oai2 worked fine for me.

http://dl.acs.org.au/index.php/index/oai worked fine for me.

hathitrust worked fine for me.

I got the MST_General_log.txt file from Ahmed. It shows a load of problems which I will document.