NCAR / joai-project

jOAI is an OAI-PMH data provider and harvester Java web application
Apache License 2.0

Index incomplete #6

Closed joergklausen closed 4 years ago

joergklausen commented 4 years ago

In the Metadata Files Configuration, I consistently find that only about 45-50% of the XML records on disk end up being available through the jOAI provider. Currently, about 40'000 files are present on disk and are listed as 'Successfully indexed'; however, the 'Total number of items in the index' is only about 16'000. I can provide more details once I see that someone takes an interest in this problem. I have not been able to contact any of the developers directly by e-mail so far.

jweatherley commented 4 years ago

I may be able to help. Please send more information about what you are experiencing. Are there any errors in the indexing report?

joergklausen commented 4 years ago

Dear John, thanks for responding! I am not aware of a special indexing report; where would I find it? Here's a screen capture of the indexing process.

[Image: screen_capture]

The test environment is accessible at https://oscardevt.meteoswiss.ch/oai/oaisearch.do, although the admin pages are not. I could also set up a Webex meeting to share screens for further investigation (I would need an e-mail address to send the invitation to, though). Kind regards, Jörg

joergklausen commented 4 years ago

I should mention that not all of the XML files validate against the specified XSD, e.g. https://oscardevt.meteoswiss.ch/oai/provider?verb=GetRecord&metadataPrefix=wmdr&identifier=oai:meteoswiss.ch:0-20008-0-JFJ. The missing elements are indicated in a comment section at the beginning of the file. However, all of these files are well-formed XML.

jweatherley commented 4 years ago

jOAI does not validate; it only requires that the files be well-formed.

One thing to verify is that the OAI identifiers are unique. If there are duplicate identifiers, only the first one that is encountered will be included.
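Both checks can be run outside jOAI. The following is a minimal Python sketch, not part of jOAI: it assumes the records are `*.xml` files somewhere under a single directory and, as a stand-in for the OAI identifier, uses the file name without its extension (how jOAI actually derives the identifier depends on the repository configuration). The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def find_malformed(records_dir):
    """Return paths of *.xml files under records_dir that are not well-formed."""
    malformed = []
    for path in sorted(Path(records_dir).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError:
            malformed.append(path)
    return malformed


def find_duplicate_stems(records_dir):
    """Return file-name stems that occur more than once under records_dir.

    Assumption: the stem stands in for the OAI identifier.
    """
    counts = Counter(p.stem for p in Path(records_dir).rglob("*.xml"))
    return sorted(stem for stem, n in counts.items() if n > 1)
```

If both functions return empty lists, malformed files and clashing file names are probably not the cause, and the indexing log is the next place to look.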


joergklausen commented 4 years ago

The identifiers should be unique: the XML files are named after the identifiers, and I have close to 40'000 unique files at present. Any other ideas?

jweatherley commented 4 years ago

If you can identify a small subset of the records that should be in your index but are not, say 1 or 2 records, try indexing them in a separate jOAI instance. If they do not index there either, it should be easier to inspect the file contents and/or the indexing log for hints as to the problem. If there are no obvious content or indexing errors, try editing and re-indexing one of the files until it indexes successfully, and work from there.


joergklausen commented 4 years ago

Hi there, thanks for your advice so far. I cannot easily install another instance of jOAI (I think), but what I have done is move all except a few of the XML files to another location. I expected the 'Reindex all files' command to take that into account. It didn't. I also used the advanced option to 'Reset the index'. In both cases, the total number of items in the index remained at 16537 (I expected 3). Is there another way to wipe the index and start over again?

jweatherley commented 4 years ago

When files are removed from the directory, the jOAI 'Reindex all files' command does not remove the items from the index but rather marks the items as deleted (see the OAI-PMH spec section 2.5.1 regarding deleted records), and the index size remains the same. Marking the items as deleted notifies the remote harvesters to also delete them from their repositories.
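To see how many items in the index are only deletion markers, one option is to count the `status="deleted"` headers in a `ListIdentifiers` response. This is a rough Python sketch under stated assumptions: the response XML has already been fetched (for example with `verb=ListIdentifiers&metadataPrefix=wmdr`), `count_headers` is a hypothetical helper name, and only a single response page is handled, so a response carrying a resumptionToken would need to be followed page by page.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def count_headers(response_xml):
    """Return (active, deleted) header counts for one OAI-PMH response page."""
    root = ET.fromstring(response_xml)
    active = deleted = 0
    for header in root.iter(OAI_NS + "header"):
        # Deleted records carry status="deleted" on their header element.
        if header.get("status") == "deleted":
            deleted += 1
        else:
            active += 1
    return active, deleted
```

Summed over all pages, the two counts together should roughly account for the index size reported by jOAI, if the explanation above applies.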

Using the 'Reset the index' option will wipe the index and create a new one from scratch, assigning new datestamps to each of the items and removing any prior record of deletions. In this case the new index size should match the number of records (e.g. 3).

I just checked and your repository index appears to have been wiped and is now reporting 0 items: https://oscardevt.meteoswiss.ch/oai/oaisearch.do

The record example you sent earlier is also gone, e.g. https://oscardevt.meteoswiss.ch/oai/provider?verb=GetRecord&metadataPrefix=wmdr&identifier=oai:meteoswiss.ch:0-20008-0-JFJ .

The jOAI FAQ may have some useful information to help troubleshoot: https://oscardevt.meteoswiss.ch/oai/docs/faq.jsp

As well as the Data Provider Documentation: https://oscardevt.meteoswiss.ch/oai/docs/provider.jsp

If no errors are reported under 'Indexing Errors' on the Metadata Files Configuration page and the index still does not match what you expect, another place to check for error messages is the Tomcat catalina.out log file (or the equivalent for other servlet containers); jOAI may output messages there in certain cases.
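A large catalina.out is tedious to read by eye, so a small filter can help. The Python sketch below is only an illustration: the search patterns are assumptions about what a servlet container or the JVM might log, not a list of messages jOAI is known to emit, and `suspicious_lines` is a hypothetical helper name.

```python
import re

# Assumed patterns; adjust to whatever your container actually logs.
SUSPICIOUS = re.compile(r"SEVERE|ERROR|Exception|No space left on device")


def suspicious_lines(log_path, pattern=SUSPICIOUS):
    """Return (line_number, text) pairs for log lines matching pattern."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if pattern.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling `suspicious_lines` with the path to catalina.out returns the matching lines with their line numbers, which makes it easier to find the surrounding context in the log.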


joergklausen commented 4 years ago

With help from @jweatherley (kudos!) and our sysadmins, it seems we finally traced the problem to insufficient disk space allocated for the repository and the index. If others run into similar issues, my advice is to inspect catalina.out!
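For anyone who wants to rule out this cause early, free space on the volume that holds the repository and index can be checked with the Python standard library. This is a minimal sketch: `free_space_report` is a hypothetical helper, the 10% threshold is an arbitrary assumption, and `shutil.disk_usage` reports on the whole file system, so a per-user or per-directory quota would not show up here.

```python
import shutil


def free_space_report(path, min_free_fraction=0.10):
    """Summarise free space on the file system that contains path.

    min_free_fraction is an arbitrary warning threshold (assumption).
    """
    usage = shutil.disk_usage(path)
    fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "free_bytes": usage.free,
        "free_fraction": fraction,
        "low": fraction < min_free_fraction,
    }
```

Run it against the directory that holds the XML records and against the directory where the index is written, since the two may be on different volumes.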