IQSS / dataverse

Open source research data repository software
http://dataverse.org

Harvesting : message "javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException" #9318

Open arnaumevi opened 1 year ago

arnaumevi commented 1 year ago

Hi, I'm having trouble harvesting clients with Dataverse 5.11.1. I get the message "javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException" in the server log.

Client configurations:

Here is the log for the attempt: harvest_UAB_2023-01-24T13-21-32.log

Thank you for your time in advance, Best Regards, Arnau

landreev commented 1 year ago

Thank you. Just to confirm, you WERE able to harvest from this OAI archive successfully, before upgrading to 5.11.1, correct?

landreev commented 1 year ago

A quick followup: This isn't mentioned in this issue here, but the original report in the Google group suggests that these failures started happening after the upgrade to 5.11.1. Having looked at this OAI server and the failures, I don't think these OAI_DC records would have been imported successfully by any version of Dataverse. So if you were able to harvest from this archive previously, they must have changed their record format on the server side since then.

The short answer is that Dataverse can't import these OAI_DC records because they don't have persistent identifiers in any of the <dc:identifier> fields, for example:

  <dc:identifier>https://ddd.uab.cat/record/166606</dc:identifier>
  <dc:identifier>urn:oai:ddd.uab.cat:166606</dc:identifier>
  <dc:identifier>urn:10.5565/ddd.uab.cat/166606</dc:identifier>
  <dc:identifier>urn:articleid:14712202</dc:identifier>

i.e. Dataverse wants one of these fields to contain either a DOI or a Handle identifier.
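To illustrate the rule above: Dataverse's actual check lives in its Java import code, but the gist of "at least one <dc:identifier> must contain a DOI or Handle" can be sketched in Python. The patterns and the function name here are invented for illustration, not taken from Dataverse:

```python
import re

# Rough approximations of DOI and Handle forms; the real validation in
# Dataverse's Java code may differ in detail.
DOI_PATTERN = re.compile(
    r"^(doi:|https?://(dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.IGNORECASE
)
HANDLE_PATTERN = re.compile(
    r"^(hdl:|https?://hdl\.handle\.net/)\S+$", re.IGNORECASE
)

def find_persistent_id(identifiers):
    """Return the first dc:identifier that looks like a DOI or Handle, else None."""
    for value in identifiers:
        v = value.strip()
        if DOI_PATTERN.match(v) or HANDLE_PATTERN.match(v):
            return v
    return None

# The identifiers from the failing record above: none of them matches,
# so the import is rejected. Note that the "urn:10.5565/..." value contains
# a DOI-like substring but is not itself a recognizable DOI.
record = [
    "https://ddd.uab.cat/record/166606",
    "urn:oai:ddd.uab.cat:166606",
    "urn:10.5565/ddd.uab.cat/166606",
    "urn:articleid:14712202",
]
print(find_persistent_id(record))  # None
```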

This is our fault, in more than one way:

  1. It obviously shouldn't be failing in such a confusing, unclear manner. (There's nothing informative in that harvesting log, and there's a mess of stacktraces left in the main server.log.)
  2. We may not really need to enforce this requirement, that a dataset must have a persistent id, on harvested datasets (as opposed to "real", local datasets). All we need is a working url that we can use to redirect the Dataverse user back to the archival location of the data, and the first of the identifiers in the record above is a valid url we could use for that. Without persistent ids it becomes more difficult, and less reliable, to ensure that we are not importing duplicate copies of the same data record; but then again, duplicates are probably much less of a problem with harvested datasets.
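The relaxed rule in point 2 could be sketched as follows. This is a hypothetical fallback, not anything Dataverse currently implements: if no DOI or Handle is found, accept the first identifier that parses as an http(s) URL as the link back to the remote archive.

```python
from urllib.parse import urlparse

def fallback_archival_url(identifiers):
    """Hypothetical relaxed rule: return the first dc:identifier that parses
    as an http(s) URL, to use as the redirect target for the harvested record."""
    for value in identifiers:
        parsed = urlparse(value.strip())
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return value.strip()
    return None

# Applied to the failing record, the first identifier is a usable URL.
record = [
    "https://ddd.uab.cat/record/166606",
    "urn:oai:ddd.uab.cat:166606",
]
print(fallback_archival_url(record))  # https://ddd.uab.cat/record/166606
```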

We have an open issue for improving the client-side harvesting functionality that should address 1. above: we'll make more and better diagnostics visible to the admin. I'm hoping it will be prioritized and addressed soon. As for 2., I have brought this up with the dev team and we have at least started talking about it.

But, unfortunately, this is not something we can fix for you right away, nor something you can fix yourself with a configuration change.

tjouneau commented 8 months ago

This is related to the previous issue: