IQSS / dataverse

Open source research data repository software
http://dataverse.org

Trying to set up or complete a harvesting client through the API crashes Dataverse #8290

Closed tjouneau closed 2 years ago

tjouneau commented 2 years ago

What steps does it take to reproduce the issue?
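The curl command itself is not shown above; a plausible form, based on the Harvesting Clients API (SERVER_URL and API_TOKEN are placeholders, and the POST verb and Content-Type header are assumptions here), would be:

# Create a harvesting client named "zenodo_lmops" from the client.json payload below.
# SERVER_URL and API_TOKEN stand in for the instance URL and the user's API token.
curl -H "X-Dataverse-key:$API_TOKEN" \
     -H "Content-Type: application/json" \
     -X POST "$SERVER_URL/api/harvest/clients/zenodo_lmops" \
     --upload-file client.json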

where client.json is as follows (I removed only the information about the last harvests):


{
    "nickName": "zenodo_lmops",
    "dataverseAlias": "lmops",
    "type": "oai",
    "harvestUrl": "https://zenodo.org/oai2d",
    "archiveUrl": "https://zenodo.org",
    "archiveDescription": "Moissonné depuis la collection LMOPS de l'entrepôt Zenodo. En cliquant sur ce jeu de données, vous serez redirigé vers Zenodo.",
    "metadataFormat": "oai_dc",
    "set": "user-lmops",
    "schedule": "none",
    "status": "inActive",
  }

The response to the curl command was as follows, and Dataverse / Payara went down.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>

The server.log file did not show anything particularly relevant, just stopping at:

[2021-12-02T09:00:30.635+0100] [Payara 5.2020] [INFO] [] [edu.harvard.iq.dataverse.api.HarvestingClients] [tid: _ThreadID=89 _ThreadName=http-thread-pool::jk-connector(1)] [timeMillis: 1638432030635] [levelValue: 800] [[
  retrieved Harvesting Client zenodo_lmops with the GetHarvestingClient command.]]

Which version of Dataverse are you using?
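For reference, the running Dataverse version can be read from the Info API; a minimal check, with SERVER_URL as a placeholder, is:

# Query the Info API for the running Dataverse version (no API token needed).
curl "$SERVER_URL/api/info/version"
# The response is a small JSON object containing the version and build number.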

Any related open or closed issues to this bug report?

landreev commented 2 years ago

This was discussed and the decision was made to make the Create/Edit/Delete APIs superuser-only (as implemented, a user with edit permission on the host collection was allowed to create and modify clients). From the Slack discussion:

kcondon: Would like help on sorting out the behavior of the harvesting client API. As tested, it allows collection admins to create and modify harvesting clients, just not delete them. In the UI, only superusers can do this. Is this what we want? A significant possible downside, without additional coding, is that two collections harvesting from the same source/set would collide and potentially get partial lists, since a dataset can only exist once in the app.

landreev: I can confirm that it's implemented like this :arrow_up: on purpose, but I have no recollection of why. (It's implemented at the command level, but in the UI only superusers can get to the harvesting dashboard.) Kevin has a point; it's a bit strange. IMO this is a bit out of scope, but it's not too much effort to make the API superuser-only. We were wondering if anyone else has thoughts. The rationale may have been as simple as "we allow people to add linked content to collections, why not allow them to harvest also", but Kevin's argument (what if two different collections decide to harvest from the same remote archive?) does show that it's impractical.

pdurbin: I’m fine with superuser only for all operations.

Julian: I agree about making the endpoints superuser-only. But do superuser-only endpoints conflict with the user story? If all three endpoints are made superuser-only, will someone want to create a new issue about letting non-superusers manage harvesting clients?

landreev: The more I think about it, the less I can think of any practical value in letting non-superusers create and/or mess with harvesting clients. And, to be clear, "superuser-only" here means that it will stay under /api/harvest/clients, so somebody with a superuser API token - like you - would be able to use it remotely; it's not going to be a localhost-only API. But for non-superuser, collection-level admins, the scenario should be: if they want some content harvested and showing up in their collection, they should ask support/a superuser admin to set up the harvest and get that content; then they can link it into their collection if they so desire. If anyone else wants these harvested datasets to show in their collection, they can link them too. This avoids the mess of two different collections trying to harvest the same archive (and both getting only parts of it, or maybe the one that harvests earlier in the day getting all the datasets, etc.).
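As a rough sketch of the workflow described above (a superuser sets up and runs the harvest, and collection admins then link harvested datasets into their own collections), assuming the Harvesting Clients and dataset-linking endpoints, with SERVER_URL, the API tokens, and DATASET_ID as placeholders:

# 1. A superuser creates the harvesting client (payload as in client.json above).
curl -H "X-Dataverse-key:$SUPERUSER_API_TOKEN" \
     -H "Content-Type: application/json" \
     -X POST "$SERVER_URL/api/harvest/clients/zenodo_lmops" \
     --upload-file client.json

# 2. Once datasets have been harvested, a user with sufficient permissions links an
#    individual harvested dataset into the lmops collection.
curl -H "X-Dataverse-key:$API_TOKEN" \
     -X PUT "$SERVER_URL/api/datasets/$DATASET_ID/link/lmops"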