Spike: DataCite request to avoid creating unnecessary file DOIs

mreekie commented 1 year ago

Description:

Determine how best to respond to this issue.

Options include prioritizing the work to make file DOIs be off by default and able to be turned on per collection, and/or digging further into technical options and the tradeoffs involved in those. The conversation below covers the initial discussions (in email).

Context:

We received a report from Philipp (CC'd) of an issue registering file-level DOIs for a dataset with many files. From my understanding, there was an error publishing the dataset within Dataverse, but the DOIs were registered anyway. In this case, this resulted in almost 28,000 file-level DOIs being registered. Ultimately, the dataset was successfully published with a much smaller number of files, meaning most of the 28K DOIs were not necessary.

As a result of this report, I was wondering about the timing of DOI registration relative to dataset publication in Dataverse. I remember when DataCite had the December 1 outage, there were reports of Dataverse users being unable to publish datasets due to failed DOI registration.

My thinking is that (ideally) DOIs shouldn't be registered if there is an error within Dataverse, and an error on DataCite's end shouldn't prevent dataset publication. Do you have any insight into how this works, and if there is any planned development in this area?

sbarbosadataverse commented 1 year ago

This is definitely tricky. We wrap everything on the Dataverse side into a transaction and call DataCite as one of the last steps. That means most/all of the failures where file PIDs are created are because there’s been a failure interacting with DataCite. We’ve already moved things like reindexing that can be problematic outside the transaction so publication proceeds in those cases. I think we used to leave the dataset locked if a failure occurred which would mean an admin would have to unlock it before any attempt to republish or to make changes to the dataset could occur. That was pretty burdensome but without that lock, for Dataverses that allow self publishing, there’s not much that can be done to stop users from making changes before publishing again.

Aside from having transaction support at DataCite, I think the best we could do technically would be to allow publication to complete and report DOI registration failures to admins so something can be done manually. That potentially leaves a Dataset advertising file DOIs that aren’t public and/or don’t exist yet at DataCite which would be problematic as well. (FWIW: We register datasets early and just publicize them at publication but my PR to register file DOIs during upload has been sitting over concern that it makes upload slower/potentially causes upload to fail instead.)

If there’s evidence that sending thousands of requests is the problem (versus some other outage), we should be able to add some throttling and/or retries to avoid some of the problems.
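To make the throttling/retry idea concrete, here is a minimal sketch in Python (not Dataverse's actual Java code). Everything in it is an assumption for illustration: `TransientError`, `call_with_retries`, `register_all`, and the `register_one` callback are hypothetical names, and the rate of 3 calls per second is only a placeholder.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or temporary failure at the DOI provider."""


def call_with_retries(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); on TransientError, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide how to roll back
            sleep(base_delay * 2 ** attempt)


def register_all(dois, register_one, max_per_second=3.0, sleep=time.sleep):
    """Register DOIs one at a time, pausing so we stay under max_per_second."""
    min_interval = 1.0 / max_per_second
    registered = []
    for doi in dois:
        started = time.monotonic()
        call_with_retries(lambda: register_one(doi), sleep=sleep)
        registered.append(doi)
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)  # throttle: spread requests out
    return registered
```

Retries only help with short-lived failures; a long outage would still exhaust the attempts and fail publication, so this reduces the problem rather than removing it.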

Perhaps the best way to minimize the problem in a practical sense would be to prioritize the work to have file PIDs turned off by default and to make them configurable per collection. That would help limit problems to when people really do want file DOIs and would be willing to wait to retry publication, etc. Switching the default would be trivial and I think even making file PIDs selectable per collection is small enough it could get into the next release. Sonia raising that with Stefano would be the way to make sure that happens.

Hope that helps. It may be that Gustavo or others have additional ideas of how we could reduce the chances of this happening – I’m still reloading Dataverse into my head after break.

Jim

mreekie commented 1 year ago

@siacus This may be a candidate when we apply for additional work.

sbarbosadataverse commented 1 year ago

Response to Jim from Kelly Stathis (DataCite):

Hi Jim,

This does help, thanks! I think I confused things by grouping two issues together in my initial email (I am also reloading work into my head after break!). With the original issue Philipp encountered, the ~28K DOIs were registered despite the dataset publication failing. So in this case, there wasn't a failure interacting with DataCite (that we can tell), but a failure in Dataverse related to a large number of files. (Philipp, do you have any further details on the error for this dataset?)

If I have this right now: DOI registration, including file DOI registration, happens before publication: If DOI registration fails, it blocks publication. If DOI registration succeeds, publication can proceed (and then either succeed or—occasionally—fail).

I would hesitate to register DOIs upon file upload, but for a different reason: the files being uploaded aren't necessarily finalized until the dataset is published. For example, researchers may upload a version of their dataset, then realize they need to make a change to the files, and remove and re-upload them.

There is a draft state that can be used to reserve a DOI before it is registered. Because drafts can be deleted, this could be an option for reserving DOIs upon file upload—if a researcher deletes a file, you can just delete the draft DataCite record. However, I'm not sure how much time this saves (you'd still need to send a PUT request upon dataset publication to change the state from Draft to Findable).

I agree that turning off file PIDs by default would be beneficial; Sonia, it would be great if that is something you can raise with Stefano! Let me know if I can provide any information to support this.

Many thanks, Kelly

sbarbosadataverse commented 1 year ago

Some comments/thoughts inline.

Cheers,

-- Jim

> So in this case, there wasn't a failure interacting with DataCite (that we can tell), but a failure in Dataverse related to a large number of files. (Philipp, do you have any further details on the error for this dataset?)

It’s possible that this is just some memory limitation in the final commit, or perhaps in the loop to create DOIs itself, but I don’t think there is anything iterating through files after the DOIs are publicized. If the last file failed at DataCite, all the previous ones would already have been created/made public at DataCite, and we would then roll back the dataset to unpublished.

> If I have this right now: DOI registration, including file DOI registration, happens before publication

DOIs are made public during the publication transaction. In the case of file DOIs, this is also the first time we contact DataCite about those DOIs.

> If DOI registration fails, it blocks publication.

Yes - the transaction rolls back if any of the many file DOI calls fail.

> If DOI registration succeeds, publication can proceed (and then either succeed or, occasionally, fail).

Yes – there is not much that is done besides completing the transaction at that point, but that could potentially cause a memory issue. Things that we know can fail, and that use more memory, like indexing, already occur after the publication transaction ends and if they fail, the dataset remains published.

> I would hesitate to register DOIs upon file upload, but for a different reason: the files being uploaded aren't necessarily finalized until the dataset is published. For example, researchers may upload a version of their dataset, then realize they need to make a change to the files, and remove and re-upload them.
>
> There is a draft state that can be used to reserve a DOI before it is registered. Because drafts can be deleted, this could be an option for reserving DOIs upon file upload; if a researcher deletes a file, you can just delete the draft DataCite record. However, I'm not sure how much time this saves (you'd still need to send a PUT request upon dataset publication to change the state from Draft to Findable).

Yes, we’re aware of the draft state. We/I use the term ‘registration’ to imply contacting DataCite to create the DOI in the draft state. ‘Publicize’ is when we move to Findable. Before registering/creating the Draft DOI, the DOI has been generated in our software but DataCite doesn’t know about it. (Since Dataverse is usually the only software using a given authority/shoulder, we haven’t run into collisions but the reason to ‘register’ earlier is to get that draft state record into DataCite.)

With dataset DOIs, we register/create the draft version when the draft dataset is created. If someone then deletes the dataset, we go delete the DOI at DataCite as well. For datasets, it is only the change to findable that is happening during publication.
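Since the terminology is easy to mix up, here is a small Python sketch of the lifecycle as described in this thread. It is an illustrative model only, not Dataverse or DataCite code: `DoiState`, `advance`, and `can_discard` are hypothetical names, and the direct generated-to-findable step for file DOIs is my reading of "this is also the first time we contact DataCite about those DOIs".

```python
from enum import Enum


class DoiState(Enum):
    GENERATED = "generated"  # exists only in Dataverse; DataCite does not know it
    DRAFT = "draft"          # 'registered' in the sense above: a draft at DataCite
    FINDABLE = "findable"    # 'publicized': made public at publication


# Transitions described in the thread (an illustrative model, not an API).
ALLOWED = {
    (DoiState.GENERATED, DoiState.DRAFT),     # dataset DOI: draft dataset created
    (DoiState.DRAFT, DoiState.FINDABLE),      # dataset DOI: publication
    (DoiState.GENERATED, DoiState.FINDABLE),  # file DOI: first contact at publication
}


def advance(current, target):
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def can_discard(state):
    """A DOI can be withdrawn until it is Findable; drafts are deleted at DataCite."""
    return state is not DoiState.FINDABLE
```

The point of the model is the last function: once a DOI is Findable there is no undo, which is why ~28K file DOIs that went straight to Findable could not simply be cleaned up.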

One other technical option that I didn’t think of yesterday, aside from creating the draft file DOIs during upload, would be to create the draft DOIs inside the publication transaction and move the task of making the DOIs public to run after the transaction is complete. The advantage is that the DOIs would stay in a deletable state until the dataset was truly published, but it would mean making twice as many calls to DataCite. (I think we manage 2-3 per second, so 56K calls instead of 28K would take a fairly long time.) As with the idea of just doing all the DOI work after publication, this could leave a dataset with unpublished DOIs unless/until an admin intervenes. However, if the publication itself failed, the draft DOIs could be deleted. This is definitely a bigger project than making the default no file PIDs and/or making that configurable per collection. (Or perhaps we could have a setting that lets admins decide how many files is too many, i.e. datasets with fewer than 100 files get PIDs and bigger datasets don’t? Since DOIs are a cost issue, that might be the most direct way to avoid unexpected charges.)
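A rough Python sketch of that two-phase idea, under stated assumptions: `FakeProvider` is an in-memory stand-in (not the DataCite API), `commit` stands in for completing the publication transaction, and `publish` and `hours_of_calls` are hypothetical names.

```python
class FakeProvider:
    """In-memory stand-in for a DOI provider; not the DataCite API."""

    def __init__(self, fail_on=None):
        self.state = {}
        self.fail_on = fail_on  # DOI whose draft creation should fail

    def create_draft(self, doi):
        if doi == self.fail_on:
            raise RuntimeError(f"provider rejected {doi}")
        self.state[doi] = "draft"

    def delete_draft(self, doi):
        if self.state.get(doi) == "draft":
            del self.state[doi]

    def make_findable(self, doi):
        self.state[doi] = "findable"


def publish(file_dois, provider, commit):
    """Create drafts inside the 'transaction'; go Findable only after commit."""
    drafted = []
    try:
        for doi in file_dois:
            provider.create_draft(doi)
            drafted.append(doi)
        commit()  # stand-in for completing the publication transaction
    except Exception:
        for doi in drafted:  # publication failed: drafts are still deletable
            provider.delete_draft(doi)
        raise
    pending = []  # DOIs left in draft for an admin to follow up on
    for doi in file_dois:
        try:
            provider.make_findable(doi)
        except Exception:
            pending.append(doi)
    return pending


def hours_of_calls(n_calls, calls_per_second):
    """Rough wall-clock estimate for n_calls sequential requests."""
    return n_calls / calls_per_second / 3600
```

At the 2-3 calls per second mentioned above, `hours_of_calls` puts 56K calls at roughly 5 to 8 hours, versus about 2.6 to 3.9 hours for 28K, so the doubling is the main cost of this approach.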

mreekie commented 1 year ago

Priority Review with Stefano:

Moved from Harvard Dataverse Instance to Ordered Backlog

mreekie commented 1 year ago

Sizing:

This came in from the email conversation - see above
Spike is to scope our response to the issue.
The discussion needs to occur, starting with Sonia, Stefano, Julian. Jim has volunteered to represent the dev on this discussion. (His opinion is expressed above). Phil also can provide input.
This is the span of a meeting. Size: 3

pdurbin commented 1 year ago

> prioritizing the work to make file DOIs be off by default

@aialves just mentioned she was surprised that files get PIDs by default...

... I do think the default should be that files don't get PIDs.

mreekie commented 1 year ago

See also: #5283. These two issues are related; when this issue goes to QA, make sure that the problems raised in that issue are covered. Also, consider using/requesting the batch API from DataCite.

kcondon commented 1 year ago

I just tested this on test with 1000 files and it registered them without error. This is only one case, but it works. This successful test does not address handling of failures due to other causes (slow service, service drop).

cmbz commented 11 months ago

2023/09/18

We have no mechanism for handling errors/timeouts on requests to external services.
This was successfully tested on a smaller dataset (1000 files), but the original use case of 28,000 files was not tested, and the test ran in the DataCite test environment, not in their production environment (e.g., there may be differences in performance between the two).
This is an example of a general class of design problem involving interaction with external services. Other examples include: DOI minting/publishing datasets, controlled vocabularies.
Closing this issue now, but we should follow up with a technical discussion about options for addressing the problem as a class rather than as a one-off.

qqmyers commented 11 months ago

FWIW: This is also ameliorated by the new functionality to allow file DOIs to be enabled/disabled per collection and switching the default to not enable file DOIs. This doesn't solve the technical issue of leaving a dataset in a draft state after file DOIs have been minted/made public for some files if there has been a Dataverse-DataCite communication error, but it should help admins limit the cases where file DOIs would be used accidentally.
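For anyone trying to picture the policy, here is a minimal Python sketch, not the actual Dataverse implementation. The assumptions are mine: that an unset collection inherits from its parent, and that the file-count cutoff floated earlier in the thread is included as an optional extra. `Collection`, `resolve_file_pids`, and `should_mint_file_pids` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Collection:
    name: str
    parent: Optional["Collection"] = None
    file_pids_enabled: Optional[bool] = None  # None means "not set here"


def resolve_file_pids(collection, instance_default=False):
    """Walk up the tree; the nearest explicit setting wins (assumed inheritance)."""
    node = collection
    while node is not None:
        if node.file_pids_enabled is not None:
            return node.file_pids_enabled
        node = node.parent
    return instance_default  # default is off, as discussed in this issue


def should_mint_file_pids(collection, file_count, max_files=None):
    """Mint file PIDs only if enabled and, optionally, under a file-count cutoff."""
    if not resolve_file_pids(collection):
        return False
    return max_files is None or file_count < max_files
```

With a cutoff of 100, a 28,000-file dataset in an enabled collection would get no file DOIs, which addresses the cost concern but not the underlying Dataverse-DataCite error handling.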

IQSS / dataverse

Spike: DataCite request to avoid creating unnecessary file DOIs #9272