IQSS / dataverse

Open source research data repository software
http://dataverse.org

Improving perceived DOI registration/publicizing speed. #7393

Closed qqmyers closed 1 month ago

qqmyers commented 3 years ago

With file PIDs enabled, and with large datasets (100s to 1000s of files), the time to register files at creation/upload (implemented in #7334), and to publicize those DOIs during publication, is large enough to be noticeable (simple tests on test servers show something like 2 DOI changes per second, so ~10 minutes for 1K files). Given realistic file sizes, it isn't clear whether this is that significant during uploads (presumably 1000 'average' files take longer than 10 minutes to upload for most people?), but it is probably a big part of the wait at publication.

I'm opening this issue to consider options for how the real or perceived performance might be improved. One option would be to make the changes asynchronously, e.g. the way that archiving can be done in a post-publication workflow. This would result in the user seeing file DOIs appear, or become public, over time rather than being assigned/public when the save/publish operation completes. (Actually, since it takes significant time for DataCite to push newly publicized DOIs out to its index servers, there's already a delay between publish and the DOIs being resolvable.)

It may also be possible to speed submission by parallelizing the calls to DataCite, or, if those need to be throttled and the delay is partially in Dataverse preparing the metadata to send, by parallelizing just the prep steps and putting the DataCite calls in a pool.
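As a rough illustration of that second idea, here is a minimal Java sketch. The `prepareMetadata` and `registerWithDataCite` helpers are hypothetical stand-ins for Dataverse's real metadata and DOI code: metadata prep runs wide across a work-stealing pool, while the outbound DataCite calls go through a small fixed pool so the request rate stays bounded.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PidRegistrationSketch {
    // Wide pool for local (CPU-bound) metadata prep; narrow pool to throttle outbound calls.
    private final ExecutorService prepPool = Executors.newWorkStealingPool();
    private final ExecutorService dataCitePool = Executors.newFixedThreadPool(4);

    public void registerAll(List<FileRecord> files) {
        List<CompletableFuture<Void>> futures = files.stream()
                .map(f -> CompletableFuture
                        .supplyAsync(() -> prepareMetadata(f), prepPool)            // prep in parallel
                        .thenAcceptAsync(this::registerWithDataCite, dataCitePool)) // pooled I/O
                .toList();
        // Wait for the whole batch before reporting success.
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
    }

    // Hypothetical helpers standing in for Dataverse's real metadata/DOI code.
    String prepareMetadata(FileRecord f) { /* build the DataCite metadata */ return "..."; }
    void registerWithDataCite(String metadata) { /* send the registration request */ }

    record FileRecord(String storageId) {}
}
```

Even with a small pool, the outbound side would still need throttling against whatever overall limit DataCite imposes; see the exchange below.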

Some things that would help prioritize and refine this issue: knowing how much of a concern this is to installations, whether others have experience sending large numbers of DOI requests to DataCite, what performance DataCite supports, and whether they offer bulk/asynchronous calls we could use.

I may investigate some, but this issue may sit for a while without community input.

djbrooke commented 3 years ago

Thanks @qqmyers for creating the issue.

@scolapasta and @landreev I'm moving this to Needs Discussion so that we can... discuss in a tech hour. I'd like to figure this out before merging this PR:

We have messaging built up at publish time to handle a potential long delay around reserving file PIDs, but if we move the file PID reservation up in the workflow we could see a situation where there's no delay-related messaging and a long wait.

qqmyers commented 3 years ago

FWIW: Had an exchange with DataCite on this: As a developer working on Dataverse (https://github.com/IQSS/dataverse), I’d like to understand how we should interact with the DataCite API, particularly when performance is an issue. Specifically, it is an option in Dataverse to assign a DOI to individual files within a Dataset, and a Dataset may contain more than 1K files. Dataverse currently tries to register/publish all the DOIs for a Dataset and its DataFiles when a user presses the ‘publish’ button. (We’re moving to register file DOIs when files are uploaded and only adjust metadata/make the DOI findable at publication time.)

The general performance we're seeing in Dataverse is that we can register or publicize ~2 DOIs per second. That's reasonable performance for most uses, but it adds ~10 minutes to the time required to upload or publish Datasets with 1000 files (and we have real cases with at least double that number in the community).

We’re considering a model where DOI updates would be made asynchronously in our code, i.e. the user would click ‘publish’ (or ‘upload’ in the future) and we’d show a success message with a note that DOIs will become public (or just registered for uploads) in the near future, at which point we can start running through a queue of needed changes.

We’d like to understand whether we need to do this only for larger datasets / as a backup if DataCite is busy/down or whether we should do this all the time. That depends somewhat on the performance we can get when making DOI changes.

Some of the required time now (~1/2 sec per DOI) is certainly in Dataverse itself, so it would be useful to know what performance we could/should expect from DataCite in terms of throughput. Do you have such information? Is there a way to increase the speed of multiple operations? Can we parallelize our calls, or would that cause server load issues at DataCite? Are there any bulk or asynchronous APIs available/planned that would help? (Most of our use cases would be for a batch of DOI changes to happen during dataset creation or publication, but the load itself would be fairly intermittent from any given Dataverse instance.) If we move to asynchronous changes, is there a base rate we should target? Is there a way we could/should detect if DataCite is under load and delay/throttle updates?

Their response: The current status is that we do not impose per-account limits; however, there is a top-level hard limit imposed by our firewall, based on IP, of around 3000 requests in a 5-minute window. We are looking to potentially implement account rate limits in the future.

To go through your specific questions:

  1. We don't have specific performance metrics for registering DOIs, so we can't give you a guarantee on throughput.
  2. One way to improve performance is to make fewer API calls: with the REST API you can make one call to register both the metadata and the DOI, as opposed to the MDS API, which needs two.
  3. You can send calls in parallel, being mindful of the top-level rate limit; it's preferable not to hammer our servers, however.
  4. We have no specific plans for bulk APIs at present; however, it has been discussed occasionally, so it's not beyond possibility.
  5. If you move to async then, as I said before, we don't have specific numbers, but bearing in mind our rate limit I'd perhaps try starting with 5 requests per second. If they are all single-call DOI registrations, this would reduce your time down to a few minutes; however, because we don't have throughput information, this is not guaranteed.
  6. If you go over the rate limit, you will not be able to send any more DOIs. At this time we don't have anything more user friendly, though the idea of rate limiting per account would then allow us to give better responses.

So: it seems we should be able to increase our rate if we can parallelize, we should check our use of the MDS vs. REST API to make sure we're making as few calls as possible, and we have an overall rate limit to handle. A rough sketch of what that could look like is below.
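For illustration only, a hedged sketch of throttled single-call REST registration in Java, using Guava's `RateLimiter` at the 5 requests/second DataCite suggested. The payload here is truncated to the fields relevant to the example; a real "publish" call needs the full required DataCite metadata (titles, creators, publisher, etc.), and the DOI, URL, and credentials are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;
import com.google.common.util.concurrent.RateLimiter;

public class DataCiteRestSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Stay well under the ~3000 requests / 5-minute firewall limit (10/s);
    // DataCite suggested starting at 5/s.
    private static final RateLimiter LIMITER = RateLimiter.create(5.0);

    public static void registerAndPublish(String doi, String landingUrl,
                                          String user, String password) throws Exception {
        LIMITER.acquire(); // blocks until a permit is available
        // "event":"publish" registers the metadata and makes the DOI findable in one call,
        // unlike the MDS API, which needs separate metadata and DOI requests.
        // (Payload truncated: a real call must include all required metadata fields.)
        String body = """
                {"data": {"type": "dois", "attributes": {
                  "doi": "%s", "event": "publish", "url": "%s"}}}""".formatted(doi, landingUrl);
        String auth = Base64.getEncoder().encodeToString((user + ":" + password).getBytes());
        HttpRequest req = HttpRequest.newBuilder(URI.create("https://api.datacite.org/dois"))
                .header("Content-Type", "application/vnd.api+json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() == 429) {
            // Over the rate limit: back off and retry rather than dropping the registration.
        }
    }
}
```

At 5 requests/second, 1000 file DOIs would take roughly 3-4 minutes, which matches DataCite's "few minutes" estimate above.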

joeHickson commented 3 years ago

It might be possible to partially resolve this by using an offline queuing mechanism, decoupling file submission from the slow processing steps involved. This would possibly require a new file status of "submitted" or similar. I'm assuming the current process is single-threaded and linear, something like (not having checked the code):

submit file -> store file to temporary disk -> process metadata -> validate file -> convert to alternative format (e.g. csv -> tab) -> persist file data -> move valid file to permanent location -> get doi -> update file data with new doi -> return success message

Instead you could decouple your submission from the final processing, allowing faster response times on the initial submission. As dataset publication is already a non-real-time process (you get locked datasets whilst they are being published), having to wait for all data to be ready before finishing publication shouldn't be a problem for user workflow. E.g.:

submit file -> store file to temporary disk -> process metadata -> persist data -> return success message (submitted, no doi, does have dataverse id as there's a record in the database)

The offline queue then does two things: 1) process the file, doing conversion/relocation and updating metadata as needed; 2) submit a request for a DOI. Once this completes, the file has a DOI in its metadata.

On completion of 1 & 2, ingestion is complete and the file moves from submitted to draft.

Publish then only completes once 1 & 2 have both completed for all files in the dataset, but can be submitted with files still outstanding (on the proviso that all files in a draft dataset resolve; any failures would cause the publication to fail as well).

The processing of 2 is then a separate job that can run with as many workers as the rate limit will allow (see the sketch below). If you max out the rate you don't drop any data submissions to Dataverse, you just backlog the DOI requests (and publication), stopping the workers for x seconds until the rate limit is cleared and then continuing with the backlog. As these are just workers pulling from a job pool they could run from different IP addresses, but that might be getting a bit spammy on the old API. Likewise, if a bulk API becomes available they could pick up multiple submissions to perform at a time, pulling x from the queue at once.
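A rough sketch of that worker model, under the assumptions above: a `BlockingQueue` stands in for whatever persistent job store would actually be used, and `requestDoi` is a hypothetical call that returns false when the rate limit is hit.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DoiWorkerSketch implements Runnable {
    // In a real deployment this would be a persistent queue, so submissions survive restarts.
    static final BlockingQueue<String> JOBS = new LinkedBlockingQueue<>();

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String fileId = JOBS.take();     // block until a DOI job is available
                if (!requestDoi(fileId)) {       // hypothetical call; false = rate limited
                    JOBS.put(fileId);            // re-queue instead of dropping the job
                    TimeUnit.SECONDS.sleep(30);  // pause this worker until the window clears
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stand-in for the real registration call; returns false on a rate-limit (HTTP 429) response.
    boolean requestDoi(String fileId) { return true; }

    public static void main(String[] args) {
        // As many workers as the rate limit allows; all pull from the shared backlog.
        for (int i = 0; i < 4; i++) new Thread(new DoiWorkerSketch()).start();
    }
}
```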

zbenta commented 1 year ago

Hello there, any news on this topic? We are trying to stress test our infrastructure and are using JMeter to make API calls to Dataverse. We are facing a similar situation, where we have 100 users creating datasets and uploading 2 small files to each dataset, and we are seeing wait times of about 600 seconds from the moment a user creates the dataset and uploads the files. We suspect that DataCite is the culprit, and this issue makes us think our suspicions are right.

pdurbin commented 1 year ago

@zbenta not really. I'm somewhat surprised you're seeing performance problems with that scenario. 600 seconds? 10 minutes? That's super slow. I guess with jmeter you could test with the FAKE DOI provider to see if the slowness is from DataCite or not, which I sort of doubt (and @qqmyers agrees, from a quick conversation in Slack).

We do some perf testing with locust before release: https://guides.dataverse.org/en/5.13/developers/testing.html#locust

But I'm not sure if they cover your scenario. Anyway, if it turns out to not be a DataCite thing, please feel free to open a new issue.

zbenta commented 1 year ago

@pdurbin we tried going back to the FAKE DOI provider, removed all the settings for our account, and still we could see communication between our Dataverse instance and DataCite.

Here you can see the output:

[screenshot: captured network traffic between the Dataverse instance and DataCite]

We tried opening the resulting pcap file in Wireshark, but the communication is encrypted and we can't see what they are "chatting" about.

[screenshot: the encrypted traffic shown in Wireshark]

pdurbin commented 1 year ago

@zbenta I believe we got FAKE working for you a few days ago: https://matrix.to/#/!AmypvmJtUjBesRrnLM:matrix.org/$lt7_z6ks7msSTbPmJ-IsM5h2erKSLBbrkDtG-awFD8w?via=matrix.org&via=bertuch.name&via=sztaki.hu

zbenta commented 1 year ago

Yes @pdurbin , thanks for the tip.

qqmyers commented 1 month ago

Low priority now that file PIDs are optional and probably won't be used for datasets with large numbers of files.