IQSS / dataverse

Open source research data repository software
http://dataverse.org

retry upload if it failed? #9696

Open · alejandratenorio opened this issue 1 year ago

alejandratenorio commented 1 year ago

Dear Dataverse Support,

What steps does it take to reproduce the issue?

You start uploading a large file, then your network goes down for a short time.

[screenshot attached]


pdurbin commented 1 year ago

@alejandratenorio hi! We don't have a great solution for resuming a failed file upload. We did add support for rsync, but we're probably going to remove it or at least deprecate it.

Do you happen to store your files on S3? I'm asking because there's a feature we call S3 direct upload where the files travel from the user's computer directly to S3 instead of passing through Dataverse.
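For context, with S3 direct upload the flow is roughly: ask Dataverse for a presigned URL, PUT the bytes straight to S3, then register the file with the dataset. Here is a rough sketch of that flow in Python with `requests` (the server URL, DOI, and token are placeholders, and the exact `jsonData` fields follow the S3 direct upload API section of the guides and can vary by version):

```python
import hashlib
import json
import os

import requests

SERVER = "https://demo.dataverse.org"   # placeholder installation URL
PID = "doi:10.5072/FK2/EXAMPLE"         # placeholder dataset DOI
API_TOKEN = "xxxx-xxxx-xxxx"            # placeholder API token
HEADERS = {"X-Dataverse-key": API_TOKEN}

def direct_upload(path):
    size = os.path.getsize(path)

    # 1. Ask Dataverse for a presigned S3 URL (single-part case; very large
    #    files get a multipart response with several URLs instead).
    r = requests.get(
        f"{SERVER}/api/datasets/:persistentId/uploadurls",
        params={"persistentId": PID, "size": size},
        headers=HEADERS,
    )
    r.raise_for_status()
    info = r.json()["data"]

    # 2. Send the bytes straight to S3; they never pass through the
    #    Dataverse server.
    with open(path, "rb") as fh:
        requests.put(
            info["url"], data=fh, headers={"x-amz-tagging": "dv-state=temp"}
        ).raise_for_status()

    # 3. Register the uploaded object with the dataset. Field names here
    #    follow the direct upload guide; checksum handling may differ by
    #    version and installation settings.
    with open(path, "rb") as fh:
        md5 = hashlib.md5(fh.read()).hexdigest()
    json_data = {
        "storageIdentifier": info["storageIdentifier"],
        "fileName": os.path.basename(path),
        "mimeType": "application/octet-stream",
        "checksum": {"@type": "MD5", "@value": md5},
    }
    requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers=HEADERS,
        files={"jsonData": (None, json.dumps(json_data))},
    ).raise_for_status()
```

The point for retries is that each file is an independent PUT-then-register call, so a file whose transfer was interrupted can simply be attempted again without redoing the whole batch.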

alejandratenorio commented 1 year ago

Hi @pdurbin,

Thanks for your response. Yes, we store on S3 and have enabled S3 direct upload. We upload many files simultaneously per dataset and usually don't have problems. But when our network goes down, we have to restart the upload manually.

We will be upgrading our Dataverse installation soon. We have an increasing need to upload more and more files per dataset and had been considering rsync as a solution, but if it is going to be removed, what other tool could we use to facilitate file uploads?

Thanks,

pdurbin commented 1 year ago

Hmm. Another option might be Globus.

I checked with the team and @qqmyers had this to say (thanks, Jim):

"Globus does do retries, not sure when it does partial retries (not resending bytes that made it)."

Here's a handy link to the docs: https://guides.dataverse.org/en/5.13/developers/big-data-support.html#globus-file-transfer
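This isn't the Dataverse-managed flow (that is driven from the dataset page as described in the guide above), but for a feel of the retry/sync behavior Jim mentions, a minimal sketch with the Globus Python SDK looks something like this (endpoint IDs, paths, and the token are placeholders):

```python
import globus_sdk

# Placeholders: a real script would obtain these via a Globus OAuth flow.
TRANSFER_TOKEN = "REPLACE_ME"
SOURCE_ENDPOINT = "source-endpoint-uuid"
DEST_ENDPOINT = "destination-endpoint-uuid"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# "checksum" sync: if the transfer is submitted again after a failure,
# files that already match on the destination are not re-sent.
tdata = globus_sdk.TransferData(
    tc,
    SOURCE_ENDPOINT,
    DEST_ENDPOINT,
    label="dataset upload",
    sync_level="checksum",
)
tdata.add_item("/local/data/", "/remote/staging/", recursive=True)

task = tc.submit_transfer(tdata)
print("submitted task", task["task_id"])
```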

Another workaround might be to keep the files as zips. But there are tradeoffs. 😬

@ErykKul was recently talking about a dataset with thousands of files in #9558. Maybe he has some thoughts.

You could also ask at https://groups.google.com/g/dataverse-community of course! 😄

qqmyers commented 1 year ago

FWIW: If the issue is failures where whole files have been uploaded (and not partial files), a tool like the DVUploader might help - it can be run repeatedly to upload n files at a time (versus trying to upload all files in a long list at once). If that works, it may not be too hard to add a similar limit in the UI direct upload and the dvwebloader plugin. (Those both try to register all files with Dataverse at once since that is most efficient, but they could be changed to push every n files. This could raise the issue of trying to explain partial successes in the UI - the dvwebloader might be better there since it can already detect and show when files on your disk already exist in the dataset.) In any case, those types of changes would require some programming work, but the DVUploader could be scripted today with the current Dataverse release.
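To make the "upload n files at a time and rerun on failure" idea concrete, here is a rough sketch built on the native API rather than DVUploader itself (server URL, DOI, token, and directory are placeholders; DVUploader packages essentially the same rerun-and-skip-existing logic):

```python
import json
import os

import requests

SERVER = "https://demo.dataverse.org"   # placeholder installation URL
PID = "doi:10.5072/FK2/EXAMPLE"         # placeholder dataset DOI
API_TOKEN = "xxxx-xxxx-xxxx"            # placeholder API token
HEADERS = {"X-Dataverse-key": API_TOKEN}
BATCH_SIZE = 20                         # "n files at a time"

def already_uploaded():
    """Return the set of file names already in the draft version."""
    r = requests.get(
        f"{SERVER}/api/datasets/:persistentId/versions/:draft/files",
        params={"persistentId": PID},
        headers=HEADERS,
    )
    r.raise_for_status()
    return {f["dataFile"]["filename"] for f in r.json()["data"]}

def upload(path):
    """Add one file through the regular native API (not direct upload)."""
    with open(path, "rb") as fh:
        r = requests.post(
            f"{SERVER}/api/datasets/:persistentId/add",
            params={"persistentId": PID},
            headers=HEADERS,
            files={"file": (os.path.basename(path), fh)},
            data={"jsonData": json.dumps({"description": ""})},
        )
    r.raise_for_status()

def main(directory):
    done = already_uploaded()
    todo = [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))
            if name not in done]
    # Push one batch; if the network drops, rerun the script and it picks
    # up where it left off instead of starting the whole list over.
    for path in todo[:BATCH_SIZE]:
        try:
            upload(path)
            print("uploaded", path)
        except requests.RequestException as err:
            print("failed", path, err)
            break

if __name__ == "__main__":
    main("/path/to/files")   # placeholder directory
```

Run it repeatedly (cron, a shell loop, or by hand) until `todo` is empty; files already registered in the draft are skipped on each pass.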