Closed JR-1991 closed 6 months ago
That worked!!! Thanks for this.
I haven't tried those updates yet, but I have noticed a significant slowdown in the "Registering files" step.
@DonRichards, sorry for the delay in response. Yes, this is a bottleneck, unfortunately. I have tried to extend the maximum concurrency of registration tasks, but it failed. Dataverse likely struggles to process many requests simultaneously and simply errors out if there are too many.
I have added a soft fix for this by allowing requests to be retried upon failure. Although this does not guarantee a speed-up, it might improve performance slightly. Would you mind trying it out to see if it helps in your case?
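To illustrate the retry idea mentioned above, here is a minimal sketch. It is not the code from this PR: the helper name `retry`, its parameters `attempts` and `base_delay`, and the exponential backoff are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Run `call`, retrying with exponential backoff when it raises.

    Hypothetical helper for illustration only; the PR's retry logic may differ.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
            time.sleep(base_delay * 2 ** (attempt - 1))

    raise AssertionError("unreachable")
```

Wrapping each registration request in such a helper gives an overloaded server a moment to recover, which is why it may help without being a guaranteed speed-up.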
If this is still too slow, an option would be to divide your files into multiple tar archives and upload each. This way, there are fewer requests to process.
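The splitting suggested above could look roughly like the sketch below, which uses only Python's standard `tarfile` module. The function name `split_into_tars`, the archive naming scheme, and the default of 1000 files per archive are assumptions for illustration, not part of the uploader.

```python
import tarfile
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]


def split_into_tars(
    files: Sequence[PathLike],
    out_dir: PathLike,
    files_per_archive: int = 1000,
) -> List[Path]:
    """Pack `files` into several uncompressed tar archives (hypothetical helper)."""
    if files_per_archive < 1:
        raise ValueError("files_per_archive must be at least 1")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    archives: List[Path] = []
    for start in range(0, len(files), files_per_archive):
        archive = out_dir / f"part_{start // files_per_archive:04d}.tar"
        with tarfile.open(archive, "w") as tar:
            for file in files[start:start + files_per_archive]:
                # Store only the file name; files with identical names from
                # different directories would collide inside one archive.
                tar.add(file, arcname=Path(file).name)
        archives.append(archive)
    return archives
```

Each resulting archive is then uploaded as a single file, so the number of registration requests drops from one per file to one per archive.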
Any suggestions on how to trace why the registration of files has stopped working suddenly? Is there a way to see what's causing the registration to fail? "An error occurred with uploading: Connector is closed."
@DonRichards this is most likely due to Dataverse shutting down the connection because of too many requests. I am still trying to find a sweet spot, but it varies greatly between instances. The actual error can only be traced in the logs of your Dataverse instance.
@DonRichards good news! I have talked to the Dataverse Dev Team, and there is a way to register files in bulk with Dataverse without requiring a request per file. Hence, the registration is now much faster and more stable.
I have just pushed the changes to this PR, after testing them locally with 10k small files without any issues. Do you mind testing the updated PR?
Tested it with batches of 200 files at a time and it works as expected.
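For reference, processing files in fixed-size batches as described above can be sketched like this. The names `in_batches` and `register_in_batches` are hypothetical, and `register_batch` stands in for whatever call performs the bulk registration; the actual request is deliberately left out.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def in_batches(items: Sequence[T], size: int = 200) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def register_in_batches(
    files: Sequence[T],
    register_batch: Callable[[List[T]], None],
    size: int = 200,
) -> int:
    """Call `register_batch` once per batch and return the number of calls.

    `register_batch` is a placeholder supplied by the caller.
    """
    calls = 0
    for batch in in_batches(files, size):
        register_batch(batch)
        calls += 1
    return calls
```

With 450 files and the default batch size, this makes three calls instead of 450.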
@DonRichards thanks for testing! Does this resolve your issue #7?
I do believe so. Thanks! I really appreciate the work.
@DonRichards perfect! Will merge this PR then to close the issue #7
Overview
In issue #7, it was highlighted and discussed that direct upload of a single file (not multipart) to an S3 storage raises a "Not implemented" exception on the AWS side. This issue is related to streaming files for POSTing to the S3 storage. To tackle this issue, the file_sender function has been removed and replaced with a simple open function to upload a file. Additionally, this PR introduces some printing enhancements and allows forcing native upload.
Changes
open instead of file_sender for file uploads.
Closes
closes #7