gdcc / python-dvuploader

✈️ - Python package for parallel direct upload to Dataverse
MIT License

Fix singlepart direct upload #8

Closed JR-1991 closed 6 months ago

JR-1991 commented 7 months ago

Overview

In issue #7, it was highlighted and discussed that direct upload of a single file (not multipart) to an S3 storage raises a "Not implemented" exception on the AWS side. This issue is related to streaming files when POSTing to the S3 storage. To fix it, the file_sender function has been removed and replaced with a simple open call to upload a file. Additionally, this PR introduces some printing enhancements and allows forcing native upload.
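The core of the fix can be sketched as follows. This is an illustrative sketch, not the package's actual code: `build_put_payload` is a hypothetical name, and it only shows why handing the client an open file works where a chunked generator fails.

```python
# Illustrative sketch -- build_put_payload is NOT the package's API.
import os

def build_put_payload(path):
    """Return (file_object, headers) for a single-part S3 PUT.

    A chunked generator has no known length, so the HTTP client falls
    back to Transfer-Encoding: chunked, which S3 rejects with
    "Not implemented" for single-part uploads. An open file object
    lets the client send an explicit Content-Length instead.
    """
    handle = open(path, "rb")
    headers = {"Content-Length": str(os.path.getsize(path))}
    return handle, headers
```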

Changes

Closes

closes #7

DonRichards commented 7 months ago

That worked!!! Thanks for this.

[Screenshot from 2024-02-12 16-34-09]

DonRichards commented 7 months ago

I haven't tried those updates yet, but I have noticed a significant slowdown with the "Registering files" step.

[Screenshot of the "Registering files" step]

JR-1991 commented 6 months ago

@DonRichards, sorry for the delay in response. Yes, this is a bottleneck, unfortunately. I have tried to extend the maximum concurrency of registration tasks, but it failed. Dataverse likely struggles to process many requests simultaneously and simply errors out if there are too many.

I have added a soft fix for this by allowing requests to be retried upon failure. Although this is not a guaranteed speed-up, it may improve performance slightly. Would you mind trying it out to see if it helps in your case?
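The retry idea can be sketched like this (a hedged illustration, not the package's actual implementation): wrap the registration call and retry it a few times with a short backoff before giving up.

```python
# Hedged sketch of the retry soft fix; names are illustrative.
import time

def call_with_retry(func, attempts=3, base_delay=0.01):
    """Call func(); on failure, wait briefly and retry, re-raising
    the last error once all attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            time.sleep(base_delay * (attempt + 1))  # linear backoff
    raise last_error
```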

If this is still too slow, an option would be to divide your files into multiple tar archives and upload each. This way, there are fewer requests to process.
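A minimal sketch of that workaround, assuming plain tar archives suit your workflow (function and file names are illustrative):

```python
# Sketch: split many files into a few tar archives so Dataverse sees
# a handful of uploads instead of one request per file.
import pathlib
import tarfile

def bundle_into_archives(files, out_dir, batch_size=200):
    """Pack `files` into tar archives of at most `batch_size` each,
    returning the paths of the archives created."""
    out_dir = pathlib.Path(out_dir)
    archives = []
    for i in range(0, len(files), batch_size):
        archive = out_dir / f"batch_{i // batch_size:04d}.tar"
        with tarfile.open(archive, "w") as tar:
            for f in files[i:i + batch_size]:
                tar.add(f, arcname=pathlib.Path(f).name)
        archives.append(archive)
    return archives
```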

DonRichards commented 6 months ago

Any suggestions on how to trace why the registration of files has stopped working suddenly? Is there a way to see what's causing the registration to fail? "An error occurred with uploading: Connector is closed."

JR-1991 commented 6 months ago

@DonRichards this is most likely due to Dataverse shutting down the connection because of too many requests. I am still trying to find a sweet spot, but it varies greatly between instances. You can only trace back the actual error within the logs of your Dataverse instance.

JR-1991 commented 6 months ago

@DonRichards good news! I have talked to the Dataverse Dev Team, and there is a way to register bulk data at Dataverse without requiring a request per file. Hence, the registration is now way faster and more stable.

I have just pushed the changes to this PR and have tested them beforehand with 10k small files locally without any issues. Do you mind testing the updated PR?
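The bulk-registration idea amounts to collecting all per-file metadata into one JSON array and sending it in a single request. In this sketch, the `/addFiles` endpoint name and the exact JSON field names are assumptions drawn from Dataverse's direct-upload documentation; check your instance's API guide before relying on them.

```python
# Sketch of bulk registration. Endpoint and field names are
# assumptions based on Dataverse's direct upload docs, not this
# package's verified internals.
import json

def build_addfiles_body(uploaded):
    """Collect per-file metadata into one jsonData array so a single
    request registers every uploaded file at once."""
    return json.dumps([
        {
            "fileName": item["name"],
            "storageIdentifier": item["storage_id"],
            "checksum": {"@type": "MD5", "@value": item["md5"]},
        }
        for item in uploaded
    ])
```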

DonRichards commented 6 months ago

Tested it with batches of 200 files at a time and it works as expected.

JR-1991 commented 6 months ago

@DonRichards thanks for testing! Does this resolve your issue #7?

DonRichards commented 6 months ago

I do believe so. Thanks! I really appreciate the work.

JR-1991 commented 6 months ago

@DonRichards perfect! Will merge this PR then to close issue #7.