pallinger opened this issue 2 years ago
Forgive me if this isn't relevant. For uploading really large files - in my case lidar data - I use an S3 bucket configured for direct upload. That doesn't work with pyDataverse, but for uploading very large files individually, a direct-upload bucket is helpful.
I understand that this may not be relevant for you. However, if the Dataverse installation in question does not use an S3 storage backend, then this becomes instantly relevant.
The issue is, I am on parental leave right now (until May 2022), and we at AUSSDA do not use S3, so I cannot test this.
The best way to move forward would be for you to resolve the issue yourselves.
We also just ran into this. From looking at the Dataverse side, uploads using multipart/form-data should be available.
For the sending side, it looks like "requests-toolbelt" has something we could use: https://toolbelt.readthedocs.io/en/latest/uploading-data.html
Maybe it would be good to detect the file size and use a normal upload for files under 2 GB and a multipart/streamed upload for larger ones?
(I don't have the capacity right now to look into this.)
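For what it's worth, a rough sketch of that idea: stream the multipart body with requests-toolbelt's MultipartEncoder so the file is never buffered in memory. The helper names, the threshold constant, and the use of the native `/api/datasets/:persistentId/add` endpoint are my assumptions, not existing pyDataverse API:

```python
import os

import requests

# requests buffers a non-streamed body in memory, and bodies of
# 2 GiB or more overflow the underlying write call - hence this threshold.
STREAM_THRESHOLD = 2 * 1024**3


def needs_streaming(filepath, threshold=STREAM_THRESHOLD):
    """Return True if the file is large enough to need a streamed upload."""
    return os.path.getsize(filepath) >= threshold


def add_datafile(base_url, api_token, dataset_pid, filepath):
    """Upload one file to the Dataverse native API (hypothetical helper)."""
    url = f"{base_url}/api/datasets/:persistentId/add"
    params = {"persistentId": dataset_pid}
    headers = {"X-Dataverse-key": api_token}
    filename = os.path.basename(filepath)

    if not needs_streaming(filepath):
        # Small file: plain multipart upload, body held in memory.
        with open(filepath, "rb") as f:
            return requests.post(url, params=params, headers=headers,
                                 files={"file": (filename, f)})

    # Large file: stream the multipart body with requests-toolbelt.
    # Imported lazily so the small-file path works without the extra dependency.
    from requests_toolbelt.multipart.encoder import MultipartEncoder

    encoder = MultipartEncoder(
        fields={"file": (filename, open(filepath, "rb"), "application/octet-stream")}
    )
    headers["Content-Type"] = encoder.content_type
    return requests.post(url, params=params, headers=headers, data=encoder)
```

The size check keeps the common small-file path dependency-free; only uploads at or above the threshold require requests-toolbelt.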
Can this bug be reproduced at https://demo.dataverse.org ? Currently the file upload limit there is 2.5 GB, high enough for a proper test, it would seem.
Also related to https://github.com/gdcc/pyDataverse/issues/136
Update: I left AUSSDA, so my funding for pyDataverse development has stopped.
I want to get some basic funding to implement the most urgent updates (PRs, Bug fixes, maintenance work). If you can support this, please reach out to me. (www.stefankasberger.at). If you have feature requests, the same.
Another option would be that someone else helps with the development and/or maintenance. For this, also get in touch with me (or comment here).
I know I shall not expect movement here (unless someone else picks it up or we find funding).
But to not let newly found insights slip away, and for what it's worth: how about exchanging requests for aiohttp?
I know aiohttp is a much larger dependency, but it does support multipart uploads. https://docs.aiohttp.org/en/stable/multipart.html
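As a sketch of how that could look (the coroutine name and parameters are mine, not an existing pyDataverse interface): aiohttp streams a file object added to FormData chunk by chunk, so there is no 2 GB in-memory limit as with requests.

```python
import asyncio
import os

import aiohttp


async def add_datafile(base_url, api_token, dataset_pid, filepath):
    """Upload one file to the Dataverse native API with aiohttp (hypothetical helper)."""
    form = aiohttp.FormData()
    # aiohttp reads the file object in chunks while sending,
    # instead of buffering the whole multipart body in memory.
    form.add_field("file", open(filepath, "rb"),
                   filename=os.path.basename(filepath),
                   content_type="application/octet-stream")
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{base_url}/api/datasets/:persistentId/add",
            params={"persistentId": dataset_pid},
            headers={"X-Dataverse-key": api_token},
            data=form,
        ) as resp:
            return resp.status, await resp.text()

# Usage (untested against a live server):
# asyncio.run(add_datafile("https://demo.dataverse.org", token, pid, "big.laz"))
```

The trade-off is that every pyDataverse call site would become async, which is a larger API change than swapping the transport alone.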
Not sure that helps out-of-the-box since our multipart direct upload involves contacting Dataverse to get signed URLs for the S3 parts, etc. FWIW, I think @landreev implemented our mechanism in python, it just hasn't been integrated with pyDataverse.
@qqmyers you are right - direct upload needs more. Maybe one day we will also extend pyDataverse for this.
That said: this issue here is about uploading via the simple HTTP upload API. As requests is not capable of streaming multipart uploads, you are limited to a 2 GB file size (the same limitation as our SWORD 2.0 library). The API endpoint itself is capable of handling multipart uploads.
Bug report
1. Describe your environment
2. Actual behaviour:
Trying to upload a file larger than 2 GB causes an error, while uploading the same file using curl works fine.
3. Expected behaviour:
The file should be uploaded - or at least a clear error should state that the file is too big.
4. Steps to reproduce
The program and stack trace are as follows:
5. Possible solution
Some possible solutions (streaming upload or a chunk-encoded request) are described here:
https://stackoverflow.com/questions/53095132/how-to-upload-chunks-of-a-string-longer-than-2147483647-bytes
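Of those, the chunk-encoded variant is the simplest to sketch: passing a generator as the request body makes requests send it with Transfer-Encoding: chunked, so the file is never fully loaded into memory. The function name and chunk size below are arbitrary choices of mine:

```python
def file_chunks(filepath, chunk_size=1024 * 1024):
    """Yield a file in 1 MiB pieces so it is never fully loaded into memory."""
    with open(filepath, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

# Passing a generator makes requests use Transfer-Encoding: chunked:
# requests.post(url, data=file_chunks("pointcloud.laz"), headers=headers)
```

Note that the native API expects a multipart/form-data body, so in practice the chunked stream would still need to wrap a multipart encoding (which is essentially what requests-toolbelt's MultipartEncoder does).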
I am not very versed in Python, but I will try to fix this in the following week and submit a pull request. If I fail, feel free to fix this bug!