gdcc / pyDataverse

Python module for Dataverse Software (dataverse.org).
http://pydataverse.readthedocs.io/
MIT License

Uploading files larger than 2GB does not work #137

Open pallinger opened 2 years ago

pallinger commented 2 years ago

Bug report

1. Describe your environment

2. Actual behaviour:

Trying to upload a file larger than 2GB causes an error. Uploading the same file using curl works fine.

3. Expected behaviour:

The file should be uploaded, or at least a clear error should be raised saying that the upload cannot work because the file is too big.

4. Steps to reproduce

The program and stack trace are as follows:

from pyDataverse.models import Datafile
from pyDataverse.api import NativeApi

df = Datafile()
api = NativeApi(SERVER_URL, API_KEY)  # SERVER_URL and API_KEY are placeholders for the installation URL and API token
ds_pid = ID_OF_EXISTING_DATASET  # persistent identifier of the target dataset
df_filename = PATH_TO_FILENAME_OF_BIG_FILE  # path to a file larger than 2 GB
df.set({"pid": ds_pid, "filename": df_filename})
api.upload_datafile(ds_pid, df_filename, df.json())

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 1685, in upload_datafile
    url, data={"jsonData": json_str}, files=files, auth=True
  File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 174, in post_request
    resp = post(url, data=data, params=params, files=files)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.7/http/client.py", line 1260, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1069, in _send_output
    self.send(chunk)
  File "/usr/lib/python3.7/http/client.py", line 991, in send
    self.sock.sendall(data)
  File "/usr/lib/python3.7/ssl.py", line 1015, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.7/ssl.py", line 984, in send
    return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes

5. Possible solution

Some possible solutions (a streaming upload or a chunk-encoded request) are described here; a rough sketch follows at the end of this comment:

https://stackoverflow.com/questions/53095132/how-to-upload-chunks-of-a-string-longer-than-2147483647-bytes

I am not very well versed in Python, but I will try to fix this in the coming week and submit a pull request. If I fail, feel free to fix this bug!
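For what it's worth, a minimal sketch of the chunk-encoded request idea from the Stack Overflow link above, against a generic endpoint (the URL and file path are placeholders; the Dataverse native API additionally expects a multipart/form-data body with a jsonData field, so this alone is not a drop-in fix for pyDataverse):

import requests

def read_in_chunks(path, chunk_size=1024 * 1024):
    # Yield the file in 1 MiB pieces so the whole body never sits in memory.
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return
            yield chunk

# Passing a generator to data= makes requests send the body with chunked
# transfer encoding instead of building one giant string.
resp = requests.post(
    "https://example.org/upload",  # placeholder, not a Dataverse endpoint
    data=read_in_chunks("/path/to/big_file.bin"),
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.status_code)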

jmjamison commented 2 years ago

Forgive me if this isn't relevant. For uploading really large files, in my case Lidar data, I use an S3 bucket configured for direct upload. That doesn't work with pyDataverse, but for uploading really large files individually, a direct-upload bucket is helpful.

pallinger commented 2 years ago

I understand that this is not an issue for you. However, if the Dataverse installation in question does not use an S3 storage backend, then this becomes immediately relevant.

skasberger commented 2 years ago

The issue is, I am on parental leave right now (until May 2022), and we at AUSSDA do not use S3, so I cannot test this.

The best way to move forward would be for you to resolve the issue yourselves.

poikilotherm commented 1 year ago

We also just ran into this. From looking at the Dataverse side, uploads using multipart/form-data should be available.

On the sending side, it looks like "requests-toolbelt" has something we could use: https://toolbelt.readthedocs.io/en/latest/uploading-data.html

Maybe it would be good to detect the file size and use a normal upload below 2 GB and a streaming multipart upload for larger files? (A rough sketch of this idea follows below.)

(I don't have the capacity right now to look into this.)
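For illustration, a rough sketch of that size-based switch using requests-toolbelt's MultipartEncoder (the function name, the 2 GB threshold, and the direct call to the native API "add datafile" endpoint are assumptions for this sketch, not pyDataverse's current implementation):

import json
import os

import requests
from requests_toolbelt import MultipartEncoder

TWO_GB = 2 * 1024 ** 3

def upload_datafile_streaming(server_url, api_token, ds_pid, filepath, json_str="{}"):
    url = f"{server_url}/api/datasets/:persistentId/add?persistentId={ds_pid}"
    headers = {"X-Dataverse-key": api_token}
    with open(filepath, "rb") as fh:
        if os.path.getsize(filepath) < TWO_GB:
            # Small file: plain requests multipart upload, as pyDataverse does today.
            return requests.post(
                url,
                data={"jsonData": json_str},
                files={"file": (os.path.basename(filepath), fh)},
                headers=headers,
            )
        # Large file: MultipartEncoder streams the multipart body instead of
        # building it in memory, avoiding the 2 GB OverflowError.
        encoder = MultipartEncoder(fields={
            "jsonData": json_str,
            "file": (os.path.basename(filepath), fh, "application/octet-stream"),
        })
        headers["Content-Type"] = encoder.content_type
        return requests.post(url, data=encoder, headers=headers)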

pdurbin commented 1 year ago

Can this bug be reproduced at https://demo.dataverse.org ? Currently the file upload limit there is 2.5 GB, high enough for a proper test, it would seem.

skasberger commented 1 year ago

Also related to https://github.com/gdcc/pyDataverse/issues/136

skasberger commented 1 year ago

Update: I left AUSSDA, so my funding for pyDataverse development has stopped.

I want to secure some basic funding to implement the most urgent updates (PRs, bug fixes, maintenance work). If you can support this, please reach out to me (www.stefankasberger.at). The same goes for feature requests.

Another option would be for someone else to help with development and/or maintenance. For this, also get in touch with me (or comment here).

poikilotherm commented 1 year ago

I know I should not expect movement here (unless someone else picks it up or we find funding).

But so as not to let newly gained insights slip away, and for what it's worth: how about swapping requests for aiohttp?

I know aiohttp is a much larger dependency, but it does support multipart uploads: https://docs.aiohttp.org/en/stable/multipart.html
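For what it's worth, a rough sketch of such an aiohttp-based multipart upload (the endpoint, header, and helper name are assumptions for illustration, mirroring the Dataverse native API; SERVER_URL, API_KEY, ds_pid, and the file path are placeholders):

import asyncio
import json
import os

import aiohttp

async def upload_with_aiohttp(server_url, api_token, ds_pid, path):
    url = f"{server_url}/api/datasets/:persistentId/add?persistentId={ds_pid}"
    async with aiohttp.ClientSession() as session:
        with open(path, "rb") as fh:
            form = aiohttp.FormData()
            form.add_field("jsonData", json.dumps({"description": "big file"}))
            # Handing aiohttp the file object lets it stream the part
            # chunk by chunk instead of loading the file into memory.
            form.add_field(
                "file", fh,
                filename=os.path.basename(path),
                content_type="application/octet-stream",
            )
            async with session.post(
                url, data=form, headers={"X-Dataverse-key": api_token}
            ) as resp:
                return resp.status, await resp.text()

# asyncio.run(upload_with_aiohttp(SERVER_URL, API_KEY, ds_pid, PATH_TO_BIG_FILE))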

qqmyers commented 1 year ago

Not sure that helps out of the box, since our multipart direct upload involves contacting Dataverse to get signed URLs for the S3 parts, etc. FWIW, I think @landreev implemented our mechanism in Python; it just hasn't been integrated with pyDataverse.

poikilotherm commented 1 year ago

@qqmyers you are right: direct upload needs more. Maybe one day we will also extend pyDataverse for this.

That said, this issue is about uploading via the simple HTTP upload API. As requests cannot stream a multipart upload (it builds the whole body in memory), you are limited to a 2 GB file size (the same limitation as our SWORD 2.0 library). The API endpoint itself is capable of handling multipart uploads.