dolthub / dolthub-issues

Issues for dolthub.com
https://dolthub.com

Large CSV downloads time out #303

Closed noamross closed 1 year ago

noamross commented 1 year ago

Downloading large CSVs, whether via the browser or programmatically with an API token, times out for large tables:

 curl -s https://www.dolthub.com/csv/dolthub/museum-collections/main/objects | pv > objects.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9348M    0 9348M    0     0  10.2M      0 --:--:--  0:15:13 --:--:-- 7274k
curl: (18) transfer closed with outstanding read data remaining

This behavior occurs via the browser or the command line, on private or public repositories.

I realize this might be expected behavior due to limits, but it is not documented at https://docs.dolthub.com/concepts/dolthub/api. If it is a limit, the workaround may be to make paginated API calls and convert the JSON to CSV as necessary, or to fully clone the database. That said, the timeout is a barrier to using DoltHub to distribute data to collaborators or more broadly. (We discovered this after pointing a colleague to our DoltHub repo to share our data.)
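The paginated workaround mentioned above could be sketched roughly as below. Note this is a hypothetical sketch: the endpoint path, query parameter, and the assumption that the response carries a `rows` list of objects should all be checked against the DoltHub API docs before relying on it.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint shape; verify the actual DoltHub SQL API path,
# parameters, and response fields against the documentation.
API = "https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}"

def fetch_page(owner, repo, branch, table, limit, offset):
    """Fetch one page of rows, assuming the response carries a 'rows' list."""
    query = f"SELECT * FROM `{table}` LIMIT {limit} OFFSET {offset}"
    url = (API.format(owner=owner, repo=repo, branch=branch)
           + "?q=" + urllib.parse.quote(query))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("rows", [])

def rows_to_csv(rows):
    """Convert a list of row dicts to CSV text; header comes from the first row."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

One caveat: `LIMIT ... OFFSET ...` pagination gets slow on very large tables, since each page re-scans the skipped rows; keyset pagination (`WHERE id > last_seen_id`) is usually a better fit for a multi-gigabyte export.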

Interestingly, the ZIP file of the museums database, which is smaller (1.6 GB) since this database contains only one table, downloaded fine. So perhaps this is a total-time or size limit rather than a query limit, and could be addressed by providing compressed CSVs?

cc @emmamendelsohn

reltuk commented 1 year ago

@noamross Thanks for this bug report. We were able to reproduce the observed behavior.

We found a stream idle timeout combined with some fairly aggressive response buffering in our routing infrastructure. The end result is that a large portion of the response can get buffered and then streamed to the requesting client over time; meanwhile the upstream, which has already sent that portion of the response and is waiting for the flow-control window to open back up, hits the stream idle timeout.

We have reduced the internal buffering, which was not intended behavior, and we have increased the stream idle timeout.

Things work better for me now in my local testing.

It's worth noting, regarding this: our infrastructure doesn't have very lenient connection draining policies during things like deployments, so it's still possible a request that runs for 15 minutes will see spurious disconnects. But they definitely shouldn't be deterministic now.

Maybe a feature request would be to add exporting a table as CSV to S3, which could be done in parallel and would provide a resumable download URL. Another option would be adding functionality to the dolt CLI to dump a table in CSV form directly from a DoltHub repository; that could also support resumption.
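Until something like that exists, a client-side mitigation for spurious disconnects is to resume the transfer with an HTTP Range request. A minimal sketch, assuming the server honors `Range` headers (which is not confirmed for this endpoint):

```python
import os
import urllib.request

def range_header(existing_bytes):
    """Build a Range header asking for the bytes we don't have yet."""
    return {"Range": f"bytes={existing_bytes}-"}

def resume_download(url, path, chunk=1 << 20):
    """Append the remainder of `url` to `path`, resuming from its current size.

    Assumes the server supports Range requests; if it doesn't, the response
    will start from byte 0 and the file must be restarted from scratch.
    """
    start = os.path.getsize(path) if os.path.exists(path) else 0
    req = urllib.request.Request(url, headers=range_header(start))
    with urllib.request.urlopen(req) as resp, open(path, "ab") as out:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            out.write(block)
```

In practice this would be wrapped in a retry loop, calling `resume_download` again after each disconnect until the file reaches the expected size.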

For now I will close this issue, but feel free to re-open if you still see this behavior.