dleehr closed this issue 3 years ago.
Lots of good info from previous timeouts in #212
The user first reported that on the datadelivery download page, "Calculating" never completes. This is because the project summary endpoint (/api/v2/duke-ds-projects/:project_id/summary) is timing out. In the server logs I see messages like:

    [2020-02-17 10:38:41 -0500] [15] [CRITICAL] WORKER TIMEOUT (pid:32)
This error appeared in #212 for zip downloads. We addressed it only on the download service (https://github.com/Duke-GCB/gcb-ansible-roles/pull/70) by disabling proxy buffering and using gevent workers. These configurations were not generally necessary for simple API calls, but made sense for long-running downloads. We may need to apply them to the base app as well.
This is configured in the ansible role: the worker config is an environment variable in the task, and the proxy buffering is in the nginx config (sketched below).
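As a rough illustration only (the actual variable names and values live in the gcb-ansible-roles repo and may differ), the two settings look something like this:

    # Ansible task environment (hypothetical values): gunicorn reads
    # GUNICORN_CMD_ARGS at startup, so the worker class and timeout can
    # be changed without rebuilding the image
    environment:
      GUNICORN_CMD_ARGS: "--worker-class gevent --timeout 3600"

    # nginx config for the app: stop nginx from buffering the upstream
    # response so long streaming downloads flow straight through
    proxy_buffering off;
    proxy_read_timeout 3600s;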
There's also the matter of the actual download process, which I haven't yet tested. I've requested download access on this project and can test on dev.
Exploration in https://github.com/Duke-GCB/D4S2/tree/fixing-40k-files. Some notes from local docker-compose testing on a wired network:
So we might be able to get projects of this size working if we raise the timeouts to 3600 seconds and implement a token expiration check/refresh. But it still takes 30 minutes to start a download that will then run for hours and cannot be resumed.
The recent DukeDSClient changes to simplify downloading will fix the slow startup time problem. Refreshing the DukeDS tokens when they expire seems like something we could also fix. I'm not sure how we would support resuming a download.
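For the token refresh, a minimal sketch of what a check/refresh could look like is below. The names here (`fetch_dds_token`, the expiry handling) are assumptions for illustration, not the actual D4S2 or DukeDSClient API:

    import time

    TOKEN_EXPIRY_MARGIN = 60  # refresh this many seconds before the recorded expiry

    class DDSTokenManager:
        # Hypothetical helper: keeps a DukeDS API token fresh by re-exchanging
        # the OAuth token before the current DDS token expires.
        def __init__(self, oauth_token, fetch_dds_token):
            self.oauth_token = oauth_token
            # fetch_dds_token: callable taking the OAuth token and returning
            # (dds_token, expires_at_unix_timestamp)
            self.fetch_dds_token = fetch_dds_token
            self.token = None
            self.expires_at = 0

        def get_token(self):
            # Refresh when we have no token yet or it is about to expire
            if self.token is None or time.time() > self.expires_at - TOKEN_EXPIRY_MARGIN:
                self.token, self.expires_at = self.fetch_dds_token(self.oauth_token)
            return self.token

Long-running download code would then call get_token() before each DDS request instead of holding a single token for the whole download.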
Looks like we still have a startup time problem after using the download generator. We currently build a zip file that streams the individual file contents using generators, but every file in the project must first be enumerated to build the ZipFile: https://github.com/Duke-GCB/D4S2/blob/005f24ffeab0ff4b62982bc3d2fe7512e69c0db9/download_service/zipbuilder.py#L128-L134 When using the download generator we also create DDS URLs prematurely and have to recreate them once they expire.
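One way around the premature-URL problem would be to resolve each file's signed URL lazily, inside the generator that streams that file, so the URL is created just before it is used. This is only a sketch; `get_download_url` and the chunked fetch are assumptions about the DDS client, not its real API:

    import requests

    def stream_file(dds_client, file_id, chunk_size=1024 * 1024):
        # Resolve the signed DDS URL only when this generator is first
        # consumed, so it cannot expire while earlier files are streaming
        url = dds_client.get_download_url(file_id)
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_content(chunk_size=chunk_size):
                yield chunk

A generator like this could be handed to the zip builder per file so that no URLs exist until their file is actually reached.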
ZipFile comes from zipstream-new. Version 1.1.6 adds a z.flush() call that may help with this problem: https://github.com/arjan-s/python-zipstream/pull/1 Example from their README:
    import os
    import zipstream  # zipstream-new, which provides the streaming ZipFile

    def zip_generator():
        z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
        ...  # other setup omitted in their example
        for filename in os.listdir('/path/to/files'):
            z.write(os.path.join('/path/to/files', filename), arcname=filename)
            # flush() yields the archive bytes produced so far, per file
            yield from z.flush()
        yield from z
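To serve a generator like the one above, the Django side would stream it rather than materialize it. A minimal sketch follows (the view name, filename, and wiring are assumptions, not D4S2's actual view):

    from django.http import StreamingHttpResponse

    def zip_download_view(request):
        # Stream bytes to the client as the generator produces them instead
        # of building the whole zip in memory or on disk first
        response = StreamingHttpResponse(zip_generator(), content_type='application/zip')
        response['Content-Disposition'] = 'attachment; filename="project.zip"'
        return response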
The download for the example project that previously took 35 minutes to start now begins immediately. There is still a 2 hour limit on downloads because the DDS token is not automatically refreshed using the OAuth token.
The 2 hour download limit should no longer be a problem. I see no simple solution to support resuming a download after an interruption. For large projects like the above example, the best solution is (still) to use DukeDSClient.
Marking this closed since we have improved this as much as possible given the current design.
See RT6792