Duke-GCB / D4S2

Web service to facilitate notification and transfer of projects in DukeDS
MIT License

Large projects fail to download with download service #222

Closed · dleehr closed this issue 3 years ago

dleehr commented 4 years ago

See RT6792

dleehr commented 4 years ago

Lots of good info from previous timeouts in #212

dleehr commented 4 years ago

The user first reported that on the datadelivery download page, "Calculating" never completes. This is because the project summary endpoint (`/api/v2/duke-ds-projects/:project_id/summary`) is timing out. In the server logs I see messages like `[2020-02-17 10:38:41 -0500] [15] [CRITICAL] WORKER TIMEOUT (pid:32)`.

This error appeared in #212 for zip downloads. We addressed it only on the download service (https://github.com/Duke-GCB/gcb-ansible-roles/pull/70) by disabling proxy buffering and using gevent workers. These configurations were not generally necessary for simple API calls, but made sense for long-running downloads. We may need to apply them to the base app as well.

This is configured in the ansible role - the worker config is an env variable in the task:

https://github.com/Duke-GCB/gcb-ansible-roles/blob/837bec8a9b8f4801025613b84d18c427e9b263f2/d4s2_webapp/tasks/main.yml#L25-L49

and the proxy buffering is in the nginx config:

https://github.com/Duke-GCB/gcb-ansible-roles/blob/837bec8a9b8f4801025613b84d18c427e9b263f2/d4s2_webapp/files/nginx.conf#L45
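
Roughly, the settings in question look like the following (illustrative only: 3600 matches the timeout value discussed below, and the exact env var name the role uses may differ):

        # gunicorn side (env var read when the container starts the app):
        GUNICORN_CMD_ARGS="--worker-class=gevent --timeout=3600"

        # nginx side (in the location block that proxies to gunicorn):
        proxy_buffering off;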

dleehr commented 4 years ago

There's also the matter of the actual download process, which I haven't yet tested. I've requested download access on this project and can test on dev.

dleehr commented 4 years ago

Exploration in https://github.com/Duke-GCB/D4S2/tree/fixing-40k-files. Some notes from local docker-compose testing on a wired network:

So we might be able to get projects of this size working if we raise the timeouts to 3600 and implement a token expiration check/refresh. But it still takes 30 minutes for a download to start, the download itself takes hours, and it can't be resumed.

johnbradley commented 4 years ago

The recent DukeDSClient changes to simplify downloading will fix the slow startup time problem. Refreshing the DukeDS tokens when they expire seems like something we could also fix. I'm not sure how we would support resuming a download.

johnbradley commented 4 years ago

Looks like we still have a startup time problem after using the download generator. We currently build a zip file that streams the individual file contents using generators, but all files must first be registered to build the ZipFile: https://github.com/Duke-GCB/D4S2/blob/005f24ffeab0ff4b62982bc3d2fe7512e69c0db9/download_service/zipbuilder.py#L128-L134

When using the download generator we also create DDS URLs prematurely, and have to recreate them once they expire.
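
A condensed sketch of the current shape of the problem (names are illustrative, not the actual zipbuilder code; `fetch_chunks` stands in for whatever streams a file's bytes from DDS):

        import zipstream  # zipstream-new

        def build_zip(dds_files):
            z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
            # Every entry (and, in our case, its DDS download URL) must be
            # registered up front...
            for f in dds_files:
                z.write_iter(f.path, fetch_chunks(f))
            # ...because no bytes are produced until here, after the loop
            # over all ~40k files has finished. Hence the slow startup.
            yield from z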

ZipFile comes from zipstream-new. Version 1.1.6 added a `z.flush()` method that may help with this problem: https://github.com/arjan-s/python-zipstream/pull/1

Example from their README:

        import os
        import zipstream  # zipstream-new >= 1.1.6 (adds flush())

        # (inside a generator function)
        z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
        ...
        for filename in os.listdir('/path/to/files'):
            z.write(os.path.join('/path/to/files', filename), arcname=filename)
            yield from z.flush()
        yield from z
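
With `flush()`, each file's bytes are yielded as soon as that file is written, so entries (and their DDS URLs) could be created lazily, interleaved with the streaming, instead of all up front. The view side shouldn't need to change much; a minimal sketch, assuming a generator built like the above (`zip_generator` and the filename are illustrative):

        from django.http import StreamingHttpResponse

        def download_view(request, project_id):
            # zip_generator registers each entry and flushes its bytes as it
            # goes, so the response starts streaming with the first file.
            response = StreamingHttpResponse(
                zip_generator(project_id), content_type='application/zip')
            response['Content-Disposition'] = 'attachment; filename="project.zip"'
            return response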

johnbradley commented 4 years ago

The download for the example project that took 35 minutes to start now begins downloading immediately. There is still a 2-hour limit on downloads because we do not automatically refresh the DDS token using the OAuth token.
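
The shape of a fix is probably something like the following, checked before creating each file's download URL (the helper names here are hypothetical, not an actual API):

        def get_valid_dds_token(oauth_token, dds_token):
            # Hypothetical helper: exchange the long-lived OAuth token for a
            # fresh short-lived DDS token whenever the current one is close
            # to expiring, so downloads longer than 2 hours keep working.
            if dds_token is None or dds_token.expires_soon():
                dds_token = exchange_oauth_for_dds_token(oauth_token)
            return dds_token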

johnbradley commented 3 years ago

The 2-hour download limit should no longer be a problem. I see no simple solution for resuming a download after an interruption. For large projects like the above example, the best solution is (still) to use DukeDSClient.
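
For anyone finding this later, that looks something like:

        pip install DukeDSClient
        ddsclient download -p <project_name>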

Marking this closed since we have improved this as much as possible given the current design.