Duke-GCB / D4S2

Web service to facilitate notification and transfer of projects in DukeDS
MIT License

Large projects fail to download with download service #222

Closed · dleehr closed this issue 3 years ago

dleehr commented 4 years ago

See RT6792

dleehr commented 4 years ago

Lots of good info from previous timeouts in #212

dleehr commented 4 years ago

The user first reported that on the datadelivery download page, "Calculating" never completes. This is because the project summary endpoint (`/api/v2/duke-ds-projects/:project_id/summary`) is timing out. In the server logs I see messages like `[2020-02-17 10:38:41 -0500] [15] [CRITICAL] WORKER TIMEOUT (pid:32)`.

This error appeared in #212 for zip downloads. We addressed it only on the download service (https://github.com/Duke-GCB/gcb-ansible-roles/pull/70) by disabling proxy buffering and using gevent workers. These configurations were not generally necessary for simple API calls, but made sense for long-running downloads. We may need to apply them to the base app as well.

This is configured in the ansible role - the worker config is an env variable in the task:

https://github.com/Duke-GCB/gcb-ansible-roles/blob/837bec8a9b8f4801025613b84d18c427e9b263f2/d4s2_webapp/tasks/main.yml#L25-L49

and the proxy buffering is in the nginx config:

https://github.com/Duke-GCB/gcb-ansible-roles/blob/837bec8a9b8f4801025613b84d18c427e9b263f2/d4s2_webapp/files/nginx.conf#L45
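
Roughly, the settings in question look like the following (illustrative only: 3600 matches the timeout value discussed below, and the exact env var name the role uses may differ):

        # gunicorn side (env var read when the container starts the app):
        GUNICORN_CMD_ARGS="--worker-class=gevent --timeout=3600"

        # nginx side (in the location block that proxies to gunicorn):
        proxy_buffering off;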

dleehr commented 4 years ago

There's also the matter of the actual download process, which I haven't yet tested. I've requested download access on this project and can test on dev.

dleehr commented 4 years ago

Exploration in https://github.com/Duke-GCB/D4S2/tree/fixing-40k-files. Some notes from local docker-compose testing on a wired network:

So we might be able to get projects of this size working if we raise the timeouts to 3600 and implement a token expiration check/refresh. But it still takes 30 minutes for a download to start, the download itself takes hours, and it can't be resumed.

johnbradley commented 4 years ago

The recent DukeDSClient changes to simplify downloading will fix the slow startup time problem. Refreshing the DukeDS tokens when they expire seems like something we could also fix. I'm not sure how we would support resuming a download.

johnbradley commented 4 years ago

Looks like we still have a startup time problem after using the download generator. We currently build a zip file that streams the individual file contents using generators, but all files must first be registered to build the ZipFile: https://github.com/Duke-GCB/D4S2/blob/005f24ffeab0ff4b62982bc3d2fe7512e69c0db9/download_service/zipbuilder.py#L128-L134

When using the download generator we also create DDS URLs prematurely, and have to recreate them once they expire.
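
A condensed sketch of the current shape of the problem (names are illustrative, not the actual zipbuilder code; `fetch_chunks` stands in for whatever streams a file's bytes from DDS):

        import zipstream  # zipstream-new

        def build_zip(dds_files):
            z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
            # Every entry (and, in our case, its DDS download URL) must be
            # registered up front...
            for f in dds_files:
                z.write_iter(f.path, fetch_chunks(f))
            # ...because no bytes are produced until here, after the loop
            # over all ~40k files has finished. Hence the slow startup.
            yield from z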

ZipFile comes from zipstream-new. Version 1.1.6 added a `z.flush()` method that may help with this problem: https://github.com/arjan-s/python-zipstream/pull/1

Example from their README:

        import os
        import zipstream  # zipstream-new >= 1.1.6 (adds flush())

        # (inside a generator function)
        z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
        ...
        for filename in os.listdir('/path/to/files'):
            z.write(os.path.join('/path/to/files', filename), arcname=filename)
            yield from z.flush()
        yield from z
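
With `flush()`, each file's bytes are yielded as soon as that file is written, so entries (and their DDS URLs) could be created lazily, interleaved with the streaming, instead of all up front. The view side shouldn't need to change much; a minimal sketch, assuming a generator built like the above (`zip_generator` and the filename are illustrative):

        from django.http import StreamingHttpResponse

        def download_view(request, project_id):
            # zip_generator registers each entry and flushes its bytes as it
            # goes, so the response starts streaming with the first file.
            response = StreamingHttpResponse(
                zip_generator(project_id), content_type='application/zip')
            response['Content-Disposition'] = 'attachment; filename="project.zip"'
            return response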

johnbradley commented 4 years ago

The download for the example project that took 35 minutes to start now begins downloading immediately. There is still a 2-hour limit on downloads because we do not automatically refresh the DDS token using the OAuth token.
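
The shape of a fix is probably something like the following, checked before creating each file's download URL (the helper names here are hypothetical, not an actual API):

        def get_valid_dds_token(oauth_token, dds_token):
            # Hypothetical helper: exchange the long-lived OAuth token for a
            # fresh short-lived DDS token whenever the current one is close
            # to expiring, so downloads longer than 2 hours keep working.
            if dds_token is None or dds_token.expires_soon():
                dds_token = exchange_oauth_for_dds_token(oauth_token)
            return dds_token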

johnbradley commented 3 years ago

The 2-hour download limit should no longer be a problem. I see no simple solution for resuming a download after an interruption. For large projects like the above example, the best solution is (still) to use DukeDSClient.
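
For anyone finding this later, that looks something like:

        pip install DukeDSClient
        ddsclient download -p <project_name>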

Marking this closed since we have improved this as much as possible given the current design.