Refactor zip archive processing to use dynamic zip generation and streaming #104

servilla commented 1 year ago

When a user requests a zip archive file, the current process first checks whether the zip file exists in a cache. If it does, streaming begins immediately. If it does not, the zip archive file must be created before streaming can begin. This means that the first user to request a given archive pays the price of a long wait while the zip file is created. This is not critical for small volumes of data, but an archive of multiple GBs may cause that first request to time out. In addition, the cached zip archive files require additional disk storage.

For these reasons, we should refactor the workflow from one that stores cached versions of the zip archive file to one where the zip archive is created dynamically and streamed back to the user in real time. We assume the dynamic compression will incur a small overhead but do not believe it will be noticeable to users.
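
As a rough illustration of the idea (not the actual PASTA implementation), the sketch below writes zip entries directly to an output stream with `java.util.zip.ZipOutputStream`, so no archive file is created on disk. The class name `ZipStreamSketch`, the method `streamZip`, and the map of entry names to input streams are hypothetical and stand in for however the service gathers the package contents.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipStreamSketch {

    // Hypothetical helper: compresses each named resource straight into
    // the given output stream; nothing is cached or written to disk.
    public static void streamZip(Map<String, InputStream> resources, OutputStream out)
            throws IOException {
        ZipOutputStream zip = new ZipOutputStream(out);
        byte[] buffer = new byte[8192];
        for (Map.Entry<String, InputStream> resource : resources.entrySet()) {
            zip.putNextEntry(new ZipEntry(resource.getKey()));
            // Copy the resource in chunks so memory use stays bounded
            // regardless of the size of the data entity.
            try (InputStream in = resource.getValue()) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    zip.write(buffer, 0, n);
                }
            }
            zip.closeEntry();
        }
        // Write the zip central directory without closing the underlying
        // stream, which is assumed to be owned by the caller.
        zip.finish();
        out.flush();
    }
}
```

In a servlet-style service, the `out` argument would presumably be the HTTP response output stream, so the client starts receiving bytes while later entries are still being compressed.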

servilla commented 1 year ago

Successful completion of this issue will resolve #78 since cached versions of the zip archive file will no longer be required.

servilla commented 1 year ago

Subsequent discussion of this issue raised two options: addressing it in the existing Java code base (e.g., the Data Package Manager service) or using a Python web framework. This particular service call could be implemented easily in Python because it does not depend on any other Java classes. We ultimately decided to stay within the Java code base for the following reasons:

- Java supports streaming zip content (as does Python).
- The packaging contents already exist within the current Zip Archive processing.
- The current processing already has access to the data store (read-only access would have to be added to a server where a Python app would exist).
- We do not have a decided-upon Python framework pattern for building out existing PASTA services.
