artefactual-sdps / temporal-activities

Temporal activities is a library of general-purpose activities.
Apache License 2.0

Feature: Add an extract archive activity #6

Closed djjuhasz closed 6 months ago

djjuhasz commented 6 months ago

Is your feature request related to a problem? Please describe.

We have created a number of different implementations of an activity to extract the contents of an archive file (e.g. tar, zip) in the different Enduro projects:

There are also several private client implementations that I haven't listed here.

Maintaining all the different implementations is time-consuming. The implementations also vary in how they deal with archives that include a single, top-level directory, and in whether they extract to a temporary file or not.

Describe the solution you'd like

Implement a general extract activity in this project that can be imported into the various Enduro projects and client pre-processing workflows. Having a single implementation will reduce maintenance costs and make the extraction results more consistent and predictable.

Describe alternatives you've considered

  1. Implement an archive extraction package (not a Temporal activity) in a stand-alone repository.
  2. Implement an archive extraction package (not a Temporal activity) in https://github.com/artefactual-labs/gotools/.

Additional context

The https://github.com/artefactual-sdps/preprocessing-sfa/blob/main/internal/activities/extract_package.go implementation has a few nice features that I think should be included in this repo:

  1. It checks for a single top-level directory after extraction and, if one is present, returns the path to that directory, without requiring the caller to set a "removeTopLevelDirectory" flag.
  2. It doesn't try to remove a top-level directory from the extraction directory; it just returns the path of either the extraction directory or the top-level directory, which achieves the same goal for the caller and is a simpler solution.
  3. It extracts the archive to a random temporary directory, which avoids possible path contamination or extraction errors if the same package is extracted more than once.