Description
Adds support for non-text (binary) files to all file-reading apps
Adds FormatFromZip transform for unarchiving Zip files
Motivation and Context
This adds basic support for unarchiving Zip files (mentioned in #219). Most data processing systems don't work on archive files, so this doesn't add a complementary FormatToZip transform (that would require much more design work).
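For readers unfamiliar with the operation, here is a minimal sketch of unarchiving a Zip file held in memory using only the Go standard library. It is not the code in this PR: `unzipAll` and `buildZip` are hypothetical helper names, and how the actual FormatFromZip transform emits the extracted files is not shown here.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// unzipAll is a hypothetical helper (not the transform from this PR).
// It returns the contents of every regular file in an in-memory Zip
// archive, keyed by the file's name inside the archive.
func unzipAll(archive []byte) (map[string][]byte, error) {
	r, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}

	files := make(map[string][]byte)
	for _, f := range r.File {
		// Directory entries carry no data, so skip them.
		if f.FileInfo().IsDir() {
			continue
		}

		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		b, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		files[f.Name] = b
	}

	return files, nil
}

// buildZip is a hypothetical helper that creates a small archive in
// memory so the example does not depend on a file on disk.
func buildZip(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, body := range files {
		fw, err := w.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := fw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	// Close writes the central directory; the archive is invalid without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	archive, err := buildZip(map[string]string{"a.txt": "foo", "b.txt": "bar"})
	if err != nil {
		panic(err)
	}

	files, err := unzipAll(archive)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(files), string(files["a.txt"]), string(files["b.txt"]))
}
```

The sketch reads each file fully into memory, which is reasonable for small archives but would need size limits for untrusted input.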
The more important addition in this PR is support for non-text files -- before v1.0 this behavior was configurable using an environment variable, but now it's dynamic based on media (file) type. This could go in two directions in the future:
Add more transforms like FormatFromZip (these can be configurable)
Add support for dynamic unarchiving (similar to existing decompression) (these cannot be configurable)
I'm inclined to keep the existing text support as-is (with decompression) and lean into adding more transforms -- the use cases for reading binary files are limited (most users are working with text files) and recursively unarchiving / decompressing files may become a challenge over time.
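To illustrate what "dynamic based on media (file) type" can look like, here is a rough sketch that sniffs the leading bytes of a file and decides whether to treat it as text or binary. It assumes content sniffing with `http.DetectContentType` from the Go standard library; the detection mechanism this PR actually uses is not described above, and `isText` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isText is a hypothetical helper (an assumption, not this PR's code).
// It reports whether the data looks like text, based on the media type
// that the standard library sniffs from at most the first 512 bytes.
func isText(data []byte) bool {
	return strings.HasPrefix(http.DetectContentType(data), "text/")
}

func main() {
	samples := map[string][]byte{
		"log lines": []byte("hello\nworld\n"),
		"zip magic": []byte("PK\x03\x04 remaining bytes"),
		"raw bytes": {0x00, 0x01, 0x02, 0x03},
	}

	for name, data := range samples {
		// Text can be processed line by line; anything else would be
		// passed through whole, for a transform to unarchive or decode.
		fmt.Printf("%s: media type %q, text=%t\n",
			name, http.DetectContentType(data), isText(data))
	}
}
```

With this approach there is nothing for the user to configure: text files keep their existing line-oriented handling, and everything else is handed to transforms as a single binary payload.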
How Has This Been Tested?
Added unit tests for the transform.
Added integration test (examples/config/transform/format/zip/)
This has been tested end to end in several production pipelines.
Types of changes
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation.