Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

Make unpacking archives optional #2215

Open anjackson opened 2 years ago

anjackson commented 2 years ago

We are using mrjob to process WARC files, in a similar manner to the example given in the Writing Jobs guide.

For our use case, it is crucial that the .gz-compressed file is not automatically decompressed before use.

This PR proposes a new setting that allows this to be controlled via an unpack_archives option passed to the mrjob runner. The option defaults to True to preserve the existing behaviour, while allowing us to set it to False when needed. We have tested this locally and it seems to work fine.
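To make the intended semantics concrete, here is a minimal sketch using only the Python standard library. It is not mrjob's code and not the code in this PR: the helper `stage_file` and its arguments are hypothetical, and only the option name `unpack_archives` and its default of True are taken from the description above.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def stage_file(src, dest_dir, unpack_archives=True):
    """Hypothetical helper (not part of mrjob): copy src into dest_dir.

    With unpack_archives=True (the default), a .gz file is decompressed
    and written without its .gz suffix. With unpack_archives=False, the
    file is copied byte for byte, still compressed.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    if unpack_archives and src.suffix == ".gz":
        # Default path: decompress on the way in.
        dest = dest_dir / src.stem
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        # Opt-out path: leave the compressed bytes untouched.
        dest = dest_dir / src.name
        shutil.copyfile(src, dest)
    return dest


# Small demonstration with a stand-in for a compressed WARC file.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sample = tmp / "sample.warc.gz"
    sample.write_bytes(gzip.compress(b"WARC/1.0\r\n"))

    unpacked = stage_file(sample, tmp / "default")
    untouched = stage_file(sample, tmp / "raw", unpack_archives=False)

    print(unpacked.name, len(unpacked.read_bytes()))
    print(untouched.name, untouched.read_bytes() == sample.read_bytes())
```

Defaulting the flag to True means existing callers see no change, and only jobs that need the original compressed bytes have to opt out.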

I've attempted to document this new option, as per the contributing guidelines, but I'm not sure I've covered everything. Is there any other documentation I should add?