postprocessing tasks need to be gzip aware

bertinia commented 8 years ago

For older runs, some of the history and log files are only available in gzip format. The postprocessing tools need to be able to gunzip files in parallel, both via a library call and via a separate utility in the postprocessing suite of tools.

bandre-ucar commented 8 years ago

https://docs.python.org/2/library/gzip.html

(Sent by email in reply to Alice Bertini's comment of Thu, Apr 28, 2016, 2:55 PM on https://github.com/NCAR/CESM_postprocessing/issues/10)

bertinia commented 8 years ago

Thanks Ben - I'm currently using this module for files that I know are gzipped as part of the model run output (e.g. coupler logs). The issue is to make sure we check whether the input netcdf or log files have been gunzipped before running the postprocessing tools, rather than failing.

My thought is to create a postprocess library with the necessary calls to the python gzip library that can then be referenced from any one of the generator tools as well as a stand-alone tool.

This step used to be handled in the stand-alone driver scripts for the diags along with retrieving files from HPSS if necessary. We won't be adding in the HPSS step as part of this workflow but I think we do need to be able to handle gzipped files seamlessly.
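As a rough illustration of the "check before running rather than fail" step, here is a minimal sketch that tests a file's first two bytes against the gzip magic number (0x1f 0x8b). The helper name `is_gzipped` is hypothetical and not part of the postprocessing tools; it only shows one way such a check could look.

```python
# Hypothetical helper, not part of CESM_postprocessing.
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Checking the content rather than the name also catches compressed files that have lost their `.gz` suffix.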

bandre-ucar commented 8 years ago

See:

https://docs.python.org/2/library/mimetypes.html

For the text files, you should just check the mimetype. If it is gzip, read the compressed file with the gzip module. Otherwise open the standard file. Then I think you can use either file handle interchangeably once it's open. The library routine is probably just a few lines.
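A minimal sketch of the text-file routine described above, assuming Python 3 and that the file name carries a `.gz` suffix (`mimetypes.guess_type` looks only at the name, not the contents). The function name `open_text` is hypothetical.

```python
import gzip
import mimetypes


def open_text(path):
    """Open a plain or gzip-compressed text file for reading.

    Hypothetical helper: guess_type() returns (type, encoding), and the
    encoding is "gzip" for names ending in ".gz".
    """
    _, encoding = mimetypes.guess_type(path)
    if encoding == "gzip":
        return gzip.open(path, "rt")  # text mode, decompressed on the fly
    return open(path, "r")
```

Either return value can be used the same way, for example `with open_text(logfile) as f: lines = f.readlines()`.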

For the netcdf files, you can check the mimetype, unzip them into a buffer in Python, then write it out. But I guess the simplest thing to do is a system call out to gunzip. Again not much code. The bulk of the work would be if you want to parallelize unzipping netcdf with MPI.
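A sketch of the pure-Python route for the netcdf case, as an alternative to calling out to gunzip. It streams the decompressed data to disk instead of holding the whole file in a buffer. Both function names are hypothetical, and the round-robin split is just one assumed way to divide files among MPI ranks; the thread does not specify a scheme.

```python
import gzip
import shutil


def gunzip_file(src, dest=None):
    """Decompress `src` to `dest` and return `dest`.

    Unlike the gunzip command, the compressed source file is kept.
    """
    if dest is None:
        # Strip a trailing ".gz"; otherwise fall back to an ".out" suffix.
        dest = src[:-3] if src.endswith(".gz") else src + ".out"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, not all at once
    return dest


def files_for_rank(paths, rank, size):
    """Return the share of `paths` that one of `size` workers handles.

    With mpi4py, rank and size would come from MPI.COMM_WORLD.Get_rank()
    and MPI.COMM_WORLD.Get_size(); sorting keeps the split identical on
    every rank.
    """
    return sorted(paths)[rank::size]
```

Each rank would then loop over `files_for_rank(all_gz_files, rank, size)` and call `gunzip_file` on every entry.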

Ben

(Sent by email in reply to Alice Bertini's comment of Thu, Apr 28, 2016, 4:04 PM: https://github.com/NCAR/CESM_postprocessing/issues/10#issuecomment-215576542)
