NOAA-EMC / NCEPLIBS-bufr

The NCEPLIBS-bufr library contains routines and utilities for working with the WMO BUFR format.

Python and BUFR compression. #562

Closed rmclaren closed 4 months ago

rmclaren commented 5 months ago

The python bindings don't seem to support writing compressed BUFR data (ex: cmpmsg.f is not available). Would be nice to have this. My use case is that I want to generate some test data by sampling some larger BUFR files into much smaller representative data sets...

rmclaren commented 5 months ago

I gave it a shot myself (exposed cmpmsg) but ran into 2 issues:

1) Calling cmpmsg('Y') ended up making larger BUFR files.
2) Not all BUFR data seems to be compressible, as you run out of room in a fixed-size array somewhere...

    import os
    import ncepbufr
    from glob import glob

    NUM_SUBSETS = 250

    def minfile(in_file, out_file):
        subset_cnts = {}
        in_bufr = ncepbufr.open(in_file)
        out_bufr = ncepbufr.open(out_file, 'w', in_bufr)

        out_bufr.cmpmsg('Y')
        while in_bufr.advance() == 0:  # loop over messages
            if in_bufr.msg_type not in subset_cnts:
                subset_cnts[in_bufr.msg_type] = 0

            out_bufr.open_message(in_bufr.msg_type, in_bufr.msg_date)
            while in_bufr.load_subset() == 0:  # loop over subsets
                subset_cnts[in_bufr.msg_type] += 1
                if subset_cnts[in_bufr.msg_type] <= NUM_SUBSETS:
                    out_bufr.copy_subset(in_bufr)
            out_bufr.close_message()

        in_bufr.close()
        out_bufr.close()
rmclaren commented 5 months ago

feature/python_compress

jbathegit commented 4 months ago

Despite the name, it is possible for a file size to increase when using BUFR "compression". The reason is the way BUFR compression works: it looks at corresponding values across all subsets in a BUFR message, then stores the minimum of each set along with the increment/offset from that minimum. But this only works if all of the subsets in a message have an identical number of values, which unfortunately often isn't the case if delayed replication is involved.

In other words, if you have different delayed replication factors across different subsets, then you no longer have an identical number of values in each of those subsets, and therefore you can't compress them. So if you have compression turned on in this situation, then what ends up happening is that each of those individual subsets has to go into its own separate message in the output file, and you lose the benefits of compression and can often end up with an even larger file than you started with.

So bottom line, does the file you're trying to compress contain delayed replication with varying replication factors among different subsets in each message? If so, then that could definitely explain what you're seeing.
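To make the scheme concrete, here's a toy Python sketch of the minimum-plus-increments idea described above (an illustration only, not the library's actual packing code). Note the assertion: if the subsets don't all have the same number of values, the column-wise comparison is impossible, which is exactly the delayed-replication problem.

```python
def compress_columns(subsets):
    """Toy BUFR-style compression: subsets is a list of equal-length
    lists of non-negative ints (one list per subset)."""
    n = len(subsets[0])
    assert all(len(s) == n for s in subsets), \
        "BUFR compression requires the same value count in every subset"
    packed = []
    for i in range(n):
        column = [s[i] for s in subsets]      # same value across subsets
        base = min(column)                    # stored minimum
        increments = [v - base for v in column]
        # bits needed to hold the largest increment (vs. the raw value)
        width = max(v.bit_length() for v in increments)
        packed.append((base, width, increments))
    return packed

# Three subsets with identical layouts compress well: each column is
# stored as one minimum plus small, narrow increments.
subsets = [[27315, 10132, 5], [27320, 10128, 7], [27311, 10135, 6]]
for base, width, incs in compress_columns(subsets):
    print(base, width, incs)
```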

rmclaren commented 4 months ago

Many of them do (I'm working on a whole directory of files)... But even the contents of MHS get bigger (uses fixed reps...).

Seems like they could have just used zip compression (just zip the BUFR file). That way the internal details of the file would be irrelevant... Many file formats actually do this. For example, a Microsoft Word save file is really just a zip file; if you expand it, you find a bunch of XML files... Not our call, just saying :)
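For what it's worth, that kind of external compression is trivial to do outside the library since it doesn't depend on the file's internal structure. A minimal sketch using Python's standard gzip module (file names are hypothetical):

```python
import gzip
import shutil

def gzip_file(in_path, out_path):
    """Compress a file with gzip, ignoring its internal format entirely."""
    with open(in_path, 'rb') as f_in, gzip.open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# gzip_file('obs.bufr', 'obs.bufr.gz')
```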

jbathegit commented 4 months ago

Understood, but any additional compression involving zip, gzip, etc. is outside of the purview of NCEPLIBS-bufr. The library only handles the compression algorithm that's internal to BUFR itself as prescribed in the WMO regulations, and which is what I described above.

That said, we could still try to pull in your python binding for cmpmsg, in case you think it might be useful to have available for other future applications(?)

rmclaren commented 4 months ago

Up to you. I'm just doing some one-off things...