Enhancements to GEMPAK reader(s)

jthielen commented 3 years ago

Just documenting some desired enhancements to the GEMPAK io submodule brought up in the original PR, other issues, and side conversations, as well as some ideas I had while working on recent PRs:

[x] Speed up snxarray (https://github.com/Unidata/MetPy/issues/2062, resolved by https://github.com/Unidata/MetPy/pull/2064)
[x] Get grid mapping information in the right place (https://github.com/Unidata/MetPy/issues/2046, resolved by https://github.com/Unidata/MetPy/pull/2066)
[ ] Add support for GRIB2 packing reader
- Any way that https://github.com/ecmwf/cfgrib can be leveraged?
[ ] Add unit metadata to variables and coordinates in *xarray() methods (https://github.com/Unidata/MetPy/issues/2118, referenced in https://github.com/Unidata/MetPy/issues/2046)
[ ] Include lat/lon auxiliary coordinates on gdxarray() output
[ ] Consolidate xarray output where possible (multiple DataArrays into Dataset, multiple Datasets into single Dataset)
[ ] Support xarray backend API
- Need to provide subclasses of xarray.backends.BackendEntrypoint (actual backend) and xarray.backends.BackendArray (lazy loaded arrays)
- May require a substantial refactor
[ ] Add more CF metadata (to help with declarative / #3 / #886)
[ ] Clean up (and potentially speed up) interp functions using https://github.com/Unidata/MetPy/issues/1312

sgdecker commented 2 years ago

Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

sgdecker commented 2 years ago

Add unit metadata to variables and coordinates in *xarray() methods

A variation of this would be to allow pressures with units for the level keyword argument. For instance:

p = 70000 * units('Pa')
hght = gem_file.gdxarray(parameter='HGHT', level=p)

would return the 700-mb heights rather than producing a TypeError as it does now.

nawendt commented 1 year ago

I've become more interested in combining multiple DataArray into a single Dataset, particularly with grids. I've only taken a cursory look at it. My struggle has been how to deal with combine vertical coordinate levels. It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate. I think (?) this is what is being done with xarray's own netCDF backend. There is logic in there that checks if things are "stackable". I'll need to understand how to apply that in this case to think of a solution. If anyone has any good ways to conceptualize this, let me know. The other challenge with combining DataArray will come from layer data (i.e., when GEMPAK has GLV1 and GLV2 defined for a grid). I am not sure you can combine that information into a single dimension. The current iteration just uses the first level information. You might need to have grids with another dimension that contains the second layer level.

sgdecker commented 1 year ago

It seems like you would have to keep track of a variable name and its particular vertical coordinate and then append levels as appropriate.

I think this is the crux of the required functionality, but with the extra complication that some variables are only provided on a subset of levels. For instance, you might have geopotential height every 50 mb, but absolute vorticity only at 850, 700, 500, and 300 mb. xarray's backend thus considers geopotential height and absolute vorticity to have two distinct vertical coordinates, and this leads to the explosion of coordinates that you see in, say, the xarray Dataset that corresponds to GFS model output pulled from a TDS. To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).

dopplershift commented 1 year ago

To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).

There are probably ways using lazy arrays to leave these missing chunks uninstantiated until they're needed (e.g. for plotting or calculations). Having said that, this would introduce a bunch of NaNs in e.g. profiles, plus it implies that data are available for these levels, only to be left missing. Dealing with this in code leaves a bunch of complications dealing with missing data. I'd personally opt for just having multiple coordinates and letting e.g. data.metpy.sel(vertical=500 * units.hPa) handle dealing with it.

nawendt commented 1 year ago

Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.

I was taking a closer look at decoding GRIB2 packing recently. It's definitely more complicated than the others to implement. That being said, it seems doable. GEMPAK has what looks like a fully functional GRIB2 decoder. It's possible that it could be pared down a bit if all we care about is just unpacking the data. The GEMPAK decoder also handles PNG and JPEG compressed data. I don't work with GRIB2 packed GEMPAK files, so I do not know how often those appear in the wild. Dealing with those compressions would introduce another layer of complication and perhaps some additional dependencies.

dopplershift commented 1 year ago

Dealing with PNG/JPEG compression (assuming it's not JPEG2k) should be doable using Pillow (which is already in our dependency chain thanks to Matplotlib. The hard part here really isn't decoding the grid, but interpreting all the other bits and bytes that amount to metadata.

I've got an old start on something to walk through GRIB messages:

log = logging.getLogger("metpy.io.grib")
#log.setLevel(logging.WARNING)

def _make_datetime(s):
    s = bytearray(s)
    year = (s[0] << 8) | s[1]
    return datetime(year, *s[2:])

section0 = NamedStruct([('grib', '4s'), ('reserved', 'H'), ('discipline', 'B'),
                        ('edition', 'B'), ('total_length', 'Q')], '>', 'Section0')

section_header =  NamedStruct([('section_len', 'I'), ('section_num', 'B')], '>', 'section_head')

class Descriptor(object):
    def __init__(self, fxy):
        pass

    def read(self, buf):
        pass

class Element(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_b[fxy]
        self.bits = int(self._info['BUFR_DataWidth_Bits'])

    def read(self, src):
        return src.read(self.bits)

class Replication(Descriptor):
    def __init__(self, descs, repeats):
        pass

    def read(self, src):
        return src.read(self.bits)

class Operator(Descriptor):
    def __init__(self, fxy):
        pass

class Sequence(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_d[fxy]
        self.bits = [int(bufr_table_b[int(i['FXY2'])]['BUFR_DataWidth_Bits']) for i in self._info]

    def read(self, src):
        return [src.read(b) for b in self.bits]

class GRIBMessage(object):
    def __init__(self, fobj):
        hdr = section0.unpack_file(fobj)
#        print(hdr)
        if hdr.grib != b'GRIB':
            logging.error('Bad section 0: %s', hdr)
            raise RuntimeError('Bad section 0: %s' % str(hdr))

        self._buffer = IOBuffer(fobj.read(hdr.total_length - section0.size))
        self.sections = [hdr]

        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        sec = self._read_section()
#        print(sec)
        end_bytes = self._buffer.read(4)
        if end_bytes != b'7777':
            logging.warning('Last bytes not correct: %s', end_bytes)

    def _read_section(self, jump_end=True):
        sec_start = self._buffer.set_mark()
        sec_hdr = self._buffer.read_struct(section_header)

        sec = getattr(self, '_read_section{:d}'.format(sec_hdr.section_num))()
        self.sections.append(sec)
        if jump_end:
            self._buffer.jump_to(sec_start, sec_hdr.section_len)
        return sec

    section1 = NamedStruct([('center', 'H'),
                            ('subcenter', 'H'), ('master_table_version', 'B'), ('local_table_version', 'B'),
                            ('ref_time_significance', 'B'), ('dt', '7s', _make_datetime),
                            ('production_status', 'B'), ('data_type', 'B')], '>', 'Section1')
    #                        ('template_id', 'H')], '>', 'Section1')
    def _read_section1(self):
        return self._buffer.read_struct(self.section1)

    def _read_section2(self):
        raise NotImplementedError('Local section not handled')

    section3 = NamedStruct([('grid_def_source', 'B'), ('num_data_points', 'I'), ('num_octets', 'B'),
                            ('interp', 'B'), ('template_num', 'H')], '>', 'Section3')

    def _read_section3(self):
        return self._buffer.read_struct(self.section3)

    section4 = NamedStruct([('num_values', 'H'), ('prod_template', 'H')], '>', 'Section4')
    def _read_section4(self):
        return self._buffer.read_struct(self.section4)

    section5 = NamedStruct([('num_values', 'I'), ('data_template', 'H')], '>', 'Section5')
    def _read_section5(self):
        return self._buffer.read_struct(self.section5)

    section6 = NamedStruct([('indicator', 'B')], '>', 'Section6')
    def _read_section6(self):
        return self._buffer.read_struct(self.section6)

    def _read_section7(self):
        self._data_ptr = self._buffer._offset

nawendt commented 1 year ago

It does appear to be JPEG2k.

dopplershift commented 1 year ago

I kinda figured since that's why we were dealing with the JJ2000 library over on the netcdf-java side. Looks like Pillow can support jpeg2k, not sure how regularly it's included in builds. We can cross that bridge when we get to it I guess.

Unidata / MetPy

Enhancements to GEMPAK reader(s) #2067