SpatioTemporal / STAREPandas

STAREpandas adds SpatioTemporal Adaptive Resolution Encoding (STARE) support to pandas DataFrames. https://starepandas.readthedocs.io/en/latest/
MIT License
4 stars 1 forks source link

Advice for using a sidecar file with gridded data #144

Open mbauer288 opened 1 year ago

mbauer288 commented 1 year ago

I'm having problems correctly using a sidecar file for a gridded dataset (IMERG). Here is the generalized problem.

1) I make a sidecar file for IMERG using STAREMaster_py ...

    $ python create_sidecar_files.py --workers 4 --grid IMERG

    IMERG_stare.nc 
        >>>> ulong STARE_index(i=1800, j=3600);

And this matches up with the dataset I'm aiming to pod:

    DYAMONDv2_PE3600x1800-DE.prectot.20200116_0000z.nc4
        >>>> float PRECTOT(time=1, lat=1800, lon=3600);

2) But when I attempt to form a STAREPandas dataframe using it as so...

    gdf = starepandas.read_granule(granule_path, sidecar=True,
                                   latlon=False, read_timestamp=False,
                                   sidecar_path=sidecar_path,
                                   add_sids=False, adapt_resolution=True)

I get an error which leads me to that think I'm likely evoking the starepandas.read_granule() method incorrectly for this kind of sidecar (gridded data). Inside said method, I see that it gets the correct sidecar info, although the self.nom_res = None parameter might be an issue.

    granule.read_sidecar_index(): 
        ds = <class 'netCDF4._netCDF4.Dataset'>
        ds.dimensions = 'i'          : size = 1800 
                        'j'          : size = 3600 
                        'l'          : size = 8
        ds.variables  = 'STARE_index': uint64 STARE_index(i, j)
                        'STARE_cover': uint64 STARE_cover(l)
        self.nom_res  = None

Here is where the error is raised (in starepandas.read_granule()).

    try:
        self.sids = ds['STARE_index_{}'.format(self.nom_res)][:, :].astype(numpy.int64)
    except IndexError:
        # If we don't have a nomres?
        >>>> 'STARE_index{}'.format(self.nom_res) = 'STARE_indexNone'
        self.sids = ds['STARE_index{}'.format(self.nom_res)][:, :].astype(numpy.int64)

        >>>> STAREPandas/starepandas/io/granules/granule.py", line 66, in read_sidecar_index
        >>>>   self.sids = ds['STARE_index{}'.format(self.nom_res)][:, :].astype(numpy.int64)
        >>>>   IndexError: STARE_indexNone not found in /

Any suggestions? Thanks.


Note, starepandas.read_granule() is inherited from the Granule class via a class I created) as should be clear from the code snippet below.

STAREPandas/starepandas/io/granules/
    __init__.py:
        from .imergl3 import L3IMERG, DYAMONDv2

        Added 
            'L3IMERG': L3IMERG,
            'DYAMONDv2': DYAMONDv2
        to granule_factory_library

    imergl3.py:

        # Standard Imports
        import os
        import datetime

        # Third-Party Imports
        import numpy

        # STARE Imports
        from starepandas.io.granules.granule import Granule
        import starepandas

        class L3IMERG(Granule):
            def __init__(self, file_path, sidecar_path=None):
                # Use Granule.__init__() for this instance (self)
                super().__init__(file_path, sidecar_path)
                ##
                # Opens file_path using netCDF4
                self.netcdf = starepandas.io.s3.nc4_dataset_wrapper(self.file_path, 'r', format='NETCDF4')

            # Not implemented yet
            # def read_timestamps(self):
            #    self.ts_start = self.netcdf.time_coverage_start
            #    self.ts_end = self.netcdf.time_coverage_end

            def read_latlon(self):
                self.lat = self.netcdf['lat'][:].astype(numpy.double)
                self.lon = self.netcdf['lon'][:].astype(numpy.double)

        class DYAMONDv2(L3IMERG):
            """Special case for IMREG precip files with file names like
                DYAMONDv2_PE3600x1800-DE.prectot.20200116_0000z.nc4
            """
            def __init__(self, file_path, sidecar_path=None):
                # Use L3IMERG.__init__(), which calls Granule.__init__() for this instance (self)
                super().__init__(file_path, sidecar_path=sidecar_path)

Then in my code:

    from STAREpodder.cfg.def_IMERGPF import IMERGPF

    starepandas.io.granules.granule_factory_library['DYAMONDv2'] = IMERGPF

    gdf = starepandas.read_granule(granule_path, sidecar=True,
                               latlon=False, read_timestamp=False,
                               sidecar_path=sidecar_path,
                               add_sids=False, adapt_resolution=True)

I should add some potential dependency issues that might be at work here (e.g., Pandas 2.0).

Name Version
astropy 5.2.2
cartopy 0.21.1
gdal 3.6.3
geopandas 0.12.2
geopandas-base 0.12.2
geos 3.11.2
h5py 3.8.0
hdf4 4.2.15
hdf5 1.12.2
hdfeos2 2.20
matplotlib 3.7.1
numpy 1.24.2
pandas 2.0.0
proj 9.1.1
pygeos 0.14
pyhdf 0.10.5
pyproj 3.5.0
pyshp 2.3.1
pytest 7.3.1
python 3.11.3
shapely 2.0.1
STARE Installs
pystare 0.8.12 STARE
staremaster 0.0.4 STARE
starepandas 0.6.6 STARE
michaelleerilee commented 1 year ago

This is a bug. the nom_res is None, which should be mapped to a null string, not to the string "None". As a workaround, try setting the read_granule argument nom_res="".

Granules can have different resolution data, which implies a need to select between them. This is done via the argument _nomres, which stands for nominal_resolution. It is a string that is appended to the canonical variable names, i.e. if nom_res='750km', you'd refer to latitude_750km or latitude750km.

The fix for the Granule class is to check nom_res for None and set to "", if so.

mbauer288 commented 1 year ago

Okay, I was toying with the "fix" to put something like:

nom_res_str = "" if nom_res is None else nom_res

or

nom_res_str = "" if self.nom_res is None else self.nom_res

at the top of certain granule methods.

This fixes the problem as you thought.

mbauer288 commented 1 year ago

Hmm, I think I need a bit more help. So the previous problem is solved, but now I get an error forming the STARE dataframe.

gdf = starepandas.read_granule(granule_path, sidecar=True,
                               latlon=False, read_timestamp=False,
                               sidecar_path=sidecar_path,add_sids=False, 
                               adapt_resolution=True, nom_res=None)

in granule.to_df(xy=False): 
        to_df(): self.lat.shape = (1800,)            flattened (1800,)
        to_df(): self.lon.shape = (3600,)            flattened (3600,)
        to_df(): self.sids.shape = (1800, 3600)      flattened (6480000,)
        to_df(): dtype = dtype('float64')
        to_df(): self.data[key].shape = (1800, 3600) flattened (6480000,)
        to_df(): dtype = dtype('float64')
        to_df(): series.shape = (6480000,)

raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

I haven't implemented the temporal part of the inputs yet, but I don't see how that gives this error.

michaelleerilee commented 1 year ago

The dataframe (Granule class I think) doesn't know how to integrate the 1D lat and lon and the 2D sids into a single frame. to_df will have to be specialized (or read_granule) to rectify lat and lon to the same size and shape as sids. This is a common issue if one has a simple 2D regular grid and one only keeps the 1D forms of the dimensional coordinates.

Alternatively, the sidecar constructor could write 2D lat and lon coordinate variables (commensurate with the data) and the existing Granule class (which was originally developed for Level 2 non-gridded data) would work. This approach takes more space in storage, but each grid type only needs one sidecar file.

One approach would be to make Granule smarter (i.e. guess how to rectify the coordinate arrays) or make end-users specialize the method or class.

mbauer288 commented 1 year ago

Okay,

So something like

##
# Make a (lat=1800, lon=3600) mesh grid to match the sidecar SIDs array
#   Use Cartesian indexing lat[i, j]
lon, lat = numpy.meshgrid(lon, lat, copy=True, indexing='xy')

sids (1800, 3600)
lat (1800) = [ -89.950 ...  +89.950]
lat (3600) = [-179.950 ... +179.950]
lat (1800, 3600)
lon (1800, 3600)

    i     j   lat     lon
[   0,    0]: -89.950 -179.950
[   0,    1]: -89.950 -179.850
[   0,    2]: -89.950 -179.750
...
[   0, 1797]: -89.950   -0.250
[   0, 1798]: -89.950   -0.150
[   0, 1799]: -89.950   -0.050

...

[1799,    0]: +89.950 -179.950
[1799,    1]: +89.950 -179.850
[1799,    2]: +89.950 -179.750
...
[1799, 3597]: +89.950 +179.750
[1799, 3598]: +89.950 +179.850
[1799, 3599]: +89.950 +179.950
michaelleerilee commented 1 year ago

Yes, something along those lines should work.

mbauer288 commented 1 year ago

That works, IMERG podding code test nearing completion...

michaelleerilee commented 1 year ago

@NiklasPhabian I'm working on a fix.