SpatioTemporal / STAREPandas

STAREpandas adds SpatioTemporal Adaptive Resolution Encoding (STARE) support to pandas DataFrames. https://starepandas.readthedocs.io/en/latest/
MIT License
4 stars 1 forks source link

Problems saving to GeoPackage #122

Open mbauer288 opened 1 year ago

mbauer288 commented 1 year ago

An FYI there are two issues to be aware of concerning saving STAREPandas DataFrames to the GeoPackage (GPKG).

Issue-A: The GeoPackage API does not allow for multiple geometry-type columns in a DataFrame.

This happens when your DataFrame already has a geometry column (POINT, POLYGON etc.) and then you another geometry column such as a trixel cover (POLYGON). This will not save as is to a GeoPackage file. Note that GeoPackage is find with additional non-geometry columns like SIDs.

Issue-B: The GeoPackage API does not allow for array-type columns in a DataFrame.

This is similar to Issue-A, but affects attempts to even save just SIDs when the underlying geometry is POLYGON (or similar), because the SIDs are returned as numpy arrays rather than a single int64. The solution here is to convert the SID array in each row to a string. This will allow you to save as a GeoPackage file. However, you will have to reverse this when reading these files (converting each SID row from a string back to a numpy array of int64). The following is a simple example of this.

# Standard Imports

# Third-Party Imports
import geopandas as gpd
import numpy as np

# STARE imports
import starepandas as spd

##
# Read shapefile, make a STAREDataFrame, and find the SIDs
shapefile_fname = "/VIIRS/HMS_GASP/Smoke_Polygons/2019/01/hms_smoke20190101.zip"
shp_file_gdf = gpd.read_file(sfile)
shp_file_sdf = spd.STAREDataFrame(shp_file_gdf)
shp_file_sids = shp_file_sdf.make_sids(10, n_partitions=1)
r"""
>>>> type(shp_file_sids) = <class 'pandas.core.series.Series'>
    shp_file_sids = 
        0    [3133450908998828042, 3133454757289525258, ... ]
        1    [3119807069209755656, 3119536589349322761, ... ]
        2    [3124957181674258442, 3124957731430072330, ... ]
        3    [3330590045324181514, 3330601040440459274, ... ]
        4    [3330590045324181514, 3330601040440459274, ... ]
        5    [3330659314556731402, 3330663162847428618, ... ]
        6    [3400262248885649418, 3400263348397277194, ... ]
        7    [3400323271780990986, 3400323821536804874, ... ]
        8    [3126272197581078538, 3126272747336892426, ... ]
        9    [3447235034891681802, 3447246030007959562, ... ]
"""

##
# Add SIDs to DF
shp_file_sdf.set_sids(shp_file_sids, inplace=True)
r"""
>>>> type(shp_file_sdf) = <class 'starepandas.staredataframe.STAREDataFrame'>

       Satellite         Start           End Density                                           geometry                                               sids
    0  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-77.083 26.317, -77.191 26.326, -77  ...)),  [3133450908998828042, 3133454757289525258, ...]
    1  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-80.520 26.617, -80.639 26.867, -80  ...)),  [3119807069209755656, 3119536589349322761, ...]
    2  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-83.684 30.809, -83.628 30.850, -83  ...)),  [3124957181674258442, 3124957731430072330, ...]
    3  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-120.503 37.385, -120.537 37.433, -1 ...)),  [3330590045324181514, 3330601040440459274, ...]
    4  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-120.403 37.165, -120.473 37.245, -1 ...)),  [3330590045324181514, 3330601040440459274, ...]
    5  GOES-EAST  2019001 1530  2019001 1730   Light  POLYGON ((-120.942 37.883, -120.930 37.924, -1 ...)),  [3330659314556731402, 3330663162847428618, ...]
    6  GOES-EAST  2019001 2217  2019001 2217   Light  POLYGON ((-83.000 22.690, -83.149 22.736, -83  ...)),  [3400262248885649418, 3400263348397277194, ...]
    7  GOES-EAST  2019001 2052  2019001 2052   Light  POLYGON ((-83.134 22.584, -83.308 22.645, -83  ...)),  [3400323271780990986, 3400323821536804874, ...]
    8  GOES-EAST  2019001 2147  2019001 2147   Light  POLYGON ((-77.092 26.319, -77.184 26.366, -77  ...)),  [3126272197581078538, 3126272747336892426, ...]
    9  GOES-EAST  2019001 2227  2019001 2227   Light  POLYGON ((-115.106 32.665, -115.138 32.465, -1 ...)),  [3447235034891681802, 3447246030007959562, ...]

>>>> type(shp_file_sdf['sids'].iloc[0]) = <class 'numpy.ndarray'>
    00 len(shp_file_sdf['sids'].iloc[row]) = 18
    01 len(shp_file_sdf['sids'].iloc[row]) = 73
    02 len(shp_file_sdf['sids'].iloc[row]) =  4
    03 len(shp_file_sdf['sids'].iloc[row]) = 10
    04 len(shp_file_sdf['sids'].iloc[row]) = 13
    05 len(shp_file_sdf['sids'].iloc[row]) =  8
    06 len(shp_file_sdf['sids'].iloc[row]) = 13
    07 len(shp_file_sdf['sids'].iloc[row]) =  8
    08 len(shp_file_sdf['sids'].iloc[row]) = 20
    09 len(shp_file_sdf['sids'].iloc[row]) = 32
"""

##
# Convert SIDs arrays to strings
for row in range(len(shp_file_sdf)):
    shp_file_sdf.at[row, 'sids'] = np.array2string(shp_file_sdf.loc[row, 'sids'])

##
# Save as a GeoPackage file
#
geopackage_fname = shapefile_fname.replace(".zip", ".gpkg")
shp_file_sdf.to_file(geopackage_fname, driver="GPKG")

del shp_file_sdf, shp_file_sids

##
# Read GeoPackage 
test_file_gdf = gpd.read_file(geopackage_fname)

##
# Back-Convert SID strings to numpy arrays
for row in range(len(test_file_gdf)):
    r"""
    test_file_gdf.loc[0, 'sids']           = '[3133450908998828042 3133454757289525258 3133456956312780810\n 3133696649847635978 3133698848870891530 3133699398626705418\n 3133699948382519306 3133702697161588746 3133703796673216522\n 3133704346429030410 3133837387335991306 3133837937091805194\n 3133838486847619082 3133840136115060746 3133841785382502410\n 3133843984405757962 3133844534161571850 3133845083917385738]'
    tmp                                    = '3133450908998828042 3133454757289525258 3133456956312780810 3133696649847635978 3133698848870891530 3133699398626705418 3133699948382519306 3133702697161588746 3133703796673216522 3133704346429030410 3133837387335991306 3133837937091805194 3133838486847619082 3133840136115060746 3133841785382502410 3133843984405757962 3133844534161571850 3133845083917385738'
    np.fromstring(tmp, dtype=int, sep=' ') = array([3133450908998828042, 3133454757289525258, ...  3133845083917385738])
    len(test_file_gdf.loc[0, 'sids'])      = 18
    >>>> {type(test_file_gdf['sids'].iloc[0][0]) = <class 'numpy.int64'>
    """
    tmp = test_file_gdf.loc[row, 'sids']
    tmp = tmp.replace('[', '').replace(']', '').replace('  ', ' ').replace('\n', '')
    test_file_gdf.at[row, 'sids'] = np.fromstring(tmp, dtype=int, sep=' ')

I guess something similar would work for Issue-A (converting the POLYGON geometry columns to strings and back again).

NiklasPhabian commented 1 year ago

Regarding issue A) Certainly a discussed topic in geopandas: https://github.com/geopandas/geopandas/issues/1490 Geopanda's to_file() supports writing to shapefiles and GeoJSON, neither of which allow for multiple geom columns. Therefore, to_file() will (for now) not allow writing GeoDataFrames with more than one geometry column. This is unfortunate since GeoDataFrames, GeoPackges, SpatiaLite, and PostGIS all very much do support multiple geometry columns. For now, we need customize writing/reading functions. Here is how I did for PostGIS. I will add this to STAREPandas ASAP. As soon as we are confident enough, we should also suggest this to the folks at geopandas

import geopandas
import pandas
import shapely.wkb
import shapely.geos
import geoalchemy2
import sqlalchemy
import psycopg2.extensions
import numpy

def load_geom_text(x):
        """Load from binary encoded as text."""
        return shapely.wkb.loads(str(x), hex=True)

def read(table_name, con):
    gdf = pandas.read_sql(f'SELECT * FROM {table_name}', con=con)    
    query = f"SELECT * FROM geometry_columns WHERE f_table_name = '{table_name}'"
    geom_columns = pandas.read_sql(query, con=con)

    query = """SELECT column_name 
           FROM information_schema.columns 
           WHERE table_name='cells_roi' 
           AND data_type='ARRAY'"""
    arrays = pandas.read_sql(query, con=con)['column_name'].tolist()

    for array in arrays:
        gdf[array] = gdf[array].apply(numpy.array)

    for column in geom_columns.f_geometry_column:                        
        geoms = gdf[column].apply(load_geom_text)
        crs = shapely.geos.lgeos.GEOSGetSRID(geoms[0]._geom)
        gdf[column] = geopandas.GeoSeries(geoms, crs=crs)
    return gdf

def get_geom_type(gdf, column):
    geom_types = list(gdf[column].geom_type.unique())
    if len(geom_types) == 1:
        target_geom_type = geom_types[0].upper()
    else:
        target_geom_type = "GEOMETRY"
    return target_geom_type

def addapt_numpy_float64(numpy_float64):
    return psycopg2.extensions.AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return psycopg2.extensions.AsIs(numpy_int64)

typemap ={'int16': sqlalchemy.types.Integer,
          'int64':  sqlalchemy.types.BigInteger,
          'object': sqlalchemy.types.ARRAY(sqlalchemy.types.BigInteger),
          'float64': sqlalchemy.types.Float}

def write(gdf, engine):
    gdf = gdf.copy()
    gdf = geopandas.GeoDataFrame(gdf)

    g_dtypes = {}
    for column, dtype in gdf.dtypes.items():        
        if dtype == 'geometry':
            dtype = get_geom_type(gdf, column)
            dtype = geoalchemy2.Geometry(dtype)
        elif dtype.name in typemap.keys():        

            dtype = typemap[dtype.name]
        g_dtypes[column] = dtype

    srid = 4326

    for column, dtype in gdf.dtypes.items():
        if dtype == 'geometry':
            gdf[column] = [shapely.wkb.dumps(geom, srid=srid, hex=True) for geom in gdf[column]]

    psycopg2.extensions.register_adapter(numpy.float64, addapt_numpy_float64)
    psycopg2.extensions.register_adapter(numpy.int64, addapt_numpy_int64)

    gdf.to_sql(name='cells', con=engine, if_exists='replace', dtype=g_dtypes, index=False)    

Regarding issue B) GeoPackages are based on SQLite and SQLite does support for arrays. Therefore they have to be serialized one way or the other. Converting to string is one way. Another one would be to convert to some binary form, e.g. via pickle or io. Probably nicer would be to register a datatype in the SQLite connection: https://stackoverflow.com/questions/18621513/python-insert-numpy-array-into-sqlite3-database

One way or another, we need to make reading and writing for both issue A) and B) transparent. E.g. overloading geopandas' to_file() or with a starepandas to_starepackage() (or to_starelite()) method. The latter would is nice since STARELite already has an int64array datatype.

michaelleerilee commented 1 year ago

Make a notebook example.

NiklasPhabian commented 10 months ago

I added the functionality. You can now do starepandas.to_postgis(sdf, engine, table_name). Eventually. sdf.to_postgis() would probably be nice