NOAA-OWP / t-route

Tree based hydrologic and hydraulic routing
Other
44 stars 50 forks source link

drop down to sqlite3 for gpkg read to avoid gdal dependencies #604

Open jameshalgren opened 1 year ago

jameshalgren commented 1 year ago

https://github.com/NOAA-OWP/t-route/blob/01c713acb02e03e1d4222b62c398c30981485030/src/troute-network/troute/HYFeaturesNetwork.py#L16-L20

In most cases (in every case?) for t-route, we are discarding the geometry data and only use the geo-enabled packages for convenience in reading the files where the network connectivity information is stored.

To avoid unnecessary (see note) dependencies, we could update this function to look something like the following:

## TESTED; See note below
import sqlite3
import pandas as pd

def read_geopkg(file_path):
    flowpaths = read_geopkg_layer_sqlite3(file_path, layer="flowpaths")
    attributes = read_geopkg_layer_sqlite3(file_path, layer="flowpath_attributes")
    #merge all relevant data into a single dataframe
    return pd.merge(flowpaths, attributes, on='id')

def read_geopkg_layer_sqlite3(file_path, layer):
    con = sqlite3.connect(file_path)
    return pd.read_sql_query("SELECT * from " + layer, con)

Note: More accurately, "troublesome" dependencies. Gdal has been causing some trouble on the arm64 builds and if we can circumvent those issues, we gain (potentially) quite a bit of performance. Note2: The geopandas read is (I believe) lazy, so this direct sql read with the connection closing will carry a performance hit, potentially, especially for very large flowpath tables.

jameshalgren commented 1 year ago

Tested this by using the old function and the new function to read a geopackage for the 03w VPU:

wget -P 03w -c https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg

Then compare the outputs:

d = read_geopkg("03w/nextgen_03W.gpkg")
d2 = read_geopkg_sqlite3("03w/nextgen_03W.gpkg")
d.drop(["geometry"],axis=1) == d2.drop(["geom", "fid_x", "fid_y"],axis=1)
## Returns all True

The only difference is the geometry.

Useful comments on datacarpentry.org here.

jameshalgren commented 1 year ago

ping @ZacharyWills

groutr commented 1 year ago

Your sql statement is a special case that has its own convenience function, pd.read_sql_table.

pd.read_sql_table('flowpaths', con, index_col='id')
jameshalgren commented 2 months ago

Your sql statement is a special case that has its own convenience function, pd.read_sql_table.

pd.read_sql_table('flowpaths', con, index_col='id')

Python 3.10+