Open ptomasula opened 3 months ago
@rajadain, we have finalized our processing pipeline for the TDX Hydro stream network ('streamnet') and corresponding basins ('streamreach_basins') files. We are presently running the full set of files for the globe, which should be completed this afternoon.
In the meantime, here is an example set of the three GeoParquet files that will be produced for each of the 62 TDX Hydro Regions (provided in the tdx_regions.parquet):
These supersede the files we shared with you two weeks ago under https://github.com/WikiWatershed/global-hydrography/issues/4#issuecomment-2234348477.
These files have been substantially compressed vs NGA's GeoPackage files.
@rajadain, could you work with your team to:
We are getting close to delivering this to you:
All of the above will likely benefit from using a parallel set of simplified geometries, which we are also exploring.
For now, read these files using the gpd.read_parquet(geoparquet_path) method, but we can speed up reading the geometry fields 2x by reading as pyarrow.Table objects, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need to have GDAL 3.9 installed, as described in that comment.
from @ptomasula's Oct 4 email to @rajadain:
We have uploaded parquet files with the modified nested set index (MNSI) information for 61 of the 62 TDXHydro regions to that S3 bucket. The missing files (5020054880) are for a region in Australia and failed during our initial run of the processing pipeline. We still wanted to get the bulk of the data over to you, since it will likely take some time to download and integrate into the system. We’ll investigate that last file next week and get it over to you soon.
Anthony outlined a fair bit of this under this issue when he provided you with an example set of files, but I think it’s worth repeating here. For each TDXHydro region there are 3 files;
In addition to the TDXHydro data fields, these files also each contain the MNSI fields. We’ll send a follow-up email with additional information and instructions on how to leverage the fields for delineation algorithms, but here is a brief explanation of the fields we have added.
ROOT_ID: identifies the most downstream stream reach or point of confluence for the watershed. This is useful in differentiating the watersheds when interpreting the rest of the MNSI fields.
DISCOVER_TIME: indicates the number of iterations in a depth-first search to reach the stream reach.
FINISH_TIME: indicates the number of iterations to revisit the stream reach.
For the basin files, there are also two additional fields to support pre-dissolving basin geometries and improve delineation performance.
DISSOLVE_ROOT_ID: identifies the most downstream elements of a subshed (a grouping of basins to pre-dissolve).
ELEMENT_COUNT: indicates the number of upstream basins for a stream reach.
Lastly, we have converted the index values in LINKNO, DSLINKNO, USLINKNO1, and USLINKNO2 into a globally unique version. You may recall that the index as provided by TDXHydro is only unique within a given region; however, we need a globally unique identifier for the entire dataset. We have applied logic based on the Geoglows V2 approach, using the following equation: LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000).
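To make the LINKNO equation concrete, here is a small sketch. The function names are hypothetical, and two things are assumptions rather than something stated in the email: that negative values (such as -1) in the link fields mean "no link" and are left unchanged, and that regional LINKNO values stay below 10_000_000 so the mapping can be inverted.

```python
REGION_OFFSET = 10_000_000  # multiplier from the LINKNO_NEW equation


def to_global_linkno(linkno: int, tdx_header_number: int) -> int:
    """Apply LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)."""
    if linkno < 0:
        # Assumption: negative values mark "no downstream/upstream link"
        # and should not be shifted into a valid ID range.
        return linkno
    return linkno + tdx_header_number * REGION_OFFSET


def from_global_linkno(global_linkno: int) -> tuple[int, int]:
    """Split a global ID back into (TDX_HEADER_NUMBER, regional LINKNO).

    Only valid under the assumption that regional LINKNO < 10_000_000.
    """
    return divmod(global_linkno, REGION_OFFSET)


# Hypothetical header number 7 and regional LINKNO 123
example_global = to_global_linkno(123, 7)
```

The same function would be applied to DSLINKNO, USLINKNO1, and USLINKNO2 so that the links between reaches stay consistent after the conversion.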
@ptomasula @aufdenkampe
Thanks for the info. I was able to ingest the GeoParquet files into PostGIS after some trial and error.
I ingested the TDX_streamnet_mnsi files to a tdxstreams table which will be used for analyzing streams, and for visualizing blue lines (still working on styling updates recommended in https://github.com/WikiWatershed/model-my-watershed/issues/3625#issuecomment-2371838760). I've added an index on stream_order (renamed from strmorder for consistency with NHD tables) to help with the visualization.
I ingested the TDX_streamreach_basins_mnsi files to a tdxbasins table, which I imagine will be used for Global RWD based on a forthcoming algorithm. We may also use these basins as Global HUC equivalents.
Here are a couple of questions I had:
TDX_streams_no_basin dataset? I have not yet ingested it. Should I add these to the tdxstreams table?
@rajadain, that's great news.
LINKNO serves as the primary key for all tables, so it should definitely be indexed or possibly even get set to the Feature ID (if that is a thing in PostGIS).
ROOT_ID is used for quickly subsetting the dataset for delineation (i.e. find nearest LINKNO and then select all records that share the same ROOT_ID). So it should probably also be indexed (although I'm not as familiar with PostgreSQL indexing).
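For anyone who wants to see how these fields could fit together, below is a toy sketch in plain Python. It is not the pipeline's implementation (the example notebook is the reference for that). It assumes a conventional nested-set layout: a depth-first search from each outlet moving upstream, a counter that restarts per watershed, and FINISH_TIME equal to the largest DISCOVER_TIME in the reach's upstream subtree. The network, the -1 outlet marker, and the function names are all made up for illustration.

```python
from collections import defaultdict


def build_mnsi(dslinkno: dict[int, int]) -> dict[int, dict[str, int]]:
    """Assign ROOT_ID, DISCOVER_TIME and FINISH_TIME to each reach.

    `dslinkno` maps LINKNO -> downstream LINKNO; a negative value marks an
    outlet (assumption for this toy example).
    """
    upstream = defaultdict(list)
    roots = []
    for linkno, ds in dslinkno.items():
        if ds < 0:
            roots.append(linkno)
        else:
            upstream[ds].append(linkno)

    index: dict[int, dict[str, int]] = {}
    for root in sorted(roots):
        clock = 0  # the counter restarts for every watershed (assumption)
        stack = [(root, False)]
        while stack:
            node, finished = stack.pop()
            if finished:
                # All upstream reaches visited: record the last discover time.
                index[node]["FINISH_TIME"] = clock
                continue
            clock += 1
            index[node] = {"ROOT_ID": root, "DISCOVER_TIME": clock}
            stack.append((node, True))
            for up in sorted(upstream[node], reverse=True):
                stack.append((up, False))
    return index


def delineate(index: dict[int, dict[str, int]], linkno: int) -> list[int]:
    """Return the target reach plus every reach upstream of it."""
    target = index[linkno]
    return sorted(
        other
        for other, rec in index.items()
        # ROOT_ID narrows the search to one watershed; the time window
        # then selects the upstream subtree within that watershed.
        if rec["ROOT_ID"] == target["ROOT_ID"]
        and target["DISCOVER_TIME"] <= rec["DISCOVER_TIME"] <= target["FINISH_TIME"]
    )


# Two toy watersheds: outlet 1 (reaches 1-5) and outlet 10 (reaches 10-11)
toy_network = {1: -1, 2: 1, 3: 1, 4: 2, 5: 2, 10: -1, 11: 10}
mnsi = build_mnsi(toy_network)
upstream_of_2 = delineate(mnsi, 2)
```

In this toy layout the time values overlap between the two watersheds, which is why the ROOT_ID filter has to come first. In a database the same selection would be an equality filter on ROOT_ID plus a range filter on DISCOVER_TIME.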
The geometries in the TDX_streamreach_basins_mnsi.parquet are reach-level, so more equivalent to NHDPlus catchments.
We developed the DISSOLVE_ROOT_ID to serve a similar purpose as a HUC. There are typically 200 LINKNO records for every unique DISSOLVE_ROOT_ID. So DISSOLVE_ROOT_ID should also be indexed. Our plan is to create a new set of simplified geometries for these, but I think we wanted to explore performance with the raw data first to decide if this was necessary.
@rajadain, please see our new example notebook, examples/5_DelineateWatershed.ipynb, for a walk-through on how to use our new fields for watershed delineation.
In my last commit, 3d7c0c28f537f5994572e57b514a400a29035461, I also demonstrated how to use the DISSOLVE_ROOT_ID and TopoSimplify to create HUC-like boundaries that could be used as an intermediate for rapid unions of basin polygons into a watershed boundary, if necessary.
Also, when using the gdf.dissolve() function, I found an 18.5x speedup with the method="coverage" option, optimized for non-overlapping polygons. I confirmed that this is appropriate for our dataset as it does not produce any invalid geometries.
Summary
Much of the initial groundwork for processing the TDX Hydro files has been laid under issues #2, #3, #4 and with PRs #5 and #6. It's time to stitch that work together into a processing pipeline that modifies the raw TDX Hydro files by dropping and renaming fields, creating a global LINKNO/streamID, and adding the modified nested set index information.
Closure Criteria