WikiWatershed / global-hydrography

Scripts to explore and process global hydrography (stream lines and basin boundaries) for Model My Watershed

Preprocessing pipeline for TDX Hydro Files #8

ptomasula opened this issue 3 months ago

ptomasula commented 3 months ago

Summary

Much of the initial groundwork for processing the TDX Hydro files has been laid under issues #2, #3, #4 and with PRs #5 and #6. It's time to stitch that work together into a processing pipeline that modifies the raw TDX Hydro files by dropping and renaming fields, creating globally unique LINKNO/streamID values, and adding the modified nested set index information.

Closure Criteria

aufdenkampe commented 3 months ago

@rajadain, we have finalized our processing pipeline for the TDX Hydro stream network (`streamnet`) and corresponding basins (`streamreach_basins`) files. We are presently running the full set of files for the globe, which should be completed this afternoon.

In the meantime, here is an example set of the three GeoParquet files that will be produced for each of the 62 TDX Hydro Regions (listed in tdx_regions.parquet):

These supersede the files we shared with you two weeks ago under https://github.com/WikiWatershed/global-hydrography/issues/4#issuecomment-2234348477.

These files are substantially compressed relative to NGA's original GeoPackage files.

@rajadain, could you work with your team to:

We are getting close to delivering this to you:

All of the above will likely benefit from using a parallel set of simplified geometries, which we are also exploring.

For now, read these files using the `gpd.read_parquet(geoparquet_path)` method, but we can speed up reading the geometry fields 2x by reading them as `pyarrow.Table`s, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need to have GDAL 3.9 installed, as described in that comment.
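For illustration, a minimal reading sketch; the file name is hypothetical, and the attribute-only shortcut is an assumption (the full 2x geometry-reading approach is in the linked comment):

```python
import geopandas as gpd
import pyarrow.parquet as pq

# Hypothetical regional file name; substitute a real path.
path = "TDX_streamnet_mnsi.parquet"

# Standard GeoPandas read: returns a GeoDataFrame with parsed geometries.
gdf = gpd.read_parquet(path)

# For attribute-only work (no geometry), reading the raw pyarrow.Table
# and selecting columns skips geometry parsing entirely.
table = pq.read_table(path, columns=["LINKNO", "DSLINKNO", "ROOT_ID"])
df = table.to_pandas()
```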

aufdenkampe commented 3 weeks ago

from @ptomasula's Oct 4 email to @rajadain:

We have uploaded parquet files with the modified nested set index (MNSI) information for 61 of the 62 TDXHydro regions to that S3 bucket. The missing files (region 5020054880) cover a region in Australia and failed during our initial run of the processing pipeline. We still wanted to get the bulk of the data over to you, since it will likely take some time to download and integrate into the system. We'll investigate that last file next week and get it over to you soon.

Anthony outlined a fair bit of this under this issue when he provided you with an example set of files, but I think it's worth repeating here. For each TDXHydro region there are three files:

In addition to the TDXHydro data fields, these files also each contain the MNSI fields. We’ll send a follow-up email with additional information and instructions on how to leverage the fields for delineation algorithms, but here is a brief explanation of the fields we have added.

For the basin files, there are also two additional fields to support pre-dissolving basin geometries and improve delineation performance.

Lastly, we have converted the index values in LINKNO, DSLINKNO, USLINKNO1, and USLINKNO2 into a globally unique version. You may recall that the index as provided by TDXHydro is only unique within a given region; however, we need a globally unique identifier for the entire dataset. We have applied logic based on the Geoglows V2 approach, using the following equation: `LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)`.
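As a concrete sketch of that conversion (only the equation comes from this thread; the function name and the sentinel-value caveat are assumptions):

```python
# Convert a region-local TDX Hydro index into a globally unique one:
#   LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)
def to_global_linkno(linkno_old: int, tdx_header_number: int) -> int:
    return linkno_old + tdx_header_number * 10_000_000

# Example: local LINKNO 42 in the region with header number 102
# becomes 1_020_000_042.
assert to_global_linkno(42, 102) == 1_020_000_042

# Caveat (assumption): sentinel values such as DSLINKNO == -1 for
# "no downstream link" presumably must be left untouched.
```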

rajadain commented 3 weeks ago

@ptomasula @aufdenkampe

Thanks for the info. I was able to ingest the GeoParquet files into PostGIS after some trial and error.

I ingested the TDX_streamnet_mnsi files to a tdxstreams table which will be used for analyzing streams, and for visualizing blue lines (still working on styling updates recommended in https://github.com/WikiWatershed/model-my-watershed/issues/3625#issuecomment-2371838760). I've added an index on stream_order (renamed from strmorder for consistency with NHD tables) to help with the visualization.

I ingested the TDX_streamreach_basins_mnsi files to a tdxbasins table, which I imagine will be used for Global RWD based on a forthcoming algorithm. We may also use these basins as Global HUC equivalents.

Here are a couple of questions I had:

  1. What should I do with the TDX_streams_no_basin dataset? I have not yet ingested it. Should I add these to the tdxstreams table?
  2. What additional fields (e.g. ROOT_ID, LINKNO, etc.) should we add indexes to?
aufdenkampe commented 3 weeks ago

@rajadain, that's great news.

LINKNO serves as the primary key for all tables, so it should definitely be indexed, or possibly even set as the feature ID (if that is a thing in PostGIS).

ROOT_ID is used for quickly subsetting the dataset for delineation (i.e. find nearest LINKNO and then select all records that share the same ROOT_ID). So it should probably also be indexed (although I'm not as familiar with PostgreSQL indexing).
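A minimal sketch of that subsetting pattern in GeoPandas (the file path, click point, and nearest-feature lookup are illustrative assumptions, not the pipeline's actual code):

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical path to a regional streams file.
streams = gpd.read_parquet("TDX_streamnet_mnsi.parquet")

# 1. Find the stream reach nearest the point of interest.
click = Point(-75.16, 39.95)
nearest_idx = streams.distance(click).idxmin()
root_id = streams.loc[nearest_idx, "ROOT_ID"]

# 2. Select all records sharing that ROOT_ID; delineation then only
#    has to walk this much smaller sub-network.
subset = streams[streams["ROOT_ID"] == root_id]
```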

The geometries in the TDX_streamreach_basins_mnsi.parquet are reach-level, so more equivalent to NHDplus catchments.

We developed the DISSOLVE_ROOT_ID to serve a similar purpose as a HUC. There are typically ~200 LINKNO records for every unique DISSOLVE_ROOT_ID, so DISSOLVE_ROOT_ID should also be indexed. Our plan is to create a new set of simplified geometries for these, but I think we wanted to explore performance with the raw data first to decide whether that is necessary.
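Putting those indexing suggestions together, a hedged PostGIS sketch (connection parameters are placeholders, and the lower-cased column names after ingest are an assumption):

```python
import psycopg2

# Placeholder connection string.
conn = psycopg2.connect("dbname=mmw user=postgres")
with conn, conn.cursor() as cur:
    # LINKNO is the primary key for all tables.
    cur.execute("ALTER TABLE tdxstreams ADD PRIMARY KEY (linkno);")
    # ROOT_ID supports fast subsetting for delineation.
    cur.execute("CREATE INDEX ON tdxstreams (root_id);")
    # DISSOLVE_ROOT_ID groups ~200 reaches into HUC-like units.
    cur.execute("CREATE INDEX ON tdxbasins (dissolve_root_id);")
```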

aufdenkampe commented 2 weeks ago

@rajadain, please see our new example notebook, examples/5_DelineateWatershed.ipynb, for a walk-through on how to use our new fields for watershed delineation.

In my last commit, 3d7c0c28f537f5994572e57b514a400a29035461, I also demonstrated how to use the DISSOLVE_ROOT_ID and TopoSimplify to create HUC-like boundaries that could be used as an intermediate for rapid unions of basin polygons into a watershed boundary, if necessary.

Also, when using the `gdf.dissolve()` function, I found an 18.5x speedup with the `method="coverage"` option, which is optimized for non-overlapping polygons. I confirmed that this is appropriate for our dataset, as it does not produce any invalid geometries.
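A sketch of that dissolve call (file and column names follow the thread above; the TopoSimplify step from the notebook is omitted):

```python
import geopandas as gpd

# Hypothetical path to a regional basins file.
basins = gpd.read_parquet("TDX_streamreach_basins_mnsi.parquet")

# method="coverage" uses a coverage union, which assumes the input
# polygons form a valid non-overlapping coverage -- confirmed above
# for this dataset. Requires a recent GeoPandas version.
huc_like = basins.dissolve(by="DISSOLVE_ROOT_ID", method="coverage")
```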