WikiWatershed / global-hydrography

Scripts to explore and process global hydrography (stream lines and basin boundaries) for Model My Watershed

Preprocessing pipeline for TDX Hydro Files #8

ptomasula opened this issue 3 months ago

ptomasula commented 3 months ago

Summary

Much of the initial groundwork for processing the TDX Hydro files has been laid under issues #2, #3, #4 and with PRs #5 and #6. It's time to stitch that work together into a processing pipeline that modifies the raw TDX Hydro files by dropping and renaming fields, creating globally unique LINKNO/streamID values, and adding the modified nested set index information.

Closure Criteria

aufdenkampe commented 3 months ago

@rajadain, we have finalized our processing pipeline for the TDX Hydro stream network (`streamnet`) and corresponding basins (`streamreach_basins`) files. We are presently running the full set of files for the globe, which should be completed this afternoon.

In the meantime, here is an example set of the three GeoParquet files that will be produced for each of the 62 TDX Hydro Regions (listed in tdx_regions.parquet):

These supersede the files we shared with you two weeks ago under https://github.com/WikiWatershed/global-hydrography/issues/4#issuecomment-2234348477.

These files are substantially compressed relative to NGA's original GeoPackage files.

@rajadain, could you work with your team to:

We are getting close to delivering this to you:

All of the above will likely benefit from using a parallel set of simplified geometries, which we are also exploring.

For now, read these files using the `gpd.read_parquet(geoparquet_path)` method, but we can speed up reading the geometry fields 2x by reading them as `pyarrow.Table`s, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need to have GDAL 3.9 installed, as described in that comment.
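For illustration, a minimal reading sketch; the file name is hypothetical, and the attribute-only shortcut is an assumption (the full 2x geometry-reading approach is in the linked comment):

```python
import geopandas as gpd
import pyarrow.parquet as pq

# Hypothetical regional file name; substitute a real path.
path = "TDX_streamnet_mnsi.parquet"

# Standard GeoPandas read: returns a GeoDataFrame with parsed geometries.
gdf = gpd.read_parquet(path)

# For attribute-only work (no geometry), reading the raw pyarrow.Table
# and selecting columns skips geometry parsing entirely.
table = pq.read_table(path, columns=["LINKNO", "DSLINKNO", "ROOT_ID"])
df = table.to_pandas()
```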

aufdenkampe commented 3 weeks ago

from @ptomasula's Oct 4 email to @rajadain:

We have uploaded parquet files with the modified nested set index (MNSI) information for 61 of the 62 TDXHydro regions to that S3 bucket. The missing files (region 5020054880) cover a region in Australia and failed during our initial run of the processing pipeline. We still wanted to get the bulk of the data over to you, since it will likely take some time to download and integrate into the system. We'll investigate that last file next week and get it over to you soon.

Anthony outlined a fair bit of this under this issue when he provided you with an example set of files, but I think it's worth repeating here. For each TDXHydro region there are three files:

In addition to the TDXHydro data fields, these files also each contain the MNSI fields. We’ll send a follow-up email with additional information and instructions on how to leverage the fields for delineation algorithms, but here is a brief explanation of the fields we have added.

For the basin files, there are also two additional fields to support pre-dissolving basin geometries and improve delineation performance.

Lastly, we have converted the index values in LINKNO, DSLINKNO, USLINKNO1, and USLINKNO2 into a globally unique version. You may recall that the index as provided by TDXHydro is only unique within a given region; however, we need a globally unique identifier for the entire dataset. We have applied logic based on the Geoglows V2 approach, using the following equation: `LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)`.
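As a concrete sketch of that conversion (only the equation comes from this thread; the function name and the sentinel-value caveat are assumptions):

```python
# Convert a region-local TDX Hydro index into a globally unique one:
#   LINKNO_NEW = LINKNO_OLD + (TDX_HEADER_NUMBER * 10_000_000)
def to_global_linkno(linkno_old: int, tdx_header_number: int) -> int:
    return linkno_old + tdx_header_number * 10_000_000

# Example: local LINKNO 42 in the region with header number 102
# becomes 1_020_000_042.
assert to_global_linkno(42, 102) == 1_020_000_042

# Caveat (assumption): sentinel values such as DSLINKNO == -1 for
# "no downstream link" presumably must be left untouched.
```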

rajadain commented 3 weeks ago

@ptomasula @aufdenkampe

Thanks for the info. I was able to ingest the GeoParquet files into PostGIS after some trial and error.

I ingested the TDX_streamnet_mnsi files to a tdxstreams table which will be used for analyzing streams, and for visualizing blue lines (still working on styling updates recommended in https://github.com/WikiWatershed/model-my-watershed/issues/3625#issuecomment-2371838760). I've added an index on stream_order (renamed from strmorder for consistency with NHD tables) to help with the visualization.

I ingested the TDX_streamreach_basins_mnsi files to a tdxbasins table, which I imagine will be used for Global RWD based on a forthcoming algorithm. We may also use these basins as Global HUC equivalents.

Here are a couple of questions I had:

  1. What should I do with the TDX_streams_no_basin dataset? I have not yet ingested it. Should I add these to the tdxstreams table?
  2. What additional fields (e.g. ROOT_ID, LINKNO, etc.) should we add indexes to?
aufdenkampe commented 3 weeks ago

@rajadain, that's great news.

LINKNO serves as the primary key for all tables, so it should definitely be indexed, or possibly even set as the feature ID (if that is a thing in PostGIS).

ROOT_ID is used for quickly subsetting the dataset for delineation (i.e. find nearest LINKNO and then select all records that share the same ROOT_ID). So it should probably also be indexed (although I'm not as familiar with PostgreSQL indexing).
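A minimal sketch of that subsetting pattern in GeoPandas (the file path, click point, and nearest-feature lookup are illustrative assumptions, not the pipeline's actual code):

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical path to a regional streams file.
streams = gpd.read_parquet("TDX_streamnet_mnsi.parquet")

# 1. Find the stream reach nearest the point of interest.
click = Point(-75.16, 39.95)
nearest_idx = streams.distance(click).idxmin()
root_id = streams.loc[nearest_idx, "ROOT_ID"]

# 2. Select all records sharing that ROOT_ID; delineation then only
#    has to walk this much smaller sub-network.
subset = streams[streams["ROOT_ID"] == root_id]
```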

The geometries in the TDX_streamreach_basins_mnsi.parquet are reach-level, so more equivalent to NHDplus catchments.

We developed the DISSOLVE_ROOT_ID to serve a similar purpose as a HUC. There are typically ~200 LINKNO records for every unique DISSOLVE_ROOT_ID, so DISSOLVE_ROOT_ID should also be indexed. Our plan is to create a new set of simplified geometries for these, but I think we wanted to explore performance with the raw data first to decide whether that is necessary.
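Putting those indexing suggestions together, a hedged PostGIS sketch (connection parameters are placeholders, and the lower-cased column names after ingest are an assumption):

```python
import psycopg2

# Placeholder connection string.
conn = psycopg2.connect("dbname=mmw user=postgres")
with conn, conn.cursor() as cur:
    # LINKNO is the primary key for all tables.
    cur.execute("ALTER TABLE tdxstreams ADD PRIMARY KEY (linkno);")
    # ROOT_ID supports fast subsetting for delineation.
    cur.execute("CREATE INDEX ON tdxstreams (root_id);")
    # DISSOLVE_ROOT_ID groups ~200 reaches into HUC-like units.
    cur.execute("CREATE INDEX ON tdxbasins (dissolve_root_id);")
```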

aufdenkampe commented 2 weeks ago

@rajadain, please see our new example notebook, examples/5_DelineateWatershed.ipynb, for a walk-through on how to use our new fields for watershed delineation.

In my last commit, 3d7c0c28f537f5994572e57b514a400a29035461, I also demonstrated how to use the DISSOLVE_ROOT_ID and TopoSimplify to create HUC-like boundaries that could be used as an intermediate for rapid unions of basin polygons into a watershed boundary, if necessary.

Also, when using the `gdf.dissolve()` function, I found an 18.5x speedup with the `method="coverage"` option, which is optimized for non-overlapping polygons. I confirmed that this is appropriate for our dataset, as it does not produce any invalid geometries.
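A sketch of that dissolve call (file and column names follow the thread above; the TopoSimplify step from the notebook is omitted):

```python
import geopandas as gpd

# Hypothetical path to a regional basins file.
basins = gpd.read_parquet("TDX_streamreach_basins_mnsi.parquet")

# method="coverage" uses a coverage union, which assumes the input
# polygons form a valid non-overlapping coverage -- confirmed above
# for this dataset. Requires a recent GeoPandas version.
huc_like = basins.dissolve(by="DISSOLVE_ROOT_ID", method="coverage")
```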