jameshalgren opened this issue 1 year ago
We are working on a pipeline to develop forcing inputs for the community NextGen framework. The code snippets below should take about 5 minutes to execute, if you want to try them out.
Please see this for a geopkg-to-ngen input converter. With that script, the .gpkg files in the AWS catalog for the current NWM hydrofabric artifacts (see also the hydrofabric github.io page) can be turned into the catchment_data.geojson and nexus_data.geojson input files for a NextGen execution. There is only one other main input file, the realization.json, that needs to be developed (along with the forcing information, of course).
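For reference, a minimal sketch of that conversion with geopandas (the "divides" and "nexus" layer names are assumptions based on the v1.2 gpkg; the actual converter script may do more):

```python
# Hypothetical sketch: export hydrofabric layers to the two ngen GeoJSON inputs.
# Layer names "divides" and "nexus" are assumptions, not confirmed from the script.
import geopandas as gpd

gpkg = "nextgen_03W.gpkg"
gpd.read_file(gpkg, layer="divides").to_file("catchment_data.geojson", driver="GeoJSON")
gpd.read_file(gpkg, layer="nexus").to_file("nexus_data.geojson", driver="GeoJSON")
```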
And... those forcing files are what this issue is about!
Here's what we have so far:
We can download the forcing data pre-processed for the current NWM from the GCP bucket, e.g.:
```
wget https://storage.googleapis.com/national-water-model/nwm.20220901/forcing_medium_range/nwm.t00z.medium_range.forcing.f001.conus.nc
```
(Note: there are cooler ways to do this which we are exploring in other experiments, but for now, it is easy to just download a few files and work with that.)
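If you do want several files at once, a small loop works. A sketch, where the forecast-hour range and filename pattern are assumptions extrapolated from the example URL above (verify against the bucket listing):

```python
# Sketch: download the first few medium-range forcing files from the GCP bucket.
# The f001..f006 range and the naming pattern are assumptions based on the
# single example URL above -- check the bucket listing before relying on this.
import urllib.request

base = ("https://storage.googleapis.com/national-water-model/"
        "nwm.20220901/forcing_medium_range/")
for fhr in range(1, 7):  # first six forecast hours
    fname = f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc"
    urllib.request.urlretrieve(base + fname, fname)
```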
Also get the .gpkg file (for Alabama, of course!):
```
wget https://nextgen-hydrofabric.s3.amazonaws.com/v1.2/nextgen_03W.gpkg
```
```python
import geopandas as gpd

f_03 = "nextgen_03W.gpkg"
gpkg_03w_divides = gpd.read_file(f_03, layer="divides")
```
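One thing worth checking before doing any zonal statistics: the divides need to be in the same CRS as the forcing grid. A quick sanity-check sketch (the reprojection step is only needed if the CRSs actually differ):

```python
# Sketch: check the divides' CRS; it must match the forcing grid's CRS
# before running zonal statistics, or the results will be wrong or all-NaN.
print(gpkg_03w_divides.crs)
# If it differs from the raster CRS (available once the forcing file is
# opened below, as fc_xds.rio.crs), reproject with:
# gpkg_03w_divides = gpkg_03w_divides.to_crs(fc_xds.rio.crs)
```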
We tried this two ways. The first way relies on an intermediate file for the variable. It's fast, and it helps with debugging. Note that the TIFF export turns things into scaled integers, and we didn't figure out how to handle that if it's important (it would be, but we were aiming for method #2 anyway; see the sketch after the code below).
```python
import pandas as pd
import rasterio
import xarray as xr
from rasterstats import zonal_stats

fc_nc = "nwm.t00z.medium_range.forcing.f001.conus.nc"

# TODO LATER: Look more into the differences between these input methods
# https://docs.xarray.dev/en/stable/generated/xarray.open_rasterio.html
# https://corteva.github.io/rioxarray/stable/examples/convert_to_raster.html
# https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html
# NOTE: rioxarray must be installed for the "rasterio" engine to work:
# import rioxarray as rxr
reng = "rasterio"
fc_xds = xr.open_dataset(fc_nc, engine=reng)
# NOTE: This is the same: fc_xds = rxr.open_rasterio(fc_nc)
# NOTE: This doesn't work (the engine specification is important): fc_xds = xr.open_dataset(fc_nc)

# Export one variable to a GeoTIFF, then read it back with rasterio
fc_xds["LWDOWN"].rio.to_raster("test1.tif")
with rasterio.open("test1.tif") as src:
    affine = src.transform
    array = src.read(1)

df_zonal_stats = pd.DataFrame(zonal_stats(gpkg_03w_divides, array, affine=affine))
# add the statistics back onto the original GeoDataFrame
gdf2 = pd.concat([gpkg_03w_divides, df_zonal_stats], axis=1)
```
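On the scaled-integer question from above: if the exported TIFF carries scale/offset metadata, rasterio exposes it per band, so the raw values can be unscaled before computing statistics. A sketch, assuming that metadata is actually present in the exported file:

```python
# Sketch: apply scale/offset metadata, if present, to recover physical units.
# rasterio exposes per-band scales and offsets; the defaults are 1.0 and 0.0,
# so this is a no-op when the band is not scaled.
with rasterio.open("test1.tif") as src:
    raw = src.read(1)
    array = raw * src.scales[0] + src.offsets[0]
```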
The second way avoids the intermediate file, which is good for streamlining but probably harder to debug:
```python
with xr.open_dataset(fc_nc, engine=reng) as _xds:
    _src = _xds.LWDOWN
    _aff2 = _src.rio.transform()
    _arr2 = _src.values[0]  # first (and only) band/time slice

_df_zonal_stats = pd.DataFrame(zonal_stats(gpkg_03w_divides, _arr2, affine=_aff2))
# add the statistics back onto the original GeoDataFrame
gdf3 = pd.concat([gpkg_03w_divides, _df_zonal_stats], axis=1)
gdf3.to_csv("zonal_stats_03W.csv")  # note: filename is illustrative
```
The output looks like this. We can use the mean value (or perhaps some other statistic... we'll need to think about that) as the value to bring over to the csv for this time-step for each catchment.
```
(Pdb) gdf3
id areasqkm type toid geometry min max mean count
0 cat-107068 14.698357 network nex-107069 MULTIPOLYGON (((1044735.000 841095.000, 104482... 199.125 199.333 199.233917 12
1 cat-107233 8.058603 network nex-107234 MULTIPOLYGON (((1077765.001 1061085.003, 10778... 192.494 192.550 192.525000 5
2 cat-107683 17.203060 network nex-107684 MULTIPOLYGON (((1038915.001 1106985.003, 10387... 193.985 194.136 194.069125 16
3 cat-108167 15.632991 network nex-108168 MULTIPOLYGON (((1094685.002 1036725.001, 10943... 192.026 192.298 192.182143 14
4 cat-112977 17.297992 network nex-112978 MULTIPOLYGON (((736335.000 1096815.003, 736184... 187.651 188.667 188.019111 18
... ... ... ... ... ... ... ... ... ...
30867 cat-137554 0.022500 coastal cnx-137554 MULTIPOLYGON (((648015.005 830264.997, 648074.... NaN NaN NaN 0
30868 cat-106885 136.690204 internal inx-106885 MULTIPOLYGON (((986774.996 876974.997, 986835.... 197.238 197.934 197.502941 135
30869 cat-137071 0.848702 coastal cnx-137071 MULTIPOLYGON (((1152974.999 851804.998, 115300... 195.405 195.405 195.405000 1
30870 cat-137068 0.007199 coastal cnx-137068 MULTIPOLYGON (((1153095.000 851805.003, 115300... NaN NaN NaN 0
30871 cat-137748 5.238898 coastal cnx-137748 MULTIPOLYGON (((621914.998 811425.005, 621914.... 272.608 285.758 279.119333 6

[30872 rows x 9 columns]
```
We need to loop through the variables and forecast times (or retrospective times) and write a giant concatenated .csv or .nc with the right information.
Or... figure out how to do streaming. But one step at a time.
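A rough sketch of what that loop could look like (the variable list, file pattern, and long-table output layout below are assumptions, not a settled format):

```python
# Sketch: loop over forcing variables and forecast hours, collecting the
# zonal mean per catchment into one long table. The variable names and the
# forecast-hour range are assumptions to be checked against the real files.
import pandas as pd
import xarray as xr
from rasterstats import zonal_stats

variables = ["LWDOWN", "SWDOWN", "T2D", "Q2D", "U2D", "V2D", "PSFC", "RAINRATE"]
rows = []
for fhr in range(1, 7):
    fc_nc = f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc"
    with xr.open_dataset(fc_nc, engine="rasterio") as xds:
        for var in variables:
            src = xds[var]
            stats = zonal_stats(gpkg_03w_divides, src.values[0],
                                affine=src.rio.transform(), stats="mean")
            for cat_id, s in zip(gpkg_03w_divides["id"], stats):
                rows.append({"id": cat_id, "fhr": fhr,
                             "variable": var, "value": s["mean"]})

df_long = pd.DataFrame(rows)
df_long.to_csv("forcing_zonal_means.csv", index=False)  # filename illustrative
```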
Ping @ZacharyWills @hellkite500
This is pretty slow... so we'll want to multithread it. We can use the kerchunk tools, perhaps, to do collections of files at a time. Are there ways we can split up the calculation into clusters of inputs? (run 5 days, pause, run 5 more days, etc.?) Building the input files in clusters could make this easier to stream, grabbing the source files in manageable bunches.
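As a starting point for that, a process pool over files could split the work (a sketch, assuming a hypothetical process_one_file helper that wraps the zonal-stats steps above; this is not a settled design):

```python
# Sketch: parallelize the per-file zonal-stats work across processes.
# process_one_file is a hypothetical wrapper around the steps shown above.
from concurrent.futures import ProcessPoolExecutor

def process_one_file(fc_nc):
    # open fc_nc, compute zonal means for each variable, return a DataFrame
    ...

files = [f"nwm.t00z.medium_range.forcing.f{fhr:03d}.conus.nc"
         for fhr in range(1, 7)]
with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_one_file, files))
```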
We'll need to pay close attention to the `transform` parameter if we take this into production somehow. This post has a detailed discussion: https://github.com/perrygeo/python-rasterstats/blob/master/docs/notebooks/Integrating%20with%20GeoPandas%20and%20Numpy.ipynb
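A quick way to eyeball the transform being fed to zonal_stats (just a sanity-check sketch; the expectation of a negative y term assumes a north-up grid):

```python
# Sketch: inspect the affine transform used for zonal_stats.
# For a north-up grid, the e term (y pixel size) should be negative;
# a positive e is a common source of vertically-flipped zonal results.
with xr.open_dataset(fc_nc, engine="rasterio") as xds:
    print(xds.LWDOWN.rio.transform())  # Affine(a, b, c, d, e, f)
```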
We compared the variable names between the forcing variables of the ngen example files (https://github.com/NOAA-OWP/ngen/tree/master/data/forcing) and the forcing variables extracted using the above script. The ngen example forcing includes both precip_rate and APCP_surface, while the forcing extracted by our script (from the GCP bucket) includes only RAINRATE (which seems to be the same as precip_rate).
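That comparison can be done programmatically along these lines (a sketch; the ngen example CSV filename below is a placeholder, so substitute a real file from the data/forcing directory):

```python
# Sketch: list variable names from both sources to compare them.
# The ngen example CSV filename below is a placeholder, not a real path.
import pandas as pd
import xarray as xr

with xr.open_dataset("nwm.t00z.medium_range.forcing.f001.conus.nc",
                     engine="rasterio") as xds:
    print(sorted(xds.data_vars))  # NWM forcing variables, e.g. LWDOWN, RAINRATE, ...

ngen_example = pd.read_csv("example_forcing.csv")  # placeholder from data/forcing
print(ngen_example.columns.tolist())  # includes precip_rate and APCP_surface
```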
TL;DR: How do we convert the current NWM forcing data into inputs for the ngen catchments?