Closed JordanLaserGit closed 8 months ago
Hi @JordanLaserGit, @program--
https://github.com/LynkerIntel/hfsubset/issues/28
Please take a look at this new capability and data source and let me know if it's looking like what you need. I have to say it is pretty slick, and it actually reverts to our initial data products from before the forcing workstream!
I think the details might need a little TLC, but let's see how you guys use them and we can adjust:
# Download per the link you shared and read in ...
fg = terra::rast('/Users/mjohnson/Downloads/nwm.t00z.medium_range.forcing.f001.conus.nc')
# Define divide IDs however you like; here is a two-divide example
divs = c("cat-1866993", "cat-1866995")
# Extract precomputed weights for the forcing input and divides
w = arrow::open_dataset('s3://lynker-spatial/v20.1/forcing_weights.parquet') |>
dplyr::filter(divide_id %in% divs, grid_id == "medium_range.forcing.conus") |>
dplyr::collect()
# Extract cell level values from forcing grid
x = zonal::weight_grid_to_data(fg, w = w)
# Explore output
head(x)
#> divide_id cell coverage_fraction grid_id U2D
#> 1: cat-1866993 7632614 0.006292699 medium_range.forcing.conus -1.083
#> 2: cat-1866993 7632615 0.083473243 medium_range.forcing.conus -1.089
#> 3: cat-1866993 7637220 0.007748126 medium_range.forcing.conus -1.039
#> 4: cat-1866993 7637221 0.167723924 medium_range.forcing.conus -1.045
#> 5: cat-1866993 7637222 0.868241370 medium_range.forcing.conus -1.052
#> 6: cat-1866993 7637223 0.815731525 medium_range.forcing.conus -1.058
#> V2D LWDOWN RAINRATE T2D Q2D PSFC SWDOWN
#> 1: -1.317 213.503 2.576764e-06 259.58 0.001265 83949.5 0
#> 2: -1.266 213.116 2.609972e-06 259.48 0.001254 84060.7 0
#> 3: -1.385 214.007 3.093114e-06 259.78 0.001284 83776.8 0
#> 4: -1.347 213.779 3.045424e-06 259.65 0.001272 83803.0 0
#> 5: -1.307 213.526 3.004070e-06 259.55 0.001262 83893.8 0
#> 6: -1.267 213.249 2.953562e-06 259.46 0.001253 84004.0 0
# Or, re-scale on the fly...
(x2 = zonal::execute_zonal(fg, w = w, ID = "divide_id"))
#> divide_id grid_id U2D V2D LWDOWN
#> 1: cat-1866993 medium_range.forcing.conus -0.9880872 -1.219789 213.9363
#> 2: cat-1866995 medium_range.forcing.conus -0.9106252 -1.148223 215.3706
#> RAINRATE T2D Q2D PSFC SWDOWN
#> 1: 4.062485e-06 259.5877 0.001264434 84141.37 0
#> 2: 5.190276e-06 259.7551 0.001279137 84386.35 0
# You can find the valid grid data structures in the top level JSON
jsonlite::fromJSON('https://lynker-spatial.s3.amazonaws.com/v20.1/forcing_grids.json',
simplifyDataFrame = TRUE)
#> grid_id schema X1 Xn Y1 Yn resX
#> 1 medium_range.forcing.conus climateR -2303999 2304001 -1920000 1920000 1000
#> resY ncols nrows
#> 1 1000 4608 3840
#> crs
#> 1 +proj=lcc +lat_0=40 +lon_0=-97 +lat_1=30 +lat_2=60 +x_0=0 +y_0=0 +R=6370000 +units=m +no_defs
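Those grid parameters are enough to map the cell indices in the weights table back to projected coordinates. A minimal Python sketch, assuming zero-based, row-major cell numbering starting at the grid's upper-left corner and treating X1/Yn as outer edges (assumptions for illustration only; verify the actual convention against zonal's output before relying on it):

```python
# Grid metadata taken from forcing_grids.json (medium_range.forcing.conus)
X1, Yn = -2303999, 1920000   # western edge, northern edge (assumed)
resX, resY = 1000, 1000
ncols = 4608

def cell_to_xy(cell):
    """Map a flat cell index to projected cell-center coordinates.

    Assumes zero-based, row-major numbering with row 0 at the top of
    the grid -- a hypothetical convention used here for illustration.
    """
    row, col = divmod(cell, ncols)
    x = X1 + (col + 0.5) * resX   # center of column `col`
    y = Yn - (row + 0.5) * resY   # rows counted downward from Yn
    return x, y

print(cell_to_xy(7632614))  # first cell of cat-1866993 in the output above
```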
Created on 2024-02-06 with reprex v2.0.2
attn: @ZacharyWills
@mikejohnson51 this is exactly what I was hoping for! will implement this week and i'll get back to you. thank you sir!
@mikejohnson51 Thoughts on reformatting the parquet to something like this?
Motivation for me is the forcingprocessor has an internal loop through catchments. It'll need to do a seek for each catchment key in the current file layout. I can prepare my own file from s3://lynker-spatial/v20.1/forcing_weights.parquet easily if it's preferable to keep the current data layout.
Not opposed per se, though that looks a lot like JSON rather than Parquet, based on my rudimentary knowledge. I would consider the advantages of the tabular form over a loop for a few reasons that were the basis for zonal. I can spit out a converted file for testing, though!
oh it is json! i phrased that poorly. no worries, I can create my own for our conus daily run. main purpose of this was to have standardized weights so it's all good! thanks @mikejohnson51
@JordanLaserGit just to elaborate a bit on that top point, I think working from the table can be very fast and, I would wager, a preferred approach to looping. Below is a CONUS regrid on my local laptop, with no multiprocessing, parallelization, or all the stuff you guys rock at. If it's of interest, maybe we can write an execute_zonal equivalent for your Python needs?
So I imagine with the way you can distribute processing, you could rescale CONUS for an 18-hour forecast in a minute over a few processors. I'm now out of my depth :)
man that's fast! forcingprocessor takes about 40 seconds to create something equivalent. With 24 cores it takes a little over 3 minutes to read in 24 hours' worth of NWM forcing data and write the 800k catchment CSVs out. This is a small portion of the total datastream runtime, so not a pressing issue, but if zonal is faster, it probably makes sense to move away from my Python method and towards zonal in the long run.
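For a rough sense of why the tabular layout helps: the per-catchment aggregation collapses into one grouped pass over the rows, sum(w * v) / sum(w) per divide, instead of a seek per catchment key. A stdlib-only Python sketch with made-up rows standing in for the weights table joined to one forcing variable (not real weights):

```python
from collections import defaultdict

# (divide_id, coverage_fraction, value) rows; values are illustrative.
rows = [
    ("cat-1866993", 0.006, -1.083),
    ("cat-1866993", 0.868, -1.052),
    ("cat-1866995", 0.101, -0.910),
    ("cat-1866995", 0.099, -0.905),
]

num = defaultdict(float)  # running sum of weight * value per divide
den = defaultdict(float)  # running sum of weights per divide

# One pass over the table: no per-catchment seeks needed.
for divide_id, w, v in rows:
    num[divide_id] += w * v
    den[divide_id] += w

means = {d: num[d] / den[d] for d in num}  # area-weighted mean per divide
print(means)
```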
@JordanLaserGit You may also consider trying to work with pyarrow RecordBatches instead of loading in the entire parquet file, if you're not already, since you'll know the divide IDs beforehand (or use polars, since it's better than pandas 😛). For example:
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset
divs = ["cat-1866993", "cat-1866995"]
def print_batches():
    w = pa.dataset.dataset(
        's3://lynker-spatial/v20.1/forcing_weights.parquet', format='parquet'
    ).filter(
        pc.field('divide_id').isin(divs)
    ).to_batches()

    batch: pa.RecordBatch
    for batch in w:
        tbl = batch.to_pandas()
        if tbl.empty:
            continue
        print(tbl)
#> divide_id cell coverage_fraction grid_id
#> 0 cat-1866993 7632614.0 0.006293 medium_range.forcing.conus
#> 1 cat-1866993 7632615.0 0.083473 medium_range.forcing.conus
#> 2 cat-1866993 7637220.0 0.007748 medium_range.forcing.conus
#> .. ... ... ... ...
#> 43 cat-1866995 7669480.0 0.101241 medium_range.forcing.conus
#> 44 cat-1866995 7669481.0 0.099398 medium_range.forcing.conus
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("print_batches()", globals=locals(), number=3))
#> 10.545011819922365
With the pyarrow compute kernel, you can also perform aggregation before reading through the batches IIRC. If I can get an example, I'll edit it onto here.
@program-- that's handy! our main application is a conus run so we have to load the whole thing anyway, but this will be helpful when we are subsetting, which we are supporting.
I'd consider it for CONUS too -- you can potentially group batches by ID, and then the grouped aggregation becomes embarrassingly parallel. Since you're streaming records, the memory footprint is lower too (polars has some handy methods for doing all of this via lazy frames).
@program-- i'll give it a try!
Catchment weights (the indices of the grid cells that fall within a particular catchment's divide, relative to some grid) are needed for extracting forcing data. The method of calculating these weights can vary and yield different weights depending on the algorithm used; see issue 28 in hfsubset. Also, these weights may change with each hydrofabric release. To help ensure catchment weights are consistent for a particular hydrofabric and grid, I suggest generating CONUS weight files for common grids (projections) that can be subsetted via a tool like hfsubset.
For example, NWM v3 uses a Lambert conformal conic projection:
https://noaa-nwm-pds.s3.amazonaws.com/nwm.20240112/forcing_medium_range/nwm.t00z.medium_range.forcing.f001.conus.nc
The projection type and hydrofabric version can be stored as fields in the weights.json. This way, forcing engines could check that the weights' projection is appropriate for the NWM file being processed. Information about the grid should be stored in the JSON as well (dx, dy, nx, ny, x0, y0).
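As a sketch of what that metadata check could look like (all field names here are hypothetical illustrations, not an existing schema):

```python
import json

# Hypothetical weights-file metadata; field names are illustrative only.
weights_meta = json.loads("""{
    "hydrofabric_version": "v20.1",
    "grid_id": "medium_range.forcing.conus",
    "crs": "+proj=lcc +lat_0=40 +lon_0=-97 +lat_1=30 +lat_2=60 +x_0=0 +y_0=0 +R=6370000 +units=m +no_defs",
    "dx": 1000, "dy": 1000, "nx": 4608, "ny": 3840,
    "x0": -2303999, "y0": -1920000
}""")

def grid_matches(meta, forcing_crs, nx, ny):
    """Check that a forcing file's projection and shape match the weights."""
    return meta["crs"] == forcing_crs and meta["nx"] == nx and meta["ny"] == ny

# A forcing engine would read these values from the NWM NetCDF file itself.
ok = grid_matches(weights_meta, weights_meta["crs"], 4608, 3840)
print(ok)
```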
This will place more responsibility on the hydrofabric, but it might be worth it in the long run to avoid weight-generation algorithms yielding different weights for the same hydrofabric and grid.