Open fernando-aristizabal opened 1 year ago
@fernando-aristizabal I would never discourage the development of new tools. :) There's actually a stale issue about building a retrospective client here: #157 We never actually got around to building the tool, but you may find some of the discussion useful. We and others have encountered some difficulty reliably retrieving and validating the zarr data.
@fernando-aristizabal, thanks for opening this! I share @jarq6c's sentiment.
What do you envision the api(s) would return? An `xarray.Dataset` or some flavor of dataframe (`pandas`, `dask`, etc.)?
Hey @aaraney, initial thought was to keep it to xarray since that's what natively works best for these zarr/netcdf files. It would also keep data lazily loaded and up to the user to slice or convert to a desired object type.
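To illustrate that pattern, here's a minimal sketch using a synthetic in-memory dataset as a stand-in for the retrospective output (the variable name, feature ids, and shapes are invented for illustration): the user slices first, then converts only the subset they need.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a retrospective streamflow dataset
# (feature_id x time); values and ids are made up.
times = pd.date_range("1993-01-01", periods=24, freq="h")
features = np.array([101, 102, 103])
ds = xr.Dataset(
    {"streamflow": (("feature_id", "time"),
                    np.random.default_rng(0).random((3, 24)))},
    coords={"feature_id": features, "time": times},
)

# Slice down to what you need, then convert just that subset.
subset = ds.sel(feature_id=101).isel(time=slice(0, 6))
df = subset["streamflow"].to_dataframe()  # 6-row pandas DataFrame
```

With a real zarr store opened lazily, the same `sel`/`isel` calls would only trigger reads for the chunks the slice touches.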
Given some of the issues with Zarr, has anyone produced a kerchunk index of the NetCDF retro data that we could use? It would load in a similar fashion and likely avoid some of the problems introduced in the Zarr rechunking.
We might ask @mgdenno to contribute to this conversation. The TEEHR project (https://github.com/RTIInternational/teehr) has a system in place to retrieve these data (time series, point, and gridded) for exploratory evaluations. There may be an opportunity to collaborate with CIROH.
I have a few thoughts to contribute to the conversation.
Regardless, we are certainly interested in collaborating on common tooling so we can try not to reinvent "the wheel".
FRSA @samlamont
I think that James Halgren's group at AWI just recently built the Kerchunk headers for the 2.1 retrospective NetCDFs on AWS. I think they are currently in an unadvertised AWS S3 bucket. @jameshalgren
@fernando-aristizabal Please take a look. https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/index.html#noaa-nwm-retrospective-2-1-zarr-pds/ (Everyone is welcome to explore; only forcing data are complete there now, but we're working on a complete archive of materials.)
@igarousi, we should connect about this and add some material to the comment thread here.
Hey everyone!
Thanks for contributing to this! It seems like a great survey of the various efforts to better access NWM data. I'm going to rope in @GautamSood-NOAA and @sudhirshrestha-noaa, who also have an interest in this and specifically in what other variables might be useful to have rechunked or indexed.
I'll start off commenting on @mgdenno's insightful points.
Are you using `MultiZarrToZarr` for the purpose of creating local parquet file(s)?

Moving on to @jameshalgren's info on some of the work that CIROH has been doing on this: it seems very helpful, as a few CIROH people have reached out to me or mentioned their questions about NWM data access.
I took a look at the README.md but wasn't able to get a successful request.
```python
>>> import requests
>>> d = requests.get('https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/noaa-nwm-retrospective-2-1-zarr-pds/README.md')
>>> d.status_code
200
>>> d.text
'404: Not Found'
```
My understanding is that these are single-file jsons for the forcing data? The forcing data seems to be of interest to people based on feedback. Is there a single multi-file json, or a plan to create one?
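For context on what these single-file jsons contain: a kerchunk reference is just JSON that maps Zarr keys to byte ranges inside the original NetCDF, along the lines of the sketch below (the bucket, file name, array metadata, and offsets here are invented for illustration, not taken from the actual index).

```json
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "streamflow/.zarray": "{\"chunks\": [2776738], \"shape\": [2776738], \"dtype\": \"<i4\", \"compressor\": null, \"fill_value\": null, \"filters\": null, \"order\": \"C\", \"zarr_format\": 2}",
    "streamflow/0": ["s3://example-bucket/example.CHRTOUT_DOMAIN1.comp", 20480, 11106952]
  }
}
```

A multi-file json would merge many of these, so the reader fetches metadata once and then resolves any chunk of any file to a single byte-range request.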
I'd like to share some work showing that Zarr rechunked across time instead of features yields significant improvement in time-series-based queries. The repo for this is available here, as well as more specifically here and here. This work was influenced by @jarq6c, @sudhirshrestha-noaa, and @AustinJordan-NOAA.
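The intuition behind that speedup can be sketched with simple chunk-count arithmetic (the dimensions and chunk sizes below are illustrative, not the exact retrospective layout): the cost of a time-series query is roughly proportional to how many chunks it must fetch.

```python
from math import ceil

def chunks_touched(time_chunk, feature_chunk, query_times, query_features):
    """Number of chunks a query must read from a (time, feature) grid
    stored with the given chunk sizes; each chunk is a separate fetch."""
    return ceil(query_times / time_chunk) * ceil(query_features / feature_chunk)

# Illustrative dimensions: ~40 years of hourly output.
n_time = 367_000

# Query: the full time series at a single feature.
feature_major = chunks_touched(672, 30_000, n_time, 1)  # small time chunks
time_major = chunks_touched(n_time, 1, n_time, 1)       # whole series in one chunk
```

Under these made-up sizes the feature-major layout touches hundreds of chunks for one hydrograph while the time-major layout touches one, which is the effect the rechunking work exploits.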
Hopefully this adds to the various efforts at improving NWM data access and builds towards generating a comprehensive solution for research and dissemination applications.
Hi all, this is Sam, I'm working with @mgdenno on the TEEHR tooling and have a few points/questions to add.
Regarding TEEHR, yes we create the single-file jsons and then, in some cases, the combined json using `MultiZarrToZarr`, although we found that for this use case there is no real performance gain from the `MultiZarrToZarr` step and are considering removing it. In general, I think combining the single-file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the `MultiZarrToZarr` step. This is very much a work in progress, however, and we welcome any feedback!
Also just to clarify, is the overall discussion here around how best to support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance, fetching the entire time series for one feature vs. the partial time series of many features)?
If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the `.nc` files, and the original NWM retrospective files consist of only one chunk across all features? Although I could be wrong here?
I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as experimental. Could be something to investigate?
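My mental model of sharding, sketched as index arithmetic only (the function and names below are mine, not the zarr API): a shard is one storage object holding many inner chunks, so a reader can fetch byte ranges for just the inner chunks it needs rather than whole large objects.

```python
def locate(element_index, shard_size, chunk_size):
    """For a 1-D array stored as shards that each contain several inner
    chunks, return (shard, chunk-within-shard, offset-within-chunk).
    Illustrative sketch only; not the zarr-python sharding API."""
    assert shard_size % chunk_size == 0, "shards hold whole chunks"
    shard, within = divmod(element_index, shard_size)
    chunk, offset = divmod(within, chunk_size)
    return shard, chunk, offset

# One shard = one object in storage; a partial read of that object
# decodes only the inner chunks the query touches.
where = locate(10_500, shard_size=4_096, chunk_size=512)
```

This is what makes sharding attractive for the "few large objects, small queries" case: the object count stays manageable while the effective read granularity stays small.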
I hope these comments are helpful, happy to discuss further if not!
@samlamont Thanks for jumping in with interesting input.
> Regarding TEEHR, yes we create the single-file jsons and then, in some cases, the combined json using `MultiZarrToZarr`, although we found that for this use case there is no real performance gain from the `MultiZarrToZarr` step and are considering removing it. In general, I think combining the single-file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the `MultiZarrToZarr` step. This is very much a work in progress, however, and we welcome any feedback!
This is my general understanding as well, since kerchunk doesn't actually rechunk the files; it just builds an index around them, allowing access to the metadata and lazy loading. What would you say the advantage of single-file jsons is without aggregating them?
> If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the `.nc` files, and the original NWM retrospective files consist of only one chunk across all features? Although I [could be wrong here](https://github.com/fsspec/kerchunk/issues/124)?
Building on the previous comment, it's my understanding that the value of kerchunk is that, when building a multi-file json, you get the advantages we've previously mentioned. Zarr offers the same benefits while also rechunking and recompressing into cloud-optimized formats. The link I shared previously demonstrates how this can speed up access if done properly for the right applications.
> Also just to clarify, is the overall discussion here around how best to support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance, fetching the entire time series for one feature vs. the partial time series of many features)?
This discussion started with wanting to add some of the AWS retro Zarr references to the hydrotools repo to supplement the repo's existing NWM data access tools. @jarq6c brought up some concerns with the Zarr rechunking there, and the conversation expanded to various indexing/chunking efforts. It's apparent there are many efforts here across groups without a clear, consistent solution to gather around.
Over the course of this thread, I learned that @GautamSood-NOAA and @sudhirshrestha-noaa will also be doing some rechunking; they want to solicit feedback from SMEs on what variables, in addition to streamflow, qSfcLatRunoff, and qBucket, might be useful. I suggested the forcing data as well as the lake variables, as they all may be relevant for FIM eventually. They are eager for people's opinions, so feel free to communicate your needs to them.
> I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as [experimental](https://github.com/zarr-developers/zarr-python/pull/1111#event-8421675745). Could be something to investigate?
Lastly, sharding seems to allow a partial chunk read? It's hard to tell because some of the linked pages appear to be down. If so, I'm sure this would add value when we have large chunks with specific queries.
Hi @fernando-aristizabal, thanks for the feedback. On the single json vs. aggregated approach for NWM forecasts, we noticed a much smoother Dask task stream when using the single-file jsons as opposed to aggregating with `MultiZarrToZarr`. The overall performance/run time was about the same, however, so I'm not sure I can say there was a huge advantage. I did notice that when concatenating forecasts, `MultiZarrToZarr` will append `nan` values to the individual forecasts in order to build a contiguous array over the requested time period. Again, I'm not sure what the overall impact of this behavior is (if any), so I might be over-complicating things here, and I apologize if I'm taking this thread down a technical rabbit hole! 😃
Thanks for the additional clarification, I'm happy to contribute to this effort in any way. I'll post back here if I learn of any benefits to sharding.
As the NWM Client seems to focus on forecast data from 2018 onwards on GCP or the past two days on NOMADS, I've thought about the retrospective somewhat.
AWS publishes three versions of NWM retrospective analysis:
At least two of the versions, 2.1 and 2.0, have been rechunked to Zarr which make for easy ingest:
After your imports:
The 2.1 dataset is as follows:
Other variables are available such as precipitation:
The 2.0 data looks like this:
While this seems easy enough, it might be useful to write a function that abstracts some of these URIs away into something that caters to domain scientists. Please let me know if this is of interest to you all; I might be able to get to it in the next few weeks, as @GregoryPetrochenkov-NOAA and I will be doing some FIM evals using NWM data in the near future.
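A minimal sketch of what that helper might look like. Only the 2.1 `chrtout.zarr` URI is the published AWS location I'm confident of; the other entries are my best recollection and should be verified before use.

```python
# Hypothetical helper mapping (version, product) pairs to Zarr store URIs
# so callers never handle raw S3 paths. Only the 2.1 chrtout entry is
# confirmed; double-check the other URIs before relying on them.
RETRO_ZARR = {
    ("2.1", "chrtout"): "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr",
    ("2.1", "precip"): "s3://noaa-nwm-retrospective-2-1-zarr-pds/precip.zarr",
    ("2.0", "chrtout"): "s3://noaa-nwm-retro-v2-zarr-pds",
}

def retrospective_uri(version: str, product: str) -> str:
    """Return the S3 URI of the rechunked Zarr store for a retrospective
    version/product pair, raising ValueError for unknown combinations."""
    try:
        return RETRO_ZARR[(version, product)]
    except KeyError:
        raise ValueError(
            f"no known Zarr store for NWM {version} {product!r}"
        ) from None
```

A caller could then do something like `xr.open_zarr(fsspec.get_mapper(retrospective_uri("2.1", "chrtout"), anon=True))` and stay entirely in version/product terms.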