Open fernando-aristizabal opened 1 year ago
@fernando-aristizabal I would never discourage the development of new tools. :) There's actually a stale issue about building a retrospective client here: #157 We never actually got around to building the tool, but you may find some of the discussion useful. We and others have encountered some difficulty reliably retrieving and validating the zarr data.
@fernando-aristizabal, thanks for opening this! I share @jarq6c's sentiment.
What do you envision the api(s) would return? An `xarray.Dataset` or some flavor of dataframe (`pandas`, `dask`, etc.)?
Hey @aaraney, initial thought was to keep it to xarray since that's what natively works best for these zarr/netcdf files. It would also keep data lazily loaded and up to the user to slice or convert to a desired object type.
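To illustrate that pattern, here's a minimal sketch using a synthetic in-memory dataset as a stand-in for the retrospective output (the variable name, feature ids, and shapes are invented for illustration): the user slices first, then converts only the subset they need.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a retrospective streamflow dataset
# (feature_id x time); values and ids are made up.
times = pd.date_range("1993-01-01", periods=24, freq="h")
features = np.array([101, 102, 103])
ds = xr.Dataset(
    {"streamflow": (("feature_id", "time"),
                    np.random.default_rng(0).random((3, 24)))},
    coords={"feature_id": features, "time": times},
)

# Slice down to what you need, then convert just that subset.
subset = ds.sel(feature_id=101).isel(time=slice(0, 6))
df = subset["streamflow"].to_dataframe()  # 6-row pandas DataFrame
```

With a real zarr store opened lazily, the same `sel`/`isel` calls would only trigger reads for the chunks the slice touches.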
Given some of the issues with Zarr, has anyone produced a kerchunk index of the NetCDF retro data that we could use? It would load in a similar fashion and likely avoid some of the problems introduced in the Zarr rechunking.
We might ask @mgdenno to contribute to this conversation. The TEEHR project (https://github.com/RTIInternational/teehr) has a system in place to retrieve these data (time series, point, and gridded) for exploratory evaluations. There may be an opportunity to collaborate with CIROH.
I have a few thoughts to contribute to the conversation.
Regardless, we are certainly interested in collaborating on common tooling so we can try not to reinvent "the wheel".
FRSA @samlamont
I think that James Halgren's group at AWI just recently built the Kerchunk headers for the 2.1 retrospective NetCDFs on AWS. I think they are currently in an unadvertised AWS S3 bucket. @jameshalgren
@fernando-aristizabal Please take a look. https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/index.html#noaa-nwm-retrospective-2-1-zarr-pds/ (Everyone is welcome to explore; only forcing data are complete there now, but we're working on a complete archive of materials.)
@igarousi, we should connect about this and add some material to the comment thread here.
Hey everyone!
Thanks for contributing to this! It seems like a great survey of the various efforts to better access NWM data. I'm going to rope in @GautamSood-NOAA and @sudhirshrestha-noaa, who also have an interest in this and specifically in what other variables might be useful to have rechunked or indexed.
I'll start off commenting on @mgdenno's insightful points.
Are you using `MultiZarrToZarr` for the purpose of creating local parquet file(s)?

Moving on to @jameshalgren's info on some of the work that CIROH has been doing on this: it seems very helpful, as a few CIROH people have reached out to me or mentioned their questions about NWM data access.
I took a look at the README.md but wasn't able to get a successful request.
```python
>>> import requests
>>> d = requests.get('https://ciroh-nwm-zarr-retrospective-data-copy.s3.amazonaws.com/noaa-nwm-retrospective-2-1-zarr-pds/README.md')
>>> d.status_code
200
>>> d.text
'404: Not Found'
```
My understanding is that these are single-file jsons for the forcing data? The forcing data seems to be of interest to people based on feedback. Is there a single multi-file json, or a plan to create one?
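For context on what these single-file jsons contain: a kerchunk reference is just JSON that maps Zarr keys to byte ranges inside the original NetCDF, along the lines of the sketch below (the bucket, file name, array metadata, and offsets here are invented for illustration, not taken from the actual index).

```json
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "streamflow/.zarray": "{\"chunks\": [2776738], \"shape\": [2776738], \"dtype\": \"<i4\", \"compressor\": null, \"fill_value\": null, \"filters\": null, \"order\": \"C\", \"zarr_format\": 2}",
    "streamflow/0": ["s3://example-bucket/example.CHRTOUT_DOMAIN1.comp", 20480, 11106952]
  }
}
```

A multi-file json would merge many of these, so the reader fetches metadata once and then resolves any chunk of any file to a single byte-range request.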
I'd like to share some work showing that Zarr rechunked across time instead of features yields significant improvement in time-series-based queries. The repo for this is available here, as well as more specifically here and here. This work was influenced by @jarq6c, @sudhirshrestha-noaa, and @AustinJordan-NOAA.
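The intuition behind that speedup can be sketched with simple chunk-count arithmetic (the dimensions and chunk sizes below are illustrative, not the exact retrospective layout): the cost of a time-series query is roughly proportional to how many chunks it must fetch.

```python
from math import ceil

def chunks_touched(time_chunk, feature_chunk, query_times, query_features):
    """Number of chunks a query must read from a (time, feature) grid
    stored with the given chunk sizes; each chunk is a separate fetch."""
    return ceil(query_times / time_chunk) * ceil(query_features / feature_chunk)

# Illustrative dimensions: ~40 years of hourly output.
n_time = 367_000

# Query: the full time series at a single feature.
feature_major = chunks_touched(672, 30_000, n_time, 1)  # small time chunks
time_major = chunks_touched(n_time, 1, n_time, 1)       # whole series in one chunk
```

Under these made-up sizes the feature-major layout touches hundreds of chunks for one hydrograph while the time-major layout touches one, which is the effect the rechunking work exploits.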
Hopefully this adds to the various efforts at improving NWM data access and builds towards generating a comprehensive solution for research and dissemination applications.
Hi all, this is Sam, I'm working with @mgdenno on the TEEHR tooling and have a few points/questions to add.
Regarding TEEHR, yes we create the single-file jsons and then, in some cases, the combined json using `MultiZarrToZarr`, although we found that for this use case there is no real performance gain from the `MultiZarrToZarr` step and are considering removing it. In general, I think combining the single-file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the `MultiZarrToZarr` step. This is very much a work in progress, however, and we welcome any feedback!
Also just to clarify, is the overall discussion here around how best to support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance, fetching the entire time series for one feature vs. the partial time series of many features)?
If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the `.nc` files, and the original NWM retrospective files consist of only one chunk across all features? Although I could be wrong here?
I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as experimental. Could be something to investigate?
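My mental model of sharding, sketched as index arithmetic only (the function and names below are mine, not the zarr API): a shard is one storage object holding many inner chunks, so a reader can fetch byte ranges for just the inner chunks it needs rather than whole large objects.

```python
def locate(element_index, shard_size, chunk_size):
    """For a 1-D array stored as shards that each contain several inner
    chunks, return (shard, chunk-within-shard, offset-within-chunk).
    Illustrative sketch only; not the zarr-python sharding API."""
    assert shard_size % chunk_size == 0, "shards hold whole chunks"
    shard, within = divmod(element_index, shard_size)
    chunk, offset = divmod(within, chunk_size)
    return shard, chunk, offset

# One shard = one object in storage; a partial read of that object
# decodes only the inner chunks the query touches.
where = locate(10_500, shard_size=4_096, chunk_size=512)
```

This is what makes sharding attractive for the "few large objects, small queries" case: the object count stays manageable while the effective read granularity stays small.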
I hope these comments are helpful, happy to discuss further if not!
@samlamont Thanks for jumping in with interesting input.
> Regarding TEEHR, yes we create the single-file jsons and then, in some cases, the combined json using `MultiZarrToZarr`, although we found that for this use case there is no real performance gain from the `MultiZarrToZarr` step and are considering removing it. In general, I think combining the single-file jsons is helpful when you're dealing with many contiguous files (i.e., the entire NWM retrospective dataset) since it allows you to read the file metadata only once across the entire dataset. So far with TEEHR, we've been focusing on subsets of operational NWM forecasts (~monthly) and have not seen much advantage from including the `MultiZarrToZarr` step. This is very much a work in progress, however, and we welcome any feedback!
This is my general understanding as well, since kerchunk doesn't actually rechunk the files; it just builds an index around them, allowing access to the metadata and lazy loading. What would you say the advantage of single-file jsons is without aggregating them?
> If so, I'm curious what the advantages are in accessing the data using the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the `.nc` files, and the original NWM retrospective files consist of only one chunk across all features? Although I [could be wrong here](https://github.com/fsspec/kerchunk/issues/124)?
Building on the previous comment, it's my understanding that the value of kerchunk is that, when building a multi-file json, you get the advantages we've previously mentioned. Zarr offers the same benefits while also rechunking and recompressing into cloud-optimized formats. The link I shared previously demonstrates how this can speed up access if done properly for the right applications.
> Also just to clarify, is the overall discussion here around how best to support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance, fetching the entire time series for one feature vs. the partial time series of many features)?
This discussion started with wanting to add some of the AWS retro Zarr references to the hydrotools repo to supplement the repo's existing NWM data access tools. @jarq6c brought up some concerns with the Zarr rechunking there, and the conversation expanded to various indexing/chunking efforts. It's apparent there are many efforts here across groups without a clear, consistent solution to gather around.
Over the course of this thread, I learned that @GautamSood-NOAA and @sudhirshrestha-noaa will also be doing some rechunking; they want to solicit feedback from SMEs on what variables, in addition to streamflow, qSfcLatRunoff, and qBucket, might be useful. I suggested the forcing data as well as the lake variables, as they all may be relevant for FIM eventually. They are eager for people's opinions, so feel free to communicate your needs to them.
> I'm also curious if sharding could be helpful here? I believe this capability allows for a sort of nested chunking scheme and has been released as [experimental](https://github.com/zarr-developers/zarr-python/pull/1111#event-8421675745). Could be something to investigate?
Lastly, sharding seems to allow a partial chunk read? It's hard to tell because some of the linked pages appear to be down. If so, I'm sure this would add value when we have large chunks with specific queries.
Hi @fernando-aristizabal, thanks for the feedback. On the single json vs. aggregated approach for NWM forecasts, we noticed a much smoother Dask task stream when using the single-file jsons as opposed to aggregating with `MultiZarrToZarr`. The overall performance/run time was about the same, however, so I'm not sure I can say there was a huge advantage. I did notice that when concatenating forecasts, `MultiZarrToZarr` will append `nan` values to the individual forecasts in order to build a contiguous array over the requested time period. Again, I'm not sure what the overall impact of this behavior is (if any), so I might be over-complicating things here, and I apologize if I'm taking this thread down a technical rabbit hole! 😃
Thanks for the additional clarification, I'm happy to contribute to this effort in any way. I'll post back here if I learn of any benefits to sharding.
As the NWM Client seems to focus on forecast data from 2018 onwards on GCP or the past two days on NOMADS, I've thought about the retrospective somewhat.
AWS publishes three versions of NWM retrospective analysis:
At least two of the versions, 2.1 and 2.0, have been rechunked to Zarr which make for easy ingest:
After your imports:
The 2.1 dataset is as follows:
Other variables are available such as precipitation:
The 2.0 data looks like this:
While this seems easy enough, it might be useful to write a function that abstracts some of these URIs away into something that caters to domain scientists. Please let me know if this is of interest to you all; I might be able to get to it in the next few weeks, as @GregoryPetrochenkov-NOAA and I will be doing some FIM evals using NWM data in the near future.
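A minimal sketch of what that helper might look like. Only the 2.1 `chrtout.zarr` URI is the published AWS location I'm confident of; the other entries are my best recollection and should be verified before use.

```python
# Hypothetical helper mapping (version, product) pairs to Zarr store URIs
# so callers never handle raw S3 paths. Only the 2.1 chrtout entry is
# confirmed; double-check the other URIs before relying on them.
RETRO_ZARR = {
    ("2.1", "chrtout"): "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr",
    ("2.1", "precip"): "s3://noaa-nwm-retrospective-2-1-zarr-pds/precip.zarr",
    ("2.0", "chrtout"): "s3://noaa-nwm-retro-v2-zarr-pds",
}

def retrospective_uri(version: str, product: str) -> str:
    """Return the S3 URI of the rechunked Zarr store for a retrospective
    version/product pair, raising ValueError for unknown combinations."""
    try:
        return RETRO_ZARR[(version, product)]
    except KeyError:
        raise ValueError(
            f"no known Zarr store for NWM {version} {product!r}"
        ) from None
```

A caller could then do something like `xr.open_zarr(fsspec.get_mapper(retrospective_uri("2.1", "chrtout"), anon=True))` and stay entirely in version/product terms.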