blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.
https://herbie.readthedocs.io/
MIT License

Is the subsetting of downloaded files working correctly? #25

Open blaylockbk opened 3 years ago

blaylockbk commented 3 years ago

(issue brought up by @danieldjewell)

If you have already downloaded the full GRIB2 file and then run the .xarray('HGT') method to open a subset, it appears that Herbie won't subset the local file, but will instead redownload the requested subset. In other words, instead of opening the full GRIB2 already on disk and grabbing all of the "HGT" records, it redownloads the "HGT" records from the remote source URL.

from herbie.archive import Herbie

# Download the full file
h = Herbie('2021-09-22 12:00', model='gfs')

# Read a subset of the file
ds = h.xarray('HGT')

See... the print message says the file is being redownloaded from the remote! (screenshot of the download message omitted)

This may be harder to implement cleanly than I thought, because subsetting the downloaded data would write out a second copy of data that is already on your local machine. I wonder how I could read the subset straight from the full local file based on the searchString argument without depending on wgrib2.

  1. One solution might be to not use Herbie to subset the local file, but to use cfgrib to subset it based on the messages of interest by level and variable (see the sketch after this list).
  2. Another option is to use the remote index file to subset-cURL the local file (this is how I thought the local subsetting worked, but I need to double-check).
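
A minimal sketch of option 1, assuming the full GRIB2 file is already on disk: cfgrib's filter_by_keys argument opens only the messages that match a given variable and level, so no second copy of the data is written. The shortName 'gh' (geopotential height) and level type are illustrative choices here.

import xarray as xr
from herbie.archive import Herbie

# Download the full file, then get its local path.
h = Herbie('2021-09-22 12:00', model='gfs')
h.download()
local_file = h.get_localFilePath()

# Open only the height messages on pressure levels; cfgrib skips the
# other GRIB messages instead of duplicating the file on disk.
ds = xr.open_dataset(
    local_file,
    engine='cfgrib',
    backend_kwargs={
        'filter_by_keys': {'shortName': 'gh', 'typeOfLevel': 'isobaricInhPa'}
    },
)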
karlwx commented 2 years ago

I don't think depending on wgrib2 is necessarily a bad thing. What happens if you want to open a subset of a file that doesn't have an index file associated with it? I would think generating one with wgrib2 would be the easiest way.

blaylockbk commented 2 years ago

Yep, I have thought about building in the ability to create a local index file if one is not found on the network and if wgrib2 is installed. This is a good idea.
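
For reference, a hedged sketch of how that could work (not Herbie's actual implementation): shell out to wgrib2 and save its short inventory, which uses the same format as the remote .idx files. The helper name is hypothetical.

import shutil
import subprocess

def make_local_index(grib_path):
    """Hypothetical helper: write a wgrib2-style .idx next to a local GRIB2 file."""
    if shutil.which('wgrib2') is None:
        raise RuntimeError('wgrib2 is not installed; cannot build a local index.')
    # `wgrib2 -s` prints the short inventory, one line per GRIB message,
    # in the same format as the remote .idx files.
    inventory = subprocess.run(
        ['wgrib2', '-s', grib_path],
        capture_output=True, text=True, check=True,
    ).stdout
    idx_path = grib_path + '.idx'
    with open(idx_path, 'w') as f:
        f.write(inventory)
    return idx_path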

karlwx commented 2 years ago

It would be very useful to create the local index file using wgrib2. Currently, I'm trying to download RAP historical data from one of those weird years where the index files aren't present. The download (of a subset, which is what I need) is failing because there is no index file. Example of a failing download:

from herbie.archive import Herbie
from datetime import datetime

D = datetime(2017,3,22,6)
H = Herbie(D, model='rap_historical', product='analysis', save_dir='/tmp/')
datasets = H.xarray("((HGT|UGRD|VGRD):(500|300) mb)|((HGT|UGRD|VGRD|RH|VVEL):(850|700) mb)|((UGRD|VGRD|TMP):(925) mb)|(CAPE:surface)|((MSLMA):mean sea level)")
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_246537/3588827135.py in <module>
----> 1 datasets = H.xarray("((HGT|UGRD|VGRD):(500|300) mb)|((HGT|UGRD|VGRD|RH|VVEL):(850|700) mb)|((UGRD|VGRD|TMP):(925) mb)|(CAPE:surface)|((MSLMA):mean sea level)")
      2 ds = xr.merge(datasets)
      3 ds

~/.conda/envs/radar/lib/python3.9/site-packages/herbie/archive.py in xarray(self, searchString, backend_kwargs, remove_grib, **download_kwargs)
    890 
    891         # Download file if local file does not exists
--> 892         local_file = self.get_localFilePath(searchString=searchString)
    893 
    894         # Only remove grib if it didn't exists before we download it

~/.conda/envs/radar/lib/python3.9/site-packages/herbie/archive.py in get_localFilePath(self, searchString)
    522         if searchString is not None:
    523             # Reassign the index DataFrame with the requested searchString
--> 524             self.idx_df = self.read_idx(searchString)
    525 
    526             # Get a list of all GRIB message numbers. We will use this

~/.conda/envs/radar/lib/python3.9/site-packages/herbie/archive.py in read_idx(self, searchString)
    558         A Pandas DataFrame of the index file.
    559         """
--> 560         assert self.idx is not None, f"No index file found for {self.grib}."
    561 
    562         if self.IDX_STYLE == "wgrib2":

AssertionError: No index file found for https://www.ncei.noaa.gov/data/rapid-refresh/access/historical/analysis/201703/20170322/rap_130_20170322_0600_000.grb2.
blaylockbk commented 2 years ago

Hi @karlwx, I'll have to find some time to look into this.

One thing to remember: wgrib2 can only make an index of a local file. If there isn't an index file on the remote, we need a local copy of the file to make the index.

In other words, this doesn't work:

wgrib2 https://nomads.ncep.noaa.gov/pub/data/nccf/com/hrrr/prod/hrrr.20220711/conus/hrrr.t00z.wrfnatf00.grib2

*** FATAL ERROR: missing input file https://nomads.ncep.noaa.gov/pub/data/nccf/com/hrrr/prod/hrrr.20220711/conus/hrrr.t00z.wrfnatf00.grib2 ***

The current main branch of Herbie does make index files "on the fly" if there is a local copy of the full file. In your case, you would need to download the full file first...

from herbie.archive import Herbie
from datetime import datetime

D = datetime(2017,3,22,6)
H = Herbie(D, model='rap_historical', product='analysis', save_dir='/tmp/')

H.download()  # Since an index file doesn't exist, we need to download the full file.

# When calling the xarray method, an index file will be made on the fly and used to search the full file (but not saved).
datasets = H.xarray("((HGT|UGRD|VGRD):(500|300) mb)|((HGT|UGRD|VGRD|RH|VVEL):(850|700) mb)|((UGRD|VGRD|TMP):(925) mb)|(CAPE:surface)|((MSLMA):mean sea level)")

It is inconvenient that we need to download the full grib file when there isn't an index file, but I haven't figured out a way to get around this.

karlwx commented 2 years ago

Thanks @blaylockbk, I appreciate it! In the meantime, I'll add some logic to simply download the full file if the index file isn't present.
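
That fallback might look something like the sketch below (not Herbie's API; it catches the AssertionError shown in the traceback above):

from herbie.archive import Herbie
from datetime import datetime

D = datetime(2017, 3, 22, 6)
H = Herbie(D, model='rap_historical', product='analysis', save_dir='/tmp/')
search = '(HGT|UGRD|VGRD):(500|300) mb'

try:
    # Works when a remote index file exists.
    datasets = H.xarray(search)
except AssertionError:
    # No remote index file: download the whole GRIB2, then subset it
    # locally using the on-the-fly index described above.
    H.download()
    datasets = H.xarray(search)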

karlwx commented 2 years ago

FYI, the hour-0 GRIB files from this time period seem to be corrupted, in addition to missing index files. When I download the whole file and run wgrib2 on it, I get this error:

198:10245639:d=2017032206:TMP:surface:anl:
pdt_len: bad stat_proc ranges = 0 set to to 1
** ERROR bad grib message: Statistical Processing bad n=0 **
199:10346709:d=2017032206:ASNOW:surface::

*** FATAL ERROR (delayed): forecast time for /tmp/rap_historical/20170322/rap_130_20170322_0600_000.grb2

The hour-1 files are fine (with .idx files present on NCEI), so one enhancement could be to fall back to the previous cycle's hour-1 file, which is valid at the same time, when the hour-0 file is bad.
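
A hypothetical sketch of that enhancement (not an existing Herbie feature; it assumes the fxx argument selects the forecast lead time):

from herbie.archive import Herbie
from datetime import datetime, timedelta

valid = datetime(2017, 3, 22, 6)

try:
    # Try the hour-0 analysis first.
    H = Herbie(valid, model='rap_historical', product='analysis', fxx=0)
    ds = H.xarray('TMP:surface')
except Exception:
    # The previous cycle's 1-hour forecast is valid at the same time
    # as the corrupted hour-0 file.
    H = Herbie(valid - timedelta(hours=1), model='rap_historical',
               product='analysis', fxx=1)
    ds = H.xarray('TMP:surface')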

blaylockbk commented 2 years ago

This is interesting: https://forum.mmm.ucar.edu/threads/why-do-i-see-fatal-error-statistical-processing-bad-n-0-when-using-the-wgrib2-utility-on-my-upp-output.9214/

This error message is displayed when using more recent versions of the wgrib2 utility on files for forecast hour zero that contain accumulated or time-averaged fields. This is due to newer versions of wgrib2 no longer allowing the n parameter to be zero or empty.

Users can also continue to use an older version of wgrib2; the latest version known not to produce this error is v2.0.4.

aladakepri commented 8 months ago

@dplarson