NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense

SanDiego.nc file throws error in load_arbitrary_ugrid.py example script #56

Open gewitterblitz opened 4 years ago

gewitterblitz commented 4 years ago

Hi, I am new to gridded. I was trying to replicate the load_arbitrary_ugrid.py example script but could not load the SanDiego.nc file. I tried both downloading the file to the working directory and accessing it directly through the provided URL, using both the netCDF4 and xarray libraries.

Approach 1:

with netCDF4.Dataset("SanDiego.nc") as nc:

    # need to convert to zero-indexing
    nodes = nc.variables['nodes'][:] - 1
    faces = nc.variables['E3T'][:, :3] - 1

which fails with:

OSError: [Errno -51] NetCDF: Unknown file format: b'SanDiego.nc'

Approach 2:

url = ('https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc#mode=bytes')
dataset = netCDF4.Dataset(url)

which fails with:

FileNotFoundError: [Errno 2] No such file or directory: b'https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc#mode=bytes'

Approach 3:

ds = xr.open_dataset(url)

which fails with a KeyError raised from xarray's backends/file_manager.py: the underlying file cannot be opened, so the lookup in xarray's LRU cache fails on the cache key for that URL.
Huite commented 4 years ago

Hi,

I just tried it locally, the file opens fine for me.

I'm guessing something went wrong during your download. We could find out with a checksum:

import hashlib

def md5_hash(path: str) -> str:
    with open(path, "rb") as f:
        content = f.read()
    return hashlib.md5(content).hexdigest()

print(md5_hash("SanDiego.nc"))
# prints 1ed883e7318883ef654c123106ed09c0

Did you already try re-downloading the netCDF file? (Lame suggestion I know, but sometimes the solution is mundane!)

(By the way, ordinary URLs will not work with the netCDF4 library, only an OPeNDAP URL will work: https://www.opendap.org/)
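To make the distinction concrete, here is a small sketch; the THREDDS-style server path below is hypothetical (not a real dataset), and the network call is left commented out:

```python
# A hypothetical OPeNDAP endpoint: THREDDS servers typically expose these
# under a /dodsC/ path. No file download is involved.
opendap_url = "https://example.org/thredds/dodsC/some/model_output.nc"

# A plain GitHub "blob" URL points at an HTML page, not at the data:
github_blob_url = "https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc"

# With a netCDF4 build that has DAP support, an OPeNDAP URL opens like a file:
# import netCDF4
# ds = netCDF4.Dataset(opendap_url)       # would work against a real server
# ds = netCDF4.Dataset(github_blob_url)   # fails: not an OPeNDAP endpoint
```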

ChrisBarker-NOAA commented 4 years ago

Works for me, too.

I think your file got corrupted, or your netcdf lib is somehow broken.

You might try running ncdump on the file to test it out:

$ ncdump -h SanDiego.nc

-CHB

gewitterblitz commented 4 years ago

Huite and Chris,

Yes, I did try re-downloading the file, but I still get the same error. The checksum changes every time I download the file.

Here is how I am downloading it within my Jupyter notebook on the university cluster (same result on my local machine):

! wget https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc

which gives the following output:

wget: /apps/spack/rice/apps/anaconda/5.3.1-py37-gcc-4.8.5-7vvmykn/lib/libuuid.so.1: no version information available (required by wget)
--2020-08-07 02:10:58--  https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘SanDiego.nc’

    [ <=>                                   ] 81,848      --.-K/s   in 0.05s   

2020-08-07 02:10:59 (1.68 MB/s) - ‘SanDiego.nc’ saved [81848]

Here is the checksum code output:

import hashlib

def md5_hash(path: str) -> str:
    with open(path, "rb") as f:
        content = f.read()
    return hashlib.md5(content).hexdigest()

print(md5_hash("SanDiego.nc"))
# prints 4ab9b0548e4ed555970d71ee2238a5c2

And here is the same error using the example script:

from datetime import datetime, timedelta
import gridded
import netCDF4

with netCDF4.Dataset("SanDiego.nc") as nc:

    # need to convert to zero-indexing
    nodes = nc.variables['nodes'][:] - 1
    faces = nc.variables['E3T'][:, :3] - 1

    # make the grid
    # gridded.grids.Grid_U
    grid = gridded.grids.Grid_U(nodes=nodes,
                                faces=faces,
                                )

    # make the time object (handles time interpolation, etc)
    times_var = nc.variables['times'][:]

    # Time axis needs to be a list of datetime objects.
    # If the meta data are not there in the netcdf file, you have to do it by hand.
    start = datetime(2019, 1, 1, 12)
    times = [start + timedelta(seconds=val) for val in times_var]

    # This isn't a compliant file, so this will not work.
    # time_obj = gridded.time.Time.from_netCDF(dataset=nc,
    #                                          varname='times')

    time_obj = gridded.time.Time(data=times,
                                 filename=None,
                                 varname=None,
                                 tz_offset=None,
                                 origin=None,
                                 displacement=timedelta(seconds=0),)

    # make the variables
    depth = nc.variables['Depth']

@ChrisBarker-NOAA : ncdump does not work either:

ncdump: SanDiego.nc: SanDiego.nc: NetCDF: Unknown file format

Am I downloading from the right weblink?

@Huite : Good to know the OPeNDAP trick.

Huite commented 4 years ago

Ah, you were using wget, so yes, you're on the right track: you're not downloading what you think you are downloading, because of how GitHub works. You've been getting HTML pages instead of the netCDF file (open SanDiego.nc in a text editor to see for yourself).

See: https://unix.stackexchange.com/questions/228412/how-to-wget-a-github-file

This'll do the trick:

wget https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc?raw=true -O SanDiego.nc

(I didn't know this either before now.)

You're probably on a *nix machine (if you're using wget), so you can check immediately:

md5sum SanDiego.nc

I got a little curious: you can get the data into Python without saving it to a file, provided you once again use the right URL: https://github.com/Unidata/netcdf4-python/issues/295

import requests
import netCDF4

my_url = "https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc?raw=true"
response = requests.get(my_url, stream=True)
ds = netCDF4.Dataset('name', mode='r', memory=response.content)

OPeNDAP is probably a lot fancier, but it's pretty cool that this works.

Xarray dispatches based on the type you pass in; to read from memory you need h5py available as a backend, and you provide a BytesIO object.

import io
import requests
import xarray as xr

response = requests.get("https://github.com/erdc/AdhModel/blob/master/tests/test_files/SanDiego/SanDiego.nc?raw=true")
ds = xr.open_dataset(io.BytesIO(response.content))
ChrisBarker-NOAA commented 4 years ago

@Huite: thanks for beating me to it!

And the xarray trick is another good reason to build the next version of gridded on it :-)

-CHB

gewitterblitz commented 4 years ago

@Huite : Thank you so much, it works!!!

I had no idea about the wget issue with GitHub files. Your suggestion for loading through the URL is really helpful and works great!

I am currently trying out gridded for post-processing the output from an atmospheric NWP model. Will let you guys know if I need any help.

Any words of wisdom from your end are highly appreciated. I found out about gridded from @ChrisBarker-NOAA's AMS 2017 talk.

gewitterblitz commented 4 years ago

@ChrisBarker-NOAA @Huite : What's the best way to reach out to you to discuss gridded's application to a meteorological numerical model output?

ChrisBarker-NOAA commented 4 years ago

Actually, GitHub is a pretty good way to do it.

Why not start a new issue (or issues) with a question or proposal?