NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense
64 stars 14 forks source link

Support for MFDataset #16

Open acrosby opened 6 years ago

acrosby commented 6 years ago

Is it difficult to provide MF support in gridded using the underlying netCDF4.MFDataset method?

ChrisBarker-NOAA commented 6 years ago

Hey! I just looked, it's already there! See gridded.utilities. Nice time machine, @jay-hennen :-)

def get_dataset(ncfile, dataset=None):
    """
    Utility to create a netCDF4 Dataset from a filename, list of filenames,
    or just pass it through if it's already a netCDF4.Dataset

    if dataset is not None, it should be a valid netCDF4 Dataset object,
    and it will simply be returned
    """
    if dataset is not None:
        return dataset
    if isinstance(ncfile, nc4.Dataset):
        return ncfile
    elif isinstance(ncfile, collections.Iterable) and len(ncfile) == 1:
        return nc4.Dataset(ncfile[0])
    elif isstring(ncfile):
        return nc4.Dataset(ncfile)
    else:
        return nc4.MFDataset(ncfile)

@acrosby: could you test and close this issue if it works?

I'll update the docstring in the gridded.Dataset constructor.
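The dispatch in get_dataset above can be sketched standalone (no netCDF4 required) to show which constructor each input type selects. The stand-in classes below are hypothetical placeholders for nc4.Dataset and nc4.MFDataset, not the real API:

```python
# Standalone sketch of get_dataset's dispatch logic.
# FakeDataset / FakeMFDataset stand in for nc4.Dataset / nc4.MFDataset.

class FakeDataset:
    def __init__(self, fname):
        self.fname = fname

class FakeMFDataset:
    def __init__(self, fnames):
        self.fnames = list(fnames)

def get_dataset_sketch(ncfile):
    if isinstance(ncfile, FakeDataset):
        return ncfile                      # already a Dataset: pass through
    if isinstance(ncfile, str):
        return FakeDataset(ncfile)         # single filename -> Dataset
    if len(ncfile) == 1:
        return FakeDataset(ncfile[0])      # one-element list -> plain Dataset
    return FakeMFDataset(ncfile)           # multiple filenames -> MFDataset

print(type(get_dataset_sketch("a.nc")).__name__)            # FakeDataset
print(type(get_dataset_sketch(["a.nc"])).__name__)          # FakeDataset
print(type(get_dataset_sketch(["a.nc", "b.nc"])).__name__)  # FakeMFDataset
```

The practical upshot: passing a list of two or more filenames anywhere a filename is accepted should yield an MFDataset.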

acrosby commented 6 years ago

Oh cool, I'll give it a try and report back.

acrosby commented 6 years ago

I am getting a segfault trying to use the gridded.Dataset constructor, but gridded.utilities.get_dataset seems to return a functional netCDF4.MFDataset as expected.

ChrisBarker-NOAA commented 6 years ago

OK, that's weird. All the Dataset init does is call:

self.nc_dataset = get_dataset(ncfile)

segfaults are a pain to debug, but any idea how far it gets before the crash?

-CHB

jay-hennen commented 6 years ago

All you should need to do to get it to create a netCDF4.MFDataset instead of a netCDF4.Dataset is to pass a list of filenames in place of a filename. If it's segfaulting in the C library, the first step is to make sure the API on this end is being used as expected.

acrosby commented 6 years ago

This is what is seg-faulting on my machine, but gridded.utilities.get_dataset(f) works and returns an MFDataset. Is this how you intended the constructor to be used? In this case, it's just two small files I used to test.

f = glob.glob("nos.cbofs.fields.*.20180117.*.nc")
nc = gridded.Dataset(f)

Here is what I have installed: appdirs==1.4.2 archivemail==0.9.0 awscli==1.14.3 backports-abc==0.5 backports.functools-lru-cache==1.4 backports.shutil-get-terminal-size==1.0.0 backports.ssl-match-hostname==3.5.0.1 basemap==1.0.7 beautifulsoup4==4.4.1 bleach==1.5.0 bokeh==0.12.5 botocore==1.8.7 branca==0.2.0 Cartopy==0.14.2 cell-tree2d==0.3.0 certifi==2017.4.17 chardet==3.0.3 Cheetah==2.4.4 click==6.7 click-plugins==1.0.3 cligj==0.4.0 cloudpickle==0.3.1 colorama==0.3.7 colorcet==0.9.1 configparser==3.5.0 cryptography==1.2.3 cvxopt==1.1.4 cycler==0.10.0 Cython==0.27.3 dap==2.2.6.7 dask==0.15.4 datashader==0.5.0 datashape==0.5.2 decorator==4.0.11 defusedxml==0.4.1 descartes==1.1.0 dnspython==1.12.0 docker==2.3.0 docker-pycreds==0.2.1 docutils==0.14 eccodes==2.6.0 entrypoints==0.2.2 enum34==1.1.6 feedparser==5.1.3 Fiona==1.7.8 flake8==3.3.0 Flask==0.12.2 -e git+git@github.com:acrosby/folium.git@ad1dd1a9089ee3c0ff0d4217b853e4f6725d5f79#egg=folium FormEncode==1.3.0 funcsigs==1.0.2 functools32==3.2.3.post2 futures==3.2.0 GDAL==1.11.3 geojson==1.3.4 geopandas==0.2.1 geoviews==1.2.0 gevent==1.2.2 gpodder==3.9.0 greenlet==0.4.12 grequests==0.3.0 gridded==0.0.9 h5py==2.6.0 holoviews==1.7.0 html5lib==0.9999999 httplib2==0.9.1 idna==2.5 ipaddress==1.0.16 ipdb==0.10.2 ipykernel==4.5.2 ipyleaflet==0.2.1 ipython==5.3.0 ipython-genutils==0.1.0 ipywidgets==5.2.2 itsdangerous==0.24 jdcal==1.0 jedi==0.9.0 Jinja2==2.9.5 jmespath==0.9.3 joblib==0.9.4 jsonschema==2.6.0 julia==0.1.4 julian==0.14 jupyter==1.0.0 jupyter-client==5.0.0 jupyter-console==5.1.0 jupyter-core==4.3.0 keyring==7.3 llvmlite==0.18.0 lxml==3.5.0 Magic-file-extensions==0.2 Markdown==2.6.8 MarkupSafe==0.23 matplotlib==2.1.1 mccabe==0.6.1 metakernel==0.20.2 mistune==0.7.3 multipledispatch==0.4.9 munch==2.1.1 mygpoclient==1.7 nbconvert==5.1.1 nbformat==4.3.0 netCDF4==1.2.2 networkx==1.11 nose==1.3.7 notebook==4.4.1 npyscreen==4.10.5 numba==0.33.0 numexpr==2.6.4 numpy==1.14.0 octave-kernel==0.26.2 odo==0.5.0 openpyxl==2.3.0 
packaging==16.8 pandas==0.21.0 pandocfilters==1.4.1 param==1.5.1 Paste==1.7.5.1 PasteDeploy==1.5.2 PasteScript==1.7.5 pathlib2==2.2.1 patsy==0.4.1 pdfkit==0.6.1 pexpect==4.2.1 pickleshare==0.7.4 Pillow==3.1.2 prompt-toolkit==1.0.13 psycopg2==2.6.1 ptyprocess==0.5.1 py==1.4.31 pyasn1==0.4.2 pycodestyle==2.3.1 pycrypto==2.6.1 pyephem==3.7.6.0 pyflakes==1.5.0 pygc==1.1.0 Pygments==2.2.0 pygobject==3.20.0 pyliblzma==0.5.3 pyOpenSSL==0.15.1 pyparsing==2.2.0 pyproj==1.9.5.1 pysgrid==0.3.5 pyshp==1.2.10 pyspatialite==3.0.1 pytest==2.8.7 python-dateutil==2.6.1 python-openid==2.2.5 pytz==2017.3 pyugrid==0.2.2 pyxdg==0.25 PyYAML==3.12 pyzmq==16.0.2 qtconsole==4.2.1 ranger==1.7.1 requests==2.17.3 rsa==3.4.2 s3transfer==0.1.12 scandir==1.5 scgi==1.13 scipy==1.0.0 seaborn==0.8.1 SecretStorage==2.1.3 Shapely==1.5.17.post1 simplegeneric==0.8.1 simplejson==3.8.1 singledispatch==3.4.0.3 six==1.11.0 SOAPpy==0.12.22 statsmodels==0.8.0 subprocess32==3.2.7 tables==3.2.2 Tempita==0.5.2 terminado==0.6 testpath==0.3 toolz==0.8.2 tornado==4.4.2 traitlets==4.3.2 urllib3==1.21.1 uTidylib==0.2 wcwidth==0.1.7 websocket-client==0.40.0 Werkzeug==0.12.2 widgetsnbextension==1.2.6 wstools==0.4.3 xarray==0.9.1 xlrd==0.9.4 xlwt==0.7.5

Python 2.7.12 (default, Nov 20 2017, 18:23:56) [GCC 5.4.0 20160609] on linux2 NetCDF Version 4.4.0 HDF5 Version 1.8.16

jay-hennen commented 6 years ago

Hmm. This might be related, somehow:

"The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order."

Are the filenames in sequential ascending order? It may not work otherwise.
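Since glob makes no ordering guarantee, sorting the matches before handing them to MFDataset is cheap insurance. A minimal sketch (the filenames below are hypothetical examples, not from the dataset under discussion):

```python
import glob

# glob.glob() returns matches in arbitrary, filesystem-dependent order.
# Sorting gives a deterministic ascending sequence, which is what
# sequentially numbered forecast files need for aggregation.
def ordered_matches(pattern):
    return sorted(glob.glob(pattern))

# With zero-padded timestamps, lexical order matches chronological order:
unsorted = ["fcst.t06.nc", "fcst.t00.nc", "fcst.t12.nc"]
print(sorted(unsorted))  # ['fcst.t00.nc', 'fcst.t06.nc', 'fcst.t12.nc']
```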

ChrisBarker-NOAA commented 6 years ago

Well, it's arbitrary, but usually alphabetical anyway -- though it won't hurt to add a sorted() call.

But the mystery here is that it only fails when the Dataset initializer is used -- and all that does is pass the list on to `gridded.utilities.get_dataset`.

I suspect this is a bit of a red herring -- segfaults are weird, and probably coming from netCDF4 -- and maybe it fails in one case rather than the other due to how memory happens to be arranged in the app when the actual problem occurs :-(

@acrosby: can you point us to your data, so we can see if we can repeat it?

and/or try another platform or different version of netCDF4 ?

-CHB

acrosby commented 6 years ago

I'm just using the COOPS coastal models from the production forecasts. I'll dive a little deeper and see if I can narrow down what is going on.

ChrisBarker-NOAA commented 6 years ago

COOPS doesn't always get aggregation right :-( I'd take a look at the files.

FWIW, we process everything through our GOODS system to make sure it's compatible with our code (and other normalizations). You are welcome to give that a try.

https://gnome.orr.noaa.gov/goods

Most of these will work with Gridded, except some of the weird grids, like CBOFS :-( -- we have that working with kludges in our older, pre-gridded code.

acrosby commented 6 years ago

I'm looking at individual files for each forecast tau on my local system, though, which doesn't really explain why I can use MFDataset by itself and work with the files.

jay-hennen commented 6 years ago

This may actually be cell_tree segfaulting. Are we sure the grid is kosher? It can't deal with overlapping cells, so if this file is like CBOFS things won't go well (segfaults, undefined behavior).

I can't think of a reason WHY it would segfault during Dataset creation, since usually the celltree would only be created when necessary, but there might be some case I can't recall that triggers creation of the cell_tree.

Try using the files to create a Grid and a Variable using `.from_netCDF()`. If they segfault there as well, it's more than likely netCDF. In both those cases, the celltree should not be created and should therefore not segfault.

acrosby commented 6 years ago

That sounds possible (I was using CBOFS), although doesn't the cell_tree get built when I read the files individually too?

jay-hennen commented 6 years ago

That's why I suggested trying Grid.from_netCDF. I put in effort to avoid creating cell trees during object initialization for this exact reason, but it seems I compromised at some point. I think the Depth objects, specifically S_Depth, do need to use the cell tree on creation, so that's where it would be coming from if the file uses sigma depths. If you try Grid.from_netCDF and it doesn't segfault, then I would bet on celltree being the culprit here. It's guaranteed that Grid initialization doesn't create a cell tree.

acrosby commented 6 years ago

Alright I'll give that a try.

FWIW I see the same error with LEOFS (Ugrid/FVCOM) and GLOFS (RGrid/POM (I think...)).

acrosby commented 6 years ago

Passing the MFDataset into gridtest = Grid.from_netCDF(dataset=... , ...) and then Variable.from_netCDF(dataset=... , grid=gridtest, ...) does seem to work without segfaults.

jay-hennen commented 6 years ago

Hmm, that's interesting. Can you try it by passing the glob to the filename kwarg? That should also work, using get_dataset to make the MFDataset. Also, verify that the variable you chose has a depth (just check the .depth attribute).

If all of the above works, then it's more than likely netCDF is the culprit.

acrosby commented 6 years ago

f = glob.glob("glofs*nowcast*.nc")[:3]

# This works (var without depth)
gridtest = gridded.Grid.from_netCDF(filename=f, name='zeta', varname='zeta')
gridded.Variable.from_netCDF(filename=f, name='zeta', varname='zeta', grid=gridtest)

# Variable with depth errors (but no segfault) on second line
gridtest = gridded.Grid.from_netCDF(filename=f, name='u', varname='u')
gridded.Variable.from_netCDF(filename=f, name='u', varname='u', grid=gridtest)

Error:

/usr/local/lib/python2.7/dist-packages/gridded/variable.pyc in from_netCDF(cls, filename, varname, grid_topology, name, units, time, time_origin, grid, depth, dataset, data_file, grid_file, load_all, fill_value, **kwargs)
    213                     isinstance(grid, Grid_U) and len(data.shape) == 3):
    214                 depth = Depth.from_netCDF(grid_file,
--> 215                                           dataset=dg,
    216                                           )
    217 #             if len(data.shape) == 4 or (len(data.shape) == 3 and time is None):

/usr/local/lib/python2.7/dist-packages/gridded/depth.pyc in from_netCDF(filename, dataset, depth_type, varname, topology, _default_types, *args, **kwargs)
    450         if (depth_type is None or isinstance(depth_type, string_types) or
    451                                  not issubclass(depth_type, DepthBase)):
--> 452             cls = Depth._get_depth_type(df, depth_type, topology, _default_types)
    453 
    454         return cls.from_netCDF(filename=filename,

/usr/local/lib/python2.7/dist-packages/gridded/depth.pyc in _get_depth_type(dataset, depth_type, topology, _default_types)
    498         else:
    499             try:
--> 500                 L_Depth.from_netCDF(dataset=dataset)
    501                 return L_Depth
    502             except ValueError:

/usr/local/lib/python2.7/dist-packages/gridded/depth.pyc in from_netCDF(cls, filename, dataset, name, topology, terms, surface_index, bottom_index, **kwargs)
    129                 vname=tn
    130                 if tn not in dataset.variables.keys():
--> 131                     vname = cls._gen_varname(filename, dataset, [tn], [tln])
    132                 terms[tn] = dataset[vname][:]
    133         if surface_index is None:

/usr/local/lib/python2.7/dist-packages/gridded/depth.pyc in _gen_varname(cls, filename, dataset, names_list, std_names_list)
     81             for var in df.variables.values():
     82                 if hasattr(var, 'standard_name') or hasattr(var, 'long_name'):
---> 83                     if var.name == n:
     84                         return n
     85         raise ValueError("Default names not found.")

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._Variable.__getattr__ (netCDF4/_netCDF4.c:67175)()

AttributeError: name
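The traceback ends in `_Variable.__getattr__`: MFDataset wraps its aggregated variables in a proxy object, and the `var.name` lookup in `_gen_varname` fails on that proxy. A defensive alternative is to match on the key in dataset.variables, which is always available. A minimal sketch, using a hypothetical stand-in class for the proxy rather than the real netCDF4 type:

```python
# Sketch: look a variable up by its dict key in dataset.variables rather
# than relying on var.name, which the MFDataset proxy above did not provide.

class ProxyVar:
    """Hypothetical stand-in for an MFDataset variable proxy."""
    standard_name = "sea_water_x_velocity"
    # note: deliberately no .name attribute, mimicking the AttributeError

variables = {"u": ProxyVar()}

def find_var(variables, wanted):
    for key, var in variables.items():
        # the dict key always exists, even when var.name does not
        if key == wanted or getattr(var, "name", None) == wanted:
            return var
    raise ValueError("Default names not found.")

print(find_var(variables, "u") is variables["u"])  # True
```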
acrosby commented 6 years ago

Passing the string with the pattern itself yields `RuntimeError: No such file or directory` from both get_dataset and the from_netCDF methods.

Also, the above from_netCDF method throws a different error on CBOFS (rather than the more conventional grid I was looking at above), for both u (has depth) and zeta (surface), which is a little weird because the single files at least can be converted into a dataset, and u/zeta into variables:

...
---> 57 grid = gridded.Grid.from_netCDF(filename=f, name='u', varname='u')
     58 ssh = gridded.Variable.from_netCDF(filename=f, name='u', varname='u', grid=grid)
     59 bbox = get_bbox(grid)

/usr/local/lib/python2.7/dist-packages/gridded/grids.pyc in from_netCDF(filename, dataset, grid_type, grid_topology, _default_types, *args, **kwargs)
    419         compliant = Grid._find_topology_var(None, gf)
    420         if compliant is not None:
--> 421             c = Grid._load_grid(filename, cls, dataset)
    422             c.grid_topology = compliant.__dict__
    423         else:

/usr/local/lib/python2.7/dist-packages/gridded/grids.pyc in _load_grid(filename, grid_type, dataset)
    384         elif issubclass(grid_type, SGrid):
    385             ds = get_dataset(filename, dataset)
--> 386             g = grid_type.load_grid(ds)
    387             g.filename = filename
    388             return g

/usr/local/lib/python2.7/dist-packages/gridded/pysgrid/sgrid.pyc in load_grid(cls, nc)
    100         else:
    101             nc = Dataset(nc, 'r')
--> 102         topology_var = find_grid_topology_var(nc)
    103         sa = SGridAttributes(nc, cls.topology_dimension, topology_var)
    104         dimensions = sa.get_dimensions()

/usr/local/lib/python2.7/dist-packages/gridded/pysgrid/read_netcdf.pyc in find_grid_topology_var(nc)
     25 
     26     """
---> 27     grid_topology = nc.get_variables_by_attributes(cf_role='grid_topology')
     28 
     29     if not grid_topology:

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.get_variables_by_attributes (netCDF4/_netCDF4.c:20945)()

TypeError: 'NoneType' object is not iterable
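The final TypeError comes from inside the compiled get_variables_by_attributes, which apparently cannot handle this aggregated dataset. The attribute-matching semantics it provides can be approximated in pure Python as a workaround sketch; the variable class below is a hypothetical stand-in, not the netCDF4 type:

```python
# Fallback sketch of attribute-based variable lookup for datasets whose
# built-in get_variables_by_attributes fails. FakeVar is a hypothetical
# stand-in for a netCDF variable object.

class FakeVar:
    def __init__(self, **attrs):
        for k, v in attrs.items():
            setattr(self, k, v)

def variables_by_attributes(variables, **kwargs):
    """Return variables whose attributes equal all the requested values."""
    out = []
    for var in variables.values():
        if all(getattr(var, k, None) == v for k, v in kwargs.items()):
            out.append(var)
    return out

variables = {
    "grid": FakeVar(cf_role="grid_topology"),
    "u": FakeVar(standard_name="sea_water_x_velocity"),
}
matches = variables_by_attributes(variables, cf_role="grid_topology")
print(len(matches))  # 1
```

With a helper like this, find_grid_topology_var could fall back to iterating dataset.variables directly when the built-in call misbehaves on an MFDataset.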