Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Inconsistent Loading Results for netCDF4 version #1232

Closed powellb closed 1 year ago

powellb commented 1 year ago

I am trying to load a variable from a THREDDS URL; however, I get different results depending on the netCDF4/Python version.

The code is simply:

import netCDF4
file="https://www.star.nesdis.noaa.gov/thredds/dodsC/swathNPPVIIRSNRTL2PWW00/2023/041/20230210123000-STAR-L2P_GHRSST-SSTsubskin-VIIRS_NPP-ACSPO_V2.80-v02.0-fv01.0.nc"
nc=netCDF4.Dataset(file)
err = nc.variables["sses_standard_deviation"][:]

On Linux with Python 3.6.3 and netCDF4 1.2.4, this results in a masked array with valid entries:

masked_array(data =
 [[[0.36000001430511475 0.36000001430511475 0.36000001430511475 ..., -- --
   --]
  [0.36000001430511475 0.36000001430511475 0.36000001430511475 ..., -- --
   --]
  [0.36000001430511475 0.36000001430511475 0.36000001430511475 ..., -- --
   0.36000001430511475]
  ..., 
  [0.3100000023841858 0.3400000333786011 0.3799999952316284 ...,
   0.25999999046325684 0.25999999046325684 0.25999999046325684]
  [0.33000004291534424 0.3400000333786011 0.3799999952316284 ...,
   0.25999999046325684 0.25999999046325684 0.25999999046325684]
  [0.3100000023841858 0.2800000309944153 0.36000001430511475 ...,
   0.25999999046325684 0.25999999046325684 0.25999999046325684]]],
             mask =
 [[[False False False ...,  True  True  True]
  [False False False ...,  True  True  True]
  [False False False ...,  True  True False]
  ..., 
  [False False False ..., False False False]
  [False False False ..., False False False]
  [False False False ..., False False False]]],
       fill_value = -128)

However, running this same code on several Linux and macOS configurations installed via conda-forge (Python 3.10.8 with netCDF4 1.6.2 and Python 3.11.0 with netCDF4 1.6.2, in separate virtual environments) produces invalid results: the err variable comes back entirely masked, with every entry invalid.

masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]]],
  mask=[[[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]]],
  fill_value=128,
  dtype=float32)

The variable of interest is stored as an int8 with a scale_factor of 0.01 and an add_offset of 1.0. Entries outside of ±127 are outside the declared valid range for this variable, and -128 is reserved as the fill value.

nc.variables["sses_standard_deviation"] is:

<class 'netCDF4._netCDF4.Variable'>
int8 sses_standard_deviation(time, nj, ni)
    _Unsigned: false
    add_offset: 1.0
    comment: Standard deviation of sea_surface_temperature from SST measured by drifting buoys. Further information at (Petrenko et al., JTECH, 2016; doi:10.1175/JTECH-D-15-0166.1)
    coordinates: lon lat
    long_name: SSES standard deviation
    scale_factor: 0.01
    units: kelvin
    valid_max: 127
    valid_min: -127
    _FillValue: -128
    coverage_content_type: qualityInformation
    _ChunkSizes: [   1 1536 3200]
unlimited dimensions: 
current shape = (1, 5392, 3200)
filling off
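
For reference, CF-style packing means the physical value is packed_value * scale_factor + add_offset. Below is a minimal numpy sketch of what the attributes above imply; the packed values are illustrative, not read from the file:

import numpy as np

scale_factor, add_offset = 0.01, 1.0                     # from the variable attributes above
packed = np.array([-127, -64, 0, 127], dtype=np.int8)    # illustrative signed int8 counts
unpacked = packed * scale_factor + add_offset            # CF unpacking convention
# -> [-0.27, 0.36, 1.0, 2.27] kelvin; a stored byte of -64 unpacks to the 0.36
#    seen in the working output, and -128 is reserved as the _FillValue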

Examining err.data in the non-working cases reveals values greater than 127, which implies netCDF4 is not treating this as a signed int8; and since every value listed below is greater than 127, it would also mean every standard deviation is negative. As shown by the working version at the top, the int8 values for this dataset should typically be in the 30-40 range before the scale_factor is applied.

>>> err.data
array([[[192., 192., 192., ..., 128., 128., 128.],
        [192., 192., 192., ..., 128., 128., 128.],
        [192., 192., 192., ..., 128., 128., 192.],
        ...,
        [187., 190., 194., ..., 182., 182., 182.],
        [189., 190., 194., ..., 182., 182., 182.],
        [187., 184., 192., ..., 182., 182., 182.]]], dtype=float32)

Is it possible that there is an issue in 1.6.2 that mishandles type int8?

jswhit commented 1 year ago

Looks to me like they are all masked because they are outside the range of [valid_min,valid_max]. If you don't want any values to be masked, you can use set_auto_mask.
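
A minimal sketch of that suggestion (set_auto_mask is available on both Dataset and Variable objects; with it disabled, data come back as plain ndarrays with no fill-value or valid-range masking applied):

import netCDF4

nc = netCDF4.Dataset(file)                          # same THREDDS URL as in the snippet above
nc.set_auto_mask(False)                             # disable masking for all variables in this Dataset
err = nc.variables["sses_standard_deviation"][:]    # plain ndarray, nothing masked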

jswhit commented 1 year ago

Oh - I see your point: the value in the file is actually -64, but it is apparently being interpreted as an unsigned int8 and given a value of 192. This is because the _Unsigned attribute exists; even though it is set to 'false', the module only checks for its existence, not its value. There is a discussion of this at https://github.com/Unidata/netcdf4-python/issues/656.

jswhit commented 1 year ago

This could be considered a bug. We should probably check whether it is set to "false" or "False" (or, alternatively, only treat the variable as unsigned if _Unsigned is "true" or "True").
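
A hypothetical sketch of such a value-aware check (the function and names here are illustrative only, not the module's actual code):

def is_unsigned(var):
    # treat a variable as unsigned only when _Unsigned is explicitly set to a true-like value
    attr = getattr(var, "_Unsigned", None)
    return str(attr).strip().lower() in ("true", "1")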

powellb commented 1 year ago

Thank you for the follow-up. Unfortunately, set_auto_mask isn't a workaround on its own because, as you mention, the original type is not being treated properly.

I just did a quick test and cloned the issue1232 branch, but I couldn't get it to compile (due to the script not generating src/netCDF4/_netCDF4.c). I'll try to work through the compilation issue and test out the fix.

jswhit commented 1 year ago

In addition to turning the masking off, you could use the numpy view method to construct a view of the data as a signed int8.
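
The view trick reinterprets the bytes rather than converting the numbers, so a value read as unsigned 192 comes back as -64. A numpy-only illustration using values from the err.data dump above:

import numpy as np

raw = np.array([192, 187, 128], dtype=np.uint8)   # bytes as read under the unsigned interpretation
signed = raw.view(np.int8)                        # same bytes reinterpreted as signed int8
# -> array([ -64,  -69, -128], dtype=int8); -128 is the declared _FillValue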

powellb commented 1 year ago

Thank you, yes, the view will indeed cast it, but the scale_factor and add_offset must be removed first.

Extending the original code snippet from the first message:

nc.set_auto_mask(False)                             # return plain arrays instead of masked arrays
err = nc.variables["sses_standard_deviation"][:]    # floats scaled from the misread unsigned bytes
err = ((err-1)*100).astype('int8')*.01 + 1          # undo scale/offset, recast as signed int8, reapply

This results in the proper values.

For the life of me (on both Linux and macOS), I cannot get the issue1232 branch to build: everything seems to work except that it never generates the _netCDF4.c file. I found an issue from years ago about disabling Cython; however, looking at setup.py, it seems to require it.

I apologize for the hassle.

% python setup.py build
reading from setup.cfg...
Package hdf5 was not found in the pkg-config search path.
Perhaps you should add the directory containing `hdf5.pc'
to the PKG_CONFIG_PATH environment variable
Package 'hdf5', required by 'virtual:world', not found
using /share/apps/netcdf-4.9.0-gnu/bin/nc-config...
checking /share/apps/netcdf-4.9.0-gnu/include ...
hdf5 headers not found in /share/apps/netcdf-4.9.0-gnu/include
nc-config did provide path to HDF5 headers, search standard locations...checking /share/apps/hdf5-1.14.0-gnu/include ...
HDF5 library version: 1.14.0 headers found in /share/apps/hdf5-1.14.0-gnu/include
HDF5 library version: 1.14.0 found in /share/apps/hdf5-1.14.0-gnu/
using netcdf library version b'4.9.0'
using Cython to compile netCDF4.pyx...
netcdf lib has group rename capability
netcdf lib has nc_inq_path function
netcdf lib has nc_inq_format_extended function
netcdf lib has nc_open_mem function
netcdf lib has nc_create_mem function
netcdf lib has cdf-5 format capability
netcdf lib has netcdf4 parallel functions
netcdf lib does not have pnetcdf parallel functions
netcdf lib has bit-grooming/quantization functions
netcdf lib has zstandard compression functions
netcdf lib has bzip2 compression functions
netcdf lib has blosc compression functions
netcdf lib does not have szip compression functions
netcdf lib has nc_set_alignment function
netcdf lib has nc_inq_filter_avail function
NETCDF_PLUGIN_DIR not set, no netcdf compression plugins installed
/share/apps/miniforge3/envs/pacioos/lib/python3.11/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
  warnings.warn(msg, _BetaConfiguration)
running build
running build_py
running egg_info
writing src/netCDF4.egg-info/PKG-INFO
writing dependency_links to src/netCDF4.egg-info/dependency_links.txt
writing entry points to src/netCDF4.egg-info/entry_points.txt
writing requirements to src/netCDF4.egg-info/requires.txt
writing top-level names to src/netCDF4.egg-info/top_level.txt
reading manifest file 'src/netCDF4.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files found matching 'examples/data'
warning: no previously-included files found matching 'src/netCDF4/_netCDF4.c'
adding license file 'LICENSE'
writing manifest file 'src/netCDF4.egg-info/SOURCES.txt'
running build_ext
building 'netCDF4._netCDF4' extension
gcc -pthread -B /share/apps/miniforge3/envs/pacioos/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /share/apps/miniforge3/envs/pacioos/include -fPIC -O2 -isystem /share/apps/miniforge3/envs/pacioos/include -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/share/apps/netcdf-4.9.0-gnu/include -I/share/apps/hdf5-1.14.0-gnu/include -I/share/apps/miniforge3/envs/pacioos/lib/python3.11/site-packages/numpy/core/include -Iinclude -I/share/apps/miniforge3/envs/pacioos/include/python3.11 -c src/netCDF4/_netCDF4.c -o build/temp.linux-x86_64-cpython-311/src/netCDF4/_netCDF4.o
gcc: error: src/netCDF4/_netCDF4.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command '/opt/ohpc/pub/compiler/gcc/9.4.0/bin/gcc' failed with exit code 1

jswhit commented 1 year ago

Can you run cython manually on src/netCDF4/_netCDF4.pyx?

powellb commented 1 year ago

I realized that the environment I was building in didn't have cython installed.

I built the issue1232 branch and re-ran my tests, and it works now. Thank you!