barronh / pseudonetcdf

PseudoNetCDF like NetCDF except for many scientific format backends

copyVariable with scale and offset #114

Closed by barronh 2 years ago

barronh commented 3 years ago

When a variable has a scale and offset that are not consistent with the stored type, the data can be truncated on copy. For example, the Aura OMNO2 CloudFraction variable is stored as an int16, but uses a floating point scale_factor (0.001) and offset (0.0). The masked/scaled values should be floating point data. In PseudoNetCDF, any copyVariable call, such as the one made inside subset, truncates these values:

import PseudoNetCDF as pnc
import numpy as np

path='https://aura.gesdisc.eosdis.nasa.gov/opendap/hyrax/Aura_OMI_Level2/OMNO2.003/2020/001/OMI-Aura_L2-OMNO2_2020m0101t0117-o82246_v003-2020m0610t191058.he5'

key = 'CloudFraction'
satf = pnc.pncopen(path, format='netcdf')
subsetf = satf.subset([key])
copyf = pnc.PseudoNetCDFFile()
copyf.copyDimension(satf.dimensions['nTimes'], key='nTimes')
copyf.copyDimension(satf.dimensions['nXtrack'], key='nXtrack')
copyf.copyVariable(satf.variables[key], key=key)
print('Bad', np.unique(subsetf.variables[key][:]))
print('Bad', np.unique(copyf.variables[key][:]))
print('Good', np.unique(satf.variables[key][:]))

The subsetf and copyf variables report only 0, 1, and masked, while satf reports values from 0 to 1 in steps of 0.001, plus masked. This is the result of the new variable inheriting the dtype np.int16. The fractional values are cast to integers and the data is lost.
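The truncation can be reproduced without the remote file. The following is a minimal NumPy-only sketch, assuming made-up stored values in place of the real CloudFraction data; it mimics the dtype inheritance described above rather than calling PseudoNetCDF itself.

```python
import numpy as np

# Hypothetical packed values standing in for CloudFraction's int16 storage
stored = np.array([0, 1, 250, 999, 1000], dtype=np.int16)
scale_factor, add_offset = 0.001, 0.0

# What a reader sees when scaling is applied: floating point fractions
scaled = stored * scale_factor + add_offset

# A destination that inherited the stored dtype truncates on assignment
truncated = np.empty(scaled.shape, dtype=np.int16)
truncated[:] = scaled

# A destination allocated with the dtype of the values keeps the fractions
preserved = np.empty(scaled.shape, dtype=scaled.dtype)
preserved[:] = scaled

print('Bad', np.unique(truncated))
print('Good', np.unique(preserved))
```

Only the values that scale to exactly 0 or at least 1 survive the integer cast; everything in between collapses to 0.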

barronh commented 3 years ago

Calling copyVariable with the dtype keyword works around this issue, but only for direct copyVariable calls.

copyf.copyVariable(satf.variables[key], key=key, dtype='f')

barronh commented 2 years ago

One option is to have copyVariable use the dtype of the values when data will be copied, but not when it won't. On the upside, data values will be correctly copied and if written, written correctly. The only con that I see is that subsequent data storage would use the longer data type.

Taking the OMI example file, the variable is stored as a signed short integer (int16), while the scaled values are 32-bit floats. Storing the scaled values therefore doubles the size of that variable, from 2 bytes to 4 bytes per value.

Any other treatment would require storing the backend data types and coordinating with the netCDF4.Dataset.set_auto_maskandscale setting. I therefore think that conserving the data values more than makes up for the downside.
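The proposed rule and its storage cost can be sketched as follows. The helper choose_copy_dtype is hypothetical and written only for illustration; it is not PseudoNetCDF's actual implementation, and the sample values are made up.

```python
import numpy as np


def choose_copy_dtype(stored_dtype, values, withdata):
    """Hypothetical helper: pick the dtype for a copied variable.

    When data will be copied, follow the dtype of the (scaled) values;
    otherwise keep the stored dtype.
    """
    if withdata:
        return np.asarray(values).dtype
    return np.dtype(stored_dtype)


# Made-up packed data and its scaled, 32-bit float view
stored = np.array([0, 500, 1000], dtype=np.int16)
values = (stored * 0.001).astype(np.float32)

with_data = choose_copy_dtype(stored.dtype, values, withdata=True)
without_data = choose_copy_dtype(stored.dtype, values, withdata=False)

# Bytes per value for each choice
print(with_data, with_data.itemsize)
print(without_data, without_data.itemsize)
```

With data, the copy follows the float values and costs 4 bytes per value; without data, it keeps the 2-byte int16 of the source.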

barronh commented 2 years ago

Updating copyVariable fixes the copyf instance, but not subsetf. For some reason, subset was calling copyVariable with withdata=False and then adding the data itself. This was meant to be equivalent to calling copyVariable with withdata=True, so I have changed it to do so. After both changes, copyf and subsetf have the same values as satf.