csiro-coasts / emsarray

xarray extension that supports EMS model formats
BSD 3-Clause "New" or "Revised" License
13 stars 2 forks source link

Fix masking clipped, scaled, and masked integer data array #4

Closed mx-moth closed 2 years ago

mx-moth commented 2 years ago

When clipping the temp variable in the bran2020 dataset, out-of-bounds values are not maksed correctly. The temp variable is a short with float offset and scale parameters. Masked values (such as over land) are marked using the _FillValue: -32768s in the netCDF file. Masked values in the xarray dataset are nans. Clipped values should be nans also to match.

This PR fixes the below bug by using np.nan for the fill value of already masked data arrays. It needs tests before being merged.

import emsarray
import shapely.geometry
import tempfile

ds = emsarray.tutorial.open_dataset("bran2020")
tasmania_clip = shapely.geometry.Polygon([
    (141.459, -40.780),
    (142.954, -39.198),
    (149.106, -39.095),
    (150.864, -41.376),
    (149.809, -44.621),
    (144.843, -45.706),
    (141.723, -43.389),
    (141.372, -40.913),
])

with tempfile.TemporaryDirectory() as temp_name:
    tassie = ds.ems.clip(tasmania_clip, temp_name)
    tassie.load()

tassie.ems.plot(tassie["temp"].isel(Time=0, st_ocean=0))

bad_clip

After fixing the bug, the above code makes this plot instead:

good_clip

Masked vs. clipped data:

>>> # The corner of the dataset which has been masked out incorrectly
>>> print(tassie["temp"][0, 0, 0, 0].values)
array(383.0099, dtype=float32)
>>> # Over land, which was masked when the file was loaded
>>> tassie["temp"][0, 0, 35, 50].values
array(nan, dtype=float32)

$ ncdump -h bran2020.nc:

netcdf bran2020 {
dimensions:
    Time = UNLIMITED ; // (20 currently)
    xt_ocean = 100 ;
    yt_ocean = 100 ;
    st_ocean = 30 ;
    nv = 2 ;
variables:
    ...
    short temp(Time, st_ocean, yt_ocean, xt_ocean) ;
        temp:_FillValue = -32768s ;
        temp:long_name = "Potential temperature" ;
        temp:units = "degrees C" ;
        temp:valid_range = -32767s, 32767s ;
        temp:packing = 4 ;
        temp:cell_methods = "time: mean Time: mean" ;
        temp:time_avg_info = "average_T1,average_T2,average_DT" ;
        temp:coordinates = "geolon_t geolat_t" ;
        temp:standard_name = "sea_water_potential_temperature" ;
        temp:_ChunkSizes = 1, 1, 300, 300 ;
        temp:add_offset = 245.f ;
        temp:scale_factor = 0.00778222f ;
        temp:missing_value = -32768s ;
    ...
}