ArcticSnow / TopoPyScale

TopoPyScale: a Python library to perform simplistic climate downscaling at the hillslope scale
https://topopyscale.readthedocs.io
MIT License

MASSIVE bug in downscaling #60

Closed: ArcticSnow closed this issue 1 year ago

ArcticSnow commented 1 year ago

With the latest commit (988a7b2) the outputs are totally off: full of noise and incoherent in time. The same outputs are fine with commit e30c720.

This is the unwanted output for LW: [image]

This is the correct downscaling of LW, for comparison: [image]

ArcticSnow commented 1 year ago

This unwanted behavior was introduced when merging the parallelization branch, at commit 51d2d08.

ArcticSnow commented 1 year ago

I nailed down where the problem is introduced: it happens within the function _subset_climate_dataset(). https://github.com/ArcticSnow/TopoPyScale/blob/6a4c68e9a0e2f9f1c66427924e75d8c126fbd963/TopoPyScale/topo_scale.py#L182

When this function is called via pool.starmap(fun, fun_param), the results get scrambled.
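
For context, the parallel call pattern is roughly the following (a minimal, runnable sketch; the worker body and argument list are hypothetical stand-ins for the real _subset_climate_dataset() signature):

from multiprocessing import Pool

def _subset_worker(pt_id, out_dir):
    # stand-in for _subset_climate_dataset(): subset the climate dataset for
    # one point and write a temporary NetCDF, e.g. f'{out_dir}/ds_surf_pt_{pt_id}.nc'
    return pt_id

if __name__ == '__main__':
    fun_param = [(pt_id, 'outputs/tmp') for pt_id in range(4)]  # one tuple of args per point
    with Pool(2) as pool:
        pool.starmap(_subset_worker, fun_param)  # each worker writes its own file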

ArcticSnow commented 1 year ago

The problem is introduced when saving the temporary dataset to disk with ds_tmp.to_netcdf(f'outputs/tmp/ds_{type}_pt_{pt_id}.nc', engine='h5netcdf')

The variable ds_tmp contains the correct values when plotted, but the NetCDF file does not.
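
A quick way to confirm this is a round-trip check (a sketch reusing the ds_tmp, type and pt_id names from the snippet above):

import xarray as xr

path = f'outputs/tmp/ds_{type}_pt_{pt_id}.nc'
ds_tmp.to_netcdf(path, engine='h5netcdf')

# re-open the file and compare with the in-memory dataset; with the encoding
# inherited from ERA5 the values do not survive the round trip
ds_check = xr.open_dataset(path, engine='h5netcdf')
xr.testing.assert_allclose(ds_tmp, ds_check)  # raises AssertionError on the bad files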

ArcticSnow commented 1 year ago

I am not the first one to run into this problem: https://github.com/pydata/xarray/issues/7039

This seems to be related to how ERA5 data are encoded in NetCDF (packed int16 with scale_factor and add_offset). We therefore need to force the encoding when saving to NetCDF, with the following lines:

# force the encoding at write time: zlib compression for every data variable
comp = dict(zlib=True, complevel=5)
encoding = {var: comp for var in ds_tmp.data_vars}
ds_tmp.to_netcdf(f'outputs/tmp/ds_{type}_pt_{pt_id}.nc', engine='h5netcdf', encoding=encoding)

This fix was found on Stack Overflow: https://stackoverflow.com/questions/40766037/specify-encoding-compression-for-many-variables-in-xarray-dataset-when-write-to

veenstrajelmer commented 1 year ago

@ArcticSnow A more desirable solution might be to remove the dtype key from the encoding dict. See also my answer to the Stack Overflow question you linked: https://stackoverflow.com/a/75665502/5578371
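
Applied to the snippet above, that suggestion would look roughly like this (a sketch; with the dtype key removed, xarray should fall back to writing floats instead of clipping to int16):

for var in ds_tmp.data_vars:
    ds_tmp[var].encoding.pop('dtype', None)  # drop the int16 dtype inherited from ERA5
ds_tmp.to_netcdf(f'outputs/tmp/ds_{type}_pt_{pt_id}.nc', engine='h5netcdf')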

ArcticSnow commented 1 year ago

Thank you for your suggestion @veenstrajelmer. I had actually read your comment on Stack Overflow but came up short of figuring out how to actually do it. If you have a concrete suggestion, it could be interesting. The other point is that compressing the stored files improves disk space usage. Or the technique may be to compute a new offset and convert all variables to int. I have a method in place, but it is not yet optimal. Please have a look at: https://github.com/ArcticSnow/TopoPyScale/blob/494f4e7ea17830ba3d23627bf22ee200a6c4f082/TopoPyScale/topo_export.py#L21.

Any suggestions on how to improve this are also welcome. I find these methods not so straightforward to figure out. Thank you.
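
For reference, the packing arithmetic behind that method is roughly the following (a minimal sketch with illustrative names and values; the actual implementation is in topo_export.py, linked above):

import numpy as np

def compute_scale_and_offset(vmin, vmax, n=16):
    # (2**n - 1) quantization steps across the [vmin, vmax] range
    scale_factor = (vmax - vmin) / (2**n - 1)
    # map vmin onto the lowest value of an n-bit signed integer
    add_offset = vmin + 2**(n - 1) * scale_factor
    return scale_factor, add_offset

sf, off = compute_scale_and_offset(230.0, 320.0)        # e.g. air temperature in K
packed = np.round((300.0 - off) / sf).astype(np.int16)  # encode
restored = packed * sf + off                            # decode (CF convention)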

veenstrajelmer commented 1 year ago

That is a great suggestion @ArcticSnow. I have tested your implementation and it works quite well. The only thing I notice is some loss of accuracy, in the range of scale_factor/2, but that makes sense. What would you say still needs improving? I was wondering why you did not use add_offset = (vmax + vmin) / 2, but that might be my lack of experience. The difference is scale_factor/2.

I have slightly edited your function to derive n from da.encoding['dtype'] and to take a ds instead of a da as input argument, such that ds = recompute_scaling_and_offset(ds) works. You can find it here: https://github.com/Deltares/dfm_tools/blob/6c12fd4b4f02f23b695338019baaeabd88dbc8b7/dfm_tools/xarray_helpers.py#L147
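
For what it is worth, a quick numerical check of that difference (example values hypothetical):

vmin, vmax, n = 230.0, 320.0, 16
sf = (vmax - vmin) / (2**n - 1)
off_a = vmin + 2**(n - 1) * sf  # offset as in the implementation
off_b = (vmin + vmax) / 2       # the alternative suggested above
print(off_a - off_b, sf / 2)    # both print the same value: half a quantization step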

ArcticSnow commented 1 year ago

@veenstrajelmer Nice! I think I will reuse your function with another evolution. ;)

I suspect improvements are possible: when I zip the NetCDFs produced with this encoding, zip is able to deflate the files by up to 56%. Maybe using int8 instead of int16 would do it.

FYI, I had found the idea and original implementation here: https://stackoverflow.com/questions/70102997/scale-factor-and-add-offset-in-xarray-to-netcdf-lead-to-some-small-negative-valu

veenstrajelmer commented 1 year ago

Hmm, maybe it is indeed easier to compress it; I recently learned that this makes it possible to store data as float32 in the same file size as uncompressed int16. That gives the same file size and is somewhat easier to implement, I guess. The Stack Overflow suggestion dropped all other encoding, which is not super convenient.

Here it is done only for the msl variable, but you get the point; a loop over all variables is sketched below.

# drop the packed-integer encoding inherited from ERA5 and compress instead
ds.msl.encoding.pop('dtype', None)
ds.msl.encoding.pop('scale_factor', None)
ds.msl.encoding.pop('add_offset', None)
ds.msl.encoding['zlib'] = True
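
The same cleanup applied to every data variable is a short loop (a sketch):

for var in ds.data_vars:
    for key in ('dtype', 'scale_factor', 'add_offset'):
        ds[var].encoding.pop(key, None)  # drop packed-int encoding if present
    ds[var].encoding['zlib'] = True      # compress instead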

However, I would still expect xarray to give a warning when it writes data that is not within the valid range of the scale_factor and add_offset.