ec-jrc / pyPoseidon

Framework for Hydrodynamic simulations
https://pyposeidon.readthedocs.io/
European Union Public License 1.2
20 stars 8 forks source link

sflux: spfh and stmp use float64 while prmsl, uwind and vwind are using float32 #157

Closed pmav99 closed 1 year ago

pmav99 commented 1 year ago
<xarray.Dataset>
Dimensions:  (time: 74, nx_grid: 118, ny_grid: 175)
Coordinates:
  * time     (time) float64 0.0 0.04167 0.08333 0.125 ... 2.917 2.958 3.0 3.042
Dimensions without coordinates: nx_grid, ny_grid
Data variables:
    prmsl    (time, nx_grid, ny_grid) float32 ...
    uwind    (time, nx_grid, ny_grid) float32 ...
    vwind    (time, nx_grid, ny_grid) float32 ...
    spfh     (time, nx_grid, ny_grid) float64 ...
    stmp     (time, nx_grid, ny_grid) float64 ...
    lon      (nx_grid, ny_grid) float64 ...
    lat      (nx_grid, ny_grid) float64 ...

AFAIK spfh and stmp are just zeroes, right?

Is there some reason we use float64 for them? If the values are just zeroes, then it may be possible to reduce the size of the sflux files by using float16 instead. Even with float32 there would be some gains.

I assume we get the float64 because this is the default value for dask.array.zeros():

https://github.com/ec-jrc/pyPoseidon/blob/edce2cddbdf29cb0019c5e52ef8a5a05bb3a8c81/pyposeidon/schism.py#L287

dask.array.zeros(1).dtype
# dtype('float64')

We can either specify the dtype directly in .zeros() or use .zeros_like(): https://docs.dask.org/en/stable/generated/dask.array.zeros_like.html

pmav99 commented 1 year ago

Figuring this out is kind of important for the ERA5 hindcast run, where the file sizes may get really big. It might have some importance on the forecast runs, too (less data transfered to/from the cluster, less data read, less RAM).

pmav99 commented 1 year ago

OK, I tested this. We can't use float16 because it is not supported by NetCDF4 library/standard. This is the traceback:

TypeError: illegal primitive data type, must be one of dict_keys(['S1', 'i1', 'u1', 'i2', 'u2', 'i4', 'u4', 'i8', 'u8', 'f4', 'f8']), got float16

Nevertheless float32 seems to be working fine. At least, I run the iceland model and the results seem to be identical. I will push the fix.

tomsail commented 1 year ago

I'll wait on @brey to get his opinion on this but I'd keep them. or at least stmp stmp is Surface Temperature. If float 32 keeps the 1st decimal, ok. spfh is specific humidity. not sure if it is useful @brey?

pmav99 commented 1 year ago

My understanding is that neither one of these two variables is being used at the moment. Nevertheless SCHISM requires them to be present in the input netcdf file. sflux_air is the relevant file: https://schism-dev.github.io/schism/master/input-output/sflux.html

Therefore we just add them in the netcdf and we put zeroes. https://github.com/ec-jrc/pyPoseidon/blob/edce2cddbdf29cb0019c5e52ef8a5a05bb3a8c81/pyposeidon/schism.py#L287-L307

By using float64 instead of float32 we doubled the size of these variables. In practical terms, it was like using an input file with 7 variables instead of 5 (40% increase). For the forecasts this is not an issue, but for the hindcast runs with the ERA5 file size (i.e. disk space) can be a limiting factor.

I believe I discussed this with George at some point in the past, but I think we were only discussing if it is possible to completely remove the variables from the input file. Not changing their data type. Nevertheless, we should indeed wait for his take on this.

pmav99 commented 1 year ago

pinging @vvoukouvalas , too.

brey commented 1 year ago

Indeed the variables are fixed by SCHISM but not needed by us, thus zero. For context see https://github.com/schism-dev/schism/issues/16.

I don't expect problems setting them as float32.