c-scale-community / workflow-coastal-hydrowaq

Porting and deploying the HiSea use case on C-SCALE
Apache License 2.0
3 stars 1 forks source link

problem with time units encoding when writing netcdf using xarray #32

Closed backeb closed 1 year ago

backeb commented 1 year ago

in e.g. cmems_bottomT.nc the time variable has the following

int64 time(time) ;
                time:long_name = "Time (hours since 1950-01-01)" ;
                time:standard_name = "time" ;
                time:axis = "T" ;
                time:_CoordinateAxisType = "Time" ;
                time:valid_min = 633324.f ;
                time:valid_max = 633324.f ;
                time:units = "days since 2022-04-01 12:00:00" ; ⬅️ note this ❗  
                time:calendar = "proleptic_gregorian" ;

This seems to be related to a problem in download_cmems_*.py, probably lines

ds = xr.open_mfdataset('/data/tmp/cmems_'+var+'_*.nc')
ds.to_netcdf('/data/cmems_'+var+'.nc')

Can be fixed with e.g. ds.time.encoding['units'] = 'hours since 1950-01-01 00:00:00' or ds.to_netcdf('file_splitted_2005.nc','time':{'unit s': "seconds since 1900-01-01 00:00:00"}})

backeb commented 1 year ago

in /home/centos/data/download/tmp doing

$ ds_mf = xr.open_mfdataset('cmems_bottomT_*.nc')    
$ ds_mf.to_netcdf('../test1.nc')                                                                                                                                            

and checking the output using ncdump

$ ncdump -h test1.nc 
int64 time(time) ;
                time:long_name = "Time (hours since 1950-01-01)" ;
                time:standard_name = "time" ;
                time:axis = "T" ;
                time:_CoordinateAxisType = "Time" ;
                time:units = "days since 2022-04-01 12:00:00" ; ❗ 
                time:calendar = "proleptic_gregorian" ;

The time:units are wrong.

Doing $ ds_mf.to_netcdf('../test2.nc', encoding={'time':{'units': "hours since 1950-01-01 00:00:00"}}) gives the correct time units:

int64 time(time) ;
                time:long_name = "Time (hours since 1950-01-01)" ;
                time:standard_name = "time" ;
                time:axis = "T" ;
                time:_CoordinateAxisType = "Time" ;
                time:units = "hours since 1950-01-01" ;
                time:calendar = "proleptic_gregorian" ;
backeb commented 1 year ago
ln [3]: from netCDF4 import Dataset
In [4]: ds = Dataset('cmems_bottomT.nc')                                                                                                                                                              
In [7]: ds.variables['time'][:]                                                                                  
Out[7]: 
masked_array(data=[633324.0, --],
             mask=[False,  True],
       fill_value=nan)

vs

In [9]: ds = Dataset('tmp/cmems_bottomT_2022-04-01.nc')                                                          
In [10]: ds.variables['time'][:]                                                                                 
Out[10]: 
masked_array(data=[633324.],
             mask=False,
       fill_value=1e+20,
            dtype=float32)

Try:

Ln [22]: dsmf = xr.open_mfdataset('*bottomT*.nc', combine='by_coords', decode_times=False)                                                                                                          
In [23]: dsmf.to_netcdf('../test2.nc')      

In [8]: ds2.variables['time'][:]                                                                                                                                                                    
Out[8]: 
masked_array(data=[633324., 633348.],
             mask=False,
       fill_value=1e+20)

and

In [14]: for path in sorted(Path('/home/centos/data/download/tmp/').rglob('*bottomT*.nc')): 
    ...:     print(path) 
    ...:     ds = xr.open_dataset(path, decode_times=False) 
    ...:     datasets.append(ds) 

In [15]: combined = xr.concat(datasets, dim='time')                                                                                                                                                 
In [16]: combined.to_netcdf('../test.nc') 

In [2]: from netCDF4 import Dataset                                                                                                                                                                 
In [3]: ds = Dataset('test.nc')                                                                                                                                                                     
In [4]: ds.variables['time'][:]                                                                                                                                                                     
Out[4]: 
masked_array(data=[633324., 633348.],
             mask=False,
       fill_value=1e+20)
backeb commented 1 year ago

It was a xarray version issue. The below approach works in xarray=0.14.1

dsmf = xr.open_mfdataset('*bottomT*.nc', combine='by_coords', decode_times=False)                                                                                                          
dsmf.to_netcdf('../test2.nc') 

Added - xarray=0.14.1 to environment.yml