NOC-OI / msm-os

A library to streamline the transfer, update, and deletion of Zarr files within object store environments.
MIT License

The `time_counter` variable is not being split into chunks according to the given chunk strategy #14

Open soutobias opened 1 week ago

soutobias commented 1 week ago

One observed oddity is that the time_counter variable has a chunk size of 31 (presumably fixed by the length of January?), despite the explicit request of:

-cs '{"time_counter": 1, "x": 720, "y": 360}'

in each call. Maybe this is related to the unwarranted warning? Chunk sizes have been respected elsewhere, just not for time_counter itself. E.g.:

ncdump -h -s https://noc-msm-o.s3-ext.jc.rl.ac.uk/npd12-j001-t1d-1976/T1d/#mode=nczarr,s3
netcdf \#mode\=nczarr\,s3 {
dimensions:
    y = 3605 ;
    x = 4320 ;
    nvertex = 4 ;
    time_counter = 366 ;
    axis_nbounds = 2 ;
    ...

    double time_centered(time_counter) ;
        time_centered:bounds = "time_centered_bounds" ;
        time_centered:calendar = "gregorian" ;
        time_centered:long_name = "Time axis" ;
        time_centered:standard_name = "time" ;
        time_centered:time_origin = "1900-01-01 00:00:00" ;
        time_centered:units = "seconds since 1900-01-01 00:00:00" ;
        time_centered:_Storage = "chunked" ;
        time_centered:_ChunkSizes = 1 ;
        time_centered:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_centered:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_centered:_Endianness = "little" ;
    double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:bounds = "time_counter_bounds" ;
        time_counter:calendar = "gregorian" ;
        time_counter:long_name = "Time axis" ;
        time_counter:standard_name = "time" ;
        time_counter:time_origin = "1900-01-01 00:00:00" ;
        time_counter:units = "seconds since 1900-01-01" ;
        time_counter:_Storage = "chunked" ;
    VVVVVVVVVVVVVVVVVVVVVVV
        time_counter:_ChunkSizes = 31 ;
    ^^^^^^^^^^^^^^^^^^^^^^^^^
        time_counter:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_counter:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_counter:_Endianness = "little" ;
    float tossq_con(time_counter, y, x) ;
        tossq_con:cell_methods = "time: mean (interval: 300 s)" ;
        tossq_con:coordinates = "time_centered nav_lat nav_lon" ;
        tossq_con:interval_operation = "300 s" ;
        tossq_con:interval_write = "1 d" ;
        tossq_con:long_name = "square_of_sea_surface_conservative_temperature" ;
        tossq_con:missing_value = 1.00000002004088e+20 ;
        tossq_con:online_operation = "average" ;
        tossq_con:standard_name = "square_of_sea_surface_temperature" ;
        tossq_con:units = "degC2" ;
        tossq_con:_Storage = "chunked" ;
        tossq_con:_ChunkSizes = 1, 360, 720 ;
        tossq_con:_Filter = "32001,0,0,0,0,5,1,1" ;
        tossq_con:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;

I’ve added debug output (not present in the production code) to investigate why the time_counter variable isn’t being chunked as expected:

   if len(new_chunking.keys()) > 0:
       print(f"Rechunking {variable} to {new_chunking}")
       ds_filepath[variable] = ds_filepath[
           variable
       ].chunk(new_chunking)
       print(f"New chunking: {ds_filepath[variable].chunks}")

The output I'm seeing is:

   Rechunking tossq_con to {'time_counter': 1, 'x': 720, 'y': 360}
   New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
   Rechunking nav_lat to {'x': 720, 'y': 360}
   New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
   Rechunking nav_lon to {'x': 720, 'y': 360}
   New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
   Rechunking time_centered to {'time_counter': 1}
   New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),)
   Rechunking time_counter to {'time_counter': 1}
   New chunking: None

It seems that time_counter isn't being chunked even though the code runs without errors.
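The symptom can be reproduced outside msm-os: xarray holds dimension coordinates in memory as pandas indexes, so calling `.chunk()` on them is a silent no-op, which is why `.chunks` comes back as `None`. A minimal sketch with hypothetical toy data (names borrowed from the dataset above, sizes invented):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"tossq_con": (("time_counter",), np.arange(5.0))},
    coords={"time_counter": np.arange(5)},
)

# Data variables pick up the requested chunking...
chunked = ds["tossq_con"].chunk({"time_counter": 1})
print(chunked.chunks)  # ((1, 1, 1, 1, 1),)

# ...but the dimension coordinate itself does not: it is backed by an
# in-memory index, so .chunk() leaves it as a plain numpy array.
coord = ds["time_counter"].chunk({"time_counter": 1})
print(coord.chunks)  # None
```

This matches the debug output above: every variable reports its new chunks except time_counter, which reports `None`.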

soutobias commented 1 week ago

As pointed out by @accowa, this could be an xarray bug (https://github.com/pydata/xarray/issues/6204).

Given this, I think the only ways to solve the problem now are either to rename the coordinate (which I don't recommend) or to use the zarr library directly (rather than xarray) for the first upload. In that case, right after uploading the data, I would open the time_counter array and rechunk it to the chosen strategy. Data appended later will automatically follow the chunking defined in the first upload.