NOC-OI / msm-os

A library to streamline the transfer, update, and deletion of Zarr files within object store environments.
MIT License

Unwarranted warning #11

Closed accowa closed 1 month ago

accowa commented 1 month ago

Version 0.1.1 seems to be working much better in initial tests. I'm just writing a series of monthly 1d means to the same bucket via successive calls with identical arguments (except for the input filename). I'm getting this warning on the second and subsequent calls:

☁ msm_os ☁ | WARNING | 2024-09-13 15:49:49 | You already have data in the object store and you can't rechunk it

which is unwarranted since the chunk settings haven't changed?

accowa commented 1 month ago

Successfully wrote a whole year of eORCA12 1 day T-grid mean fields to a bucket (s3://npd12-j001-t1d-1976). Each month was successfully appended in turn, but the whole sequence took over 2 hours to complete. This isn't going to be feasible for all the output (the remaining U- and V-grid 1d means, all 5d means, all monthly means and annual means) unless the parallelisation via dask allows significant speed-up and is reliable.

One observed oddity is that the time_counter variable has a chunk size of 31 (presumably fixed by the length of January?). This is despite the explicit request of:

-cs '{"time_counter": 1, "x": 720, "y": 360}'

in each call. Maybe this is related to the unwarranted warning? Chunksizes have been respected elsewhere; just not for time_counter itself. E.g.:


ncdump -h -s https://noc-msm-o.s3-ext.jc.rl.ac.uk/npd12-j001-t1d-1976/T1d/#mode=nczarr,s3
netcdf \#mode\=nczarr\,s3 {
dimensions:
    y = 3605 ;
    x = 4320 ;
    nvertex = 4 ;
    time_counter = 366 ;
    axis_nbounds = 2 ;
.
.

    double time_centered(time_counter) ;
        time_centered:bounds = "time_centered_bounds" ;
        time_centered:calendar = "gregorian" ;
        time_centered:long_name = "Time axis" ;
        time_centered:standard_name = "time" ;
        time_centered:time_origin = "1900-01-01 00:00:00" ;
        time_centered:units = "seconds since 1900-01-01 00:00:00" ;
        time_centered:_Storage = "chunked" ;
        time_centered:_ChunkSizes = 1 ;
        time_centered:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_centered:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_centered:_Endianness = "little" ;
    double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:bounds = "time_counter_bounds" ;
        time_counter:calendar = "gregorian" ;
        time_counter:long_name = "Time axis" ;
        time_counter:standard_name = "time" ;
        time_counter:time_origin = "1900-01-01 00:00:00" ;
        time_counter:units = "seconds since 1900-01-01" ;
        time_counter:_Storage = "chunked" ;
    VVVVVVVVVVVVVVVVVVVVVVV
        time_counter:_ChunkSizes = 31 ;
    ^^^^^^^^^^^^^^^^^^^^^^^^^
        time_counter:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_counter:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_counter:_Endianness = "little" ;
    float tossq_con(time_counter, y, x) ;
        tossq_con:cell_methods = "time: mean (interval: 300 s)" ;
        tossq_con:coordinates = "time_centered nav_lat nav_lon" ;
        tossq_con:interval_operation = "300 s" ;
        tossq_con:interval_write = "1 d" ;
        tossq_con:long_name = "square_of_sea_surface_conservative_temperature" ;
        tossq_con:missing_value = 1.00000002004088e+20 ;
        tossq_con:online_operation = "average" ;
        tossq_con:standard_name = "square_of_sea_surface_temperature" ;
        tossq_con:units = "degC2" ;
        tossq_con:_Storage = "chunked" ;
        tossq_con:_ChunkSizes = 1, 360, 720 ;
        tossq_con:_Filter = "32001,0,0,0,0,5,1,1" ;
        tossq_con:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;

soutobias commented 1 month ago

Hi @accowa ,

I included that warning to inform users that once new data is added to the object store, rechunking is not possible. However, I’ll adjust the warning so that it only appears when users attempt to modify the chunking strategy.
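
For illustration, here is a minimal sketch of that adjusted check; the function name and the dict arguments are hypothetical stand-ins, not the actual msm_os internals:

    import logging

    logger = logging.getLogger("msm_os")

    def warn_if_rechunk_requested(requested_chunks: dict, store_chunks: dict) -> None:
        """Warn only when the requested chunking differs from what is already stored."""
        changed = {
            dim: (store_chunks.get(dim), size)
            for dim, size in requested_chunks.items()
            if store_chunks.get(dim) != size
        }
        if changed:
            logger.warning(
                "You already have data in the object store and you can't rechunk it: %s",
                changed,
            )

With identical settings on every call (e.g. {"time_counter": 1, "x": 720, "y": 360}), the dict is empty and nothing is logged.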

Regarding the time_counter data, I’ll investigate why it isn't being chunked the same way as the other variables. From a quick review of the code, there doesn’t seem to be a specific reason for excluding time_counter from chunking.

In my opinion, given that it's a 1D array of time data, we may not need to chunk it. If you prefer, I can include it in the chunking strategy.

As for the execution time, Dask does speed up data uploads by parallelizing the transfer of each variable, but I am not sure how much performance will improve. However, there are two main bottlenecks right now:

Please let me know if you're happy with these changes.

accowa commented 1 month ago

The time_counter variable itself doesn't really need to be chunked, but a chunk size of 1 does need to be applied to the time_counter dimension, because most accesses of 2D and 3D slices/volumes will be done a time-slice at a time. All variables except time_counter have been chunked correctly, so it is odd that time_counter has been treated differently. We should get to the bottom of that.
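
For context, a sketch of the access pattern being optimised for (the endpoint and anonymous-access settings are assumptions; the bucket path is the one used above):

    import xarray as xr

    ds = xr.open_zarr(
        "s3://npd12-j001-t1d-1976/T1d",
        storage_options={
            "anon": True,  # assumption: anonymous read access
            "client_kwargs": {"endpoint_url": "https://noc-msm-o.s3-ext.jc.rl.ac.uk"},
        },
    )

    # With a time_counter chunk of 1, this reads only the objects holding day 0
    # (one per 360x720 spatial tile). With a time chunk of 31, the same slice
    # would pull a whole month of bytes per tile.
    sst_day = ds["tossq_con"].isel(time_counter=0).load()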

Yes, I think a flag to switch off the integrity check would be useful. The checks are useful when establishing new workflows, but we may need to take risks to achieve a workable solution. At least such a flag will show whether the checks are indeed a major bottleneck.

If it helps, I see the priority order as:

  1. Flag to suppress integrity checks
  2. Dask robustness
  3. time_counter chunking
  4. Suppress unwarranted warning

soutobias commented 1 month ago

Hi @accowa ,

I've implemented the following changes:

  1. Flag to suppress integrity checks: Added a new flag skip_integrity_check (-si or --skip-integrity-check). By default, it's set to False (integrity checks are applied). I'll update the NEMO_CYLC documentation and codes accordingly.

  2. Dask integration for checksum calculation: I've integrated Dask for the checksum calculation, which in my tests has shown a 2-5x speed improvement. However, this performance boost will vary depending on the machine and whether Dask is already being used to load data into the object store. (A rough sketch of the idea follows after this list.)

  3. time_counter chunking: I've added debug output (not present in the production code) to investigate why the time_counter variable isn't being chunked as expected:

    if len(new_chunking.keys()) > 0:
        # Apply the requested per-dimension chunk sizes to this variable.
        print(f"Rechunking {variable} to {new_chunking}")
        ds_filepath[variable] = ds_filepath[variable].chunk(new_chunking)
        print(f"New chunking: {ds_filepath[variable].chunks}")

    The output I'm seeing is:

    Rechunking tossq_con to {'time_counter': 1, 'x': 720, 'y': 360}
    New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
    Rechunking nav_lat to {'x': 720, 'y': 360}
    New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
    Rechunking nav_lon to {'x': 720, 'y': 360}
    New chunking: ((360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 5), (720, 720, 720, 720, 720, 720))
    Rechunking time_centered to {'time_counter': 1}
    New chunking: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),)
    Rechunking time_counter to {'time_counter': 1}
    New chunking: None

    It seems that time_counter isn't being chunked even though the code runs without errors. To clarify, the time_counter variable is not the same as the time_counter dimension; it merely holds the timestamp values along that dimension. While it's odd that time_counter is being handled differently, it doesn't affect the chunking strategy or the data itself. I'll open a new issue to address this in more detail later.

  4. Suppress unwarranted warnings: I’ve adjusted the warning so it will only appear if the chunk strategy being applied differs from the original one. No warnings will be shown if the chunk strategy remains unchanged.
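
To make item 2 concrete, the rough shape of the idea is one Dask task per stored object, hashed in parallel. This is a sketch, not the actual msm_os code; it assumes the store behaves as a key-to-bytes mapping (e.g. an fsspec or zarr store):

    import hashlib

    import dask

    @dask.delayed
    def _object_checksum(store, key):
        # Fetch one object's bytes and hash them.
        return key, hashlib.sha256(store[key]).hexdigest()

    def parallel_checksums(store, keys):
        # One delayed task per object; dask.compute runs them concurrently.
        return dict(dask.compute(*[_object_checksum(store, k) for k in keys]))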

accowa commented 1 month ago

The time_counter non-chunking is likely an xarray bug. This looks relevant: https://github.com/pydata/xarray/issues/6204
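
For reference, a minimal reproduction of that xarray behaviour with synthetic data (shapes borrowed from this thread): Dataset.chunk() rechunks data variables, but dimension coordinates are index variables kept as in-memory numpy arrays, so their .chunks stays None.

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        {"tossq_con": (("time_counter", "y", "x"), np.zeros((31, 360, 720)))},
        coords={"time_counter": np.arange(31)},
    ).chunk({"time_counter": 1, "y": 360, "x": 720})

    print(ds["tossq_con"].chunks)     # ((1, 1, ..., 1), (360,), (720,))
    print(ds["time_counter"].chunks)  # None: the dimension coordinate is untouched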

soutobias commented 1 month ago

Thanks for pointing me to this link. Given this, I think the only ways to solve the problem now are either to rename the coordinate (which I don't recommend) or to use the zarr library (rather than xarray) for the first upload of the data. In that case, right after uploading, I would open the time_counter variable and rechunk it to the chosen strategy; any data appended later will automatically follow the chunk strategy defined in the first upload. I will add this information to the new issue #14.
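
A rough sketch of that zarr-side fix (zarr 2.x API; the store path and write access are assumptions): since zarr cannot rechunk in place, read the coordinate values back, delete the array, and recreate it with the desired chunking.

    import zarr

    store = zarr.storage.FSStore("s3://npd12-j001-t1d-1976/T1d")  # assumed path
    root = zarr.open_group(store, mode="r+")

    old = root["time_counter"]
    values = old[:]                   # small 1-D coordinate, safe to load fully
    attrs, dtype = dict(old.attrs), old.dtype

    del root["time_counter"]          # rechunking means rewriting the array
    new = root.create_dataset("time_counter", data=values, chunks=(1,), dtype=dtype)
    new.attrs.update(attrs)           # restore _ARRAY_DIMENSIONS etc. for xarray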