NOC-OI / msm-os

A library to streamline the transfer, update, and deletion of Zarr files within object store environments.
MIT License
1 stars 0 forks source link

--chunk_strategy not recognised (but -cs works) #4

Closed accowa closed 2 days ago

accowa commented 2 months ago

msm_os: error: unrecognized arguments: --chunk_strategy {"time_counter": 1, "x": 720, "y": 360}

but changing --chunk_strategy to -cs is accepted.

Mis-match between documentation and code in msm-os/src/msm_os/cli/argument_parser.py:

  parser.add_argument(
        "-cs",
        "--chunk-strategy",
        dest="chunk_strategy",

Documentation says--chunk_strategy, code accepts --chunk-strategy

accowa commented 1 month ago

Actually, the issues are deeper than just a documentation error. Even when accepted, the supplied chunking information is not applied. chunk_strategy is set but I can't see where it gets used and the test on rechunk is never true.

chunk_strategy needs to be added to the send arguments in main_cli.py:

index 0688fd8..c4aebf3 100644
--- a/src/msm_os/cli/main_cli.py
+++ b/src/msm_os/cli/main_cli.py
@@ -59,6 +59,7 @@ def process_action(args):
             append_dim=args.append_dim,
             send_vars_indep=not send_vars_indep,
             object_prefix=args.object_prefix,
+            rechunk=args.chunk_strategy,
             to_zarr_kwargs=None,
         )

This would suggest that some recent changes have not being thoroughly tested. :(

accowa commented 1 month ago

That change helps but doesn't give complete control over chunking. Running this, for example,:

msm_os send --file "./eORCA12_1d_grid_T_197602-197602.nc" --bucket "npd12-j001-test" --append-dim "time_counter" --credentials "mycreds" -v "tos_con" -cs '{"time_counter": 1, "x": 720, "y": 360}'

correctly chunks the requested variable but not the associated coordinate fields (it is very unclear when these get added)

/home/users/acc/newTOOLS/nczarr/newjas/bin/ncdump -h -s https://noc-msm-o.s3-ext.jc.rl.ac.uk/npd12-j001-test/T1d#mode=nczarr,s3
netcdf T1d {
dimensions:
    y = 3605 ;
    x = 4320 ;
    time_counter = 29 ;

// global attributes:
        :_SuperblockVersion = 0 ;
        :_IsNetcdf4 = 1 ;
        :_Format = "netCDF-4" ;

group: tos_con.zarr {
  variables:
    float nav_lat(y, x) ;
        nav_lat:bounds = "bounds_nav_lat" ;
        nav_lat:long_name = "Latitude" ;
        nav_lat:standard_name = "latitude" ;
        nav_lat:units = "degrees_north" ;
        nav_lat:_Storage = "chunked" ;
        nav_lat:_ChunkSizes = 451, 540 ;
        nav_lat:_Filter = "32001,0,0,0,0,5,1,1" ;
        nav_lat:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        nav_lat:_Endianness = "little" ;
    float nav_lon(y, x) ;
        nav_lon:bounds = "bounds_nav_lon" ;
        nav_lon:long_name = "Longitude" ;
        nav_lon:standard_name = "longitude" ;
        nav_lon:units = "degrees_east" ;
        nav_lon:_Storage = "chunked" ;
        nav_lon:_ChunkSizes = 451, 540 ;
        nav_lon:_Filter = "32001,0,0,0,0,5,1,1" ;
        nav_lon:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        nav_lon:_Endianness = "little" ;
    double time_centered(time_counter) ;
        time_centered:bounds = "time_centered_bounds" ;
        time_centered:calendar = "gregorian" ;
        time_centered:long_name = "Time axis" ;
        time_centered:standard_name = "time" ;
        time_centered:time_origin = "1900-01-01 00:00:00" ;
        time_centered:units = "seconds since 1900-01-01" ;
        time_centered:_Storage = "chunked" ;
        time_centered:_ChunkSizes = 29 ;
        time_centered:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_centered:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_centered:_Endianness = "little" ;
    double time_counter(time_counter) ;
        time_counter:axis = "T" ;
        time_counter:bounds = "time_counter_bounds" ;
        time_counter:calendar = "gregorian" ;
        time_counter:expected_size = 29 ;
        time_counter:long_name = "Time axis" ;
        time_counter:standard_name = "time" ;
        time_counter:time_origin = "1900-01-01 00:00:00" ;
        time_counter:units = "seconds since 1900-01-01" ;
        time_counter:_Storage = "chunked" ;
        time_counter:_ChunkSizes = 29 ;
        time_counter:_Filter = "32001,0,0,0,0,5,1,1" ;
        time_counter:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        time_counter:_Endianness = "little" ;
    float tos_con(time_counter, y, x) ;
        tos_con:cell_methods = "time: mean (interval: 300 s)" ;
        tos_con:coordinates = "nav_lat nav_lon time_centered" ;
        tos_con:interval_operation = "300 s" ;
        tos_con:interval_write = "1 d" ;
        tos_con:long_name = "sea_surface_conservative_temperature" ;
        tos_con:online_operation = "average" ;
        tos_con:standard_name = "null" ;
        tos_con:units = "null" ;
        tos_con:_Storage = "chunked" ;
        tos_con:_ChunkSizes = 1, 360, 720 ;
        tos_con:_Filter = "32001,0,0,0,0,5,1,1" ;
        tos_con:_Codecs = "[{\"blocksize\": 0, \"clevel\": 5, \"cname\": \"lz4\", \"id\": \"blosc\", \"shuffle\": 1}]" ;
        tos_con:_Endianness = "little" ;

  // group attributes:
        :expected_checksum_1976-02-01T12\:00\:00.000000000 = 697648058032063542LL ;
        :expected_coords = "[\"none\"]" ;
        :expected_variables = "[\"tos_con\"]" ;
  } // group tos_con.zarr
}
soutobias commented 1 month ago

@accowa, I created a pull request to fix this error.

I tested this rechunk strategy only in the NEMO cylc package, but not in the command line code :(

My initial idea was to just rechunk the data (not the coordinates). But now I updated _rechunk_ds to include the coordinates too.

I'm running some tests and will merge after everything is working well.

accowa commented 1 month ago

I tested this rechunk strategy only in the NEMO cylc package, but not in the command line code :(

Not convinced that should make a difference since the scripts in NEMO_cylc end up calling the same command-line code

soutobias commented 1 month ago

In the NEMO_cylc, it calls the function directly in python https://github.com/NOC-OI/NEMO_cylc/blob/0cdcdef8c0855948436d3c8ffbfdd629cfdfab99/bin/transfer_to_os.py#L62 , not in the command line. In any case, I already corrected it. I will perform some tests tomorrow and I will merge the branch

accowa commented 1 month ago

In the NEMO_cylc, it calls the function directly in python https://github.com/NOC-OI/NEMO_cylc/blob/0cdcdef8c0855948436d3c8ffbfdd629cfdfab99/bin/transfer_to_os.py#L62 , not in the command line.

Ah I see. Having struggled with dask, I've been working with a modified version of: transfer_to_os.sh.serial as a template. That one does use the command-line form.