Thomas-Moore-Creative / Climatology-generator-demo

A demonstration / MVP to show how one could build an "interactive" climatology & compositing tool on Gadi HPC.
MIT License

write code to post-process draft netcdf output for delivery ( compression, chunking, metadata ) #9

Open Thomas-Moore-Creative opened 6 months ago

Thomas-Moore-Creative commented 6 months ago

For the 7.26 GB 2D output files, the following code (level 5) compresses to 3.3 GB:

def compress_nc(ds,out_path):
    compression_opts = {
        'zlib': True,        # Enable zlib compression
        'complevel': 5,      # Compression level (1-9)
    }

    # Set the encoding for each variable
    encoding = {var: compression_opts for var in ds.data_vars}

    # Write the dataset to a new NetCDF file with compression
    ds.to_netcdf(out_path, encoding=encoding)

Timing and output size for the 7.26 GB file by compression level:

| complevel | wall time | size |
| --- | --- | --- |
| 1 | 2min 45s | 3.3 GB |
| 5 | 3min 27s | 3.3 GB |
| 9 | 11min 34s | 3.2 GB |
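The level-1 / level-5 / level-9 tradeoff above is a property of zlib itself and can be sketched with Python's `zlib` directly (the byte payload here is a hypothetical stand-in for model output, not BRAN2020 data):

```python
import zlib

# Hypothetical stand-in for model output: repetitive, compressible bytes (~9 MB).
data = b"BRAN2020 sea surface height anomaly " * 250_000

# Compress the same payload at the levels benchmarked above.
sizes = {level: len(zlib.compress(data, level)) for level in (1, 5, 9)}
for level, size in sizes.items():
    print(f"complevel {level}: {size:,} bytes")
```

Higher levels spend more CPU searching for longer matches, which is why level 9 takes several times longer for only a marginal size win.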

Thomas-Moore-Creative commented 6 months ago

In the compress_netcdf notebook there's the following function:

def write_compressed_netcdf(dataset,file_path,compression_level=4):
    encoding = {}
    for var_name in dataset.data_vars:
        encoding[var_name] = {'zlib': True, 'complevel': compression_level}
    dataset.to_netcdf(file_path, encoding=encoding)
Thomas-Moore-Creative commented 6 months ago

NB: @matt-csiro's approach for chunking 3D BRAN2020 is: `xt_ocean = 300`, `yt_ocean = 300`, `st_ocean = -1`, `Time = -1`
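In xarray that spec can be passed straight to `ds.chunk(...)`, where `-1` means one chunk spanning the whole dimension. A minimal helper showing how the `-1` entries resolve (the dimension sizes below are illustrative, not the actual BRAN2020 grid):

```python
def resolve_chunks(dim_sizes, chunk_spec):
    """Turn a chunk spec into concrete per-dimension chunk sizes.

    A value of -1 means 'one chunk spanning the whole dimension',
    matching xarray's ds.chunk() convention.
    """
    return {
        dim: size if chunk_spec.get(dim, -1) == -1 else chunk_spec[dim]
        for dim, size in dim_sizes.items()
    }

# Illustrative dimension sizes (not the real BRAN2020 grid).
dims = {"Time": 365, "st_ocean": 51, "yt_ocean": 1500, "xt_ocean": 3600}
spec = {"xt_ocean": 300, "yt_ocean": 300, "st_ocean": -1, "Time": -1}
print(resolve_chunks(dims, spec))
# → {'Time': 365, 'st_ocean': 51, 'yt_ocean': 300, 'xt_ocean': 300}
```

With a real dataset this would simply be `ds = ds.chunk(spec)`.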

Thomas-Moore-Creative commented 5 months ago

/g/data/es60/users/thomas_moore/clim_demo_results/daily/bran2020_intermediate_results

(base) tm4888@gadi-login-09 /g/data/es60/users/thomas_moore/clim_demo_results/daily/bran2020_intermediate_results ls -l *.nc
-rw-r--r-- 1 tm4888 es60   483373521 May 20 14:34 BRAN2020_base_stats_eta_t_alltime_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   494279790 May 20 15:05 BRAN2020_base_stats_eta_t_el_nino_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   490290307 May 20 14:55 BRAN2020_base_stats_eta_t_la_nina_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   487237913 May 27 11:33 BRAN2020_base_stats_eta_t_neutral_2024.05.22.11.07.01.nc
-rw-r--r-- 1 tm4888 es60   474453729 May 20 14:19 BRAN2020_base_stats_mld_alltime_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   499982008 May 20 14:50 BRAN2020_base_stats_mld_el_nino_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   493199928 May 20 14:40 BRAN2020_base_stats_mld_la_nina_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   486271869 May 27 11:33 BRAN2020_base_stats_mld_neutral_2024.05.22.09.55.36.nc
-rw-r--r-- 1 tm4888 es60 17135943610 May 18 15:22 BRAN2020_base_stats_salt_alltime_2024.05.18.09.19.14.nc
-rw-r--r-- 1 tm4888 es60 17324048046 May 19 14:29 BRAN2020_base_stats_salt_el_nino_2024.05.18.20.03.31.nc
-rw-r--r-- 1 tm4888 es60 17261157454 May 19 07:08 BRAN2020_base_stats_salt_la_nina_2024.05.18.20.03.31.nc
-rw-r--r-- 1 tm4888 es60 17158352065 May 27 11:33 BRAN2020_base_stats_salt_neutral_2024.05.22.02.52.35.nc
-rw-r--r-- 1 tm4888 es60 19143299023 May 18 04:42 BRAN2020_base_stats_temp_alltime_2024.05.18.00.05.08.nc
-rw-r--r-- 1 tm4888 es60 19590262769 May 19 17:26 BRAN2020_base_stats_temp_el_nino_2024.05.19.13.12.37.nc
-rw-r--r-- 1 tm4888 es60 19445789296 May 19 06:54 BRAN2020_base_stats_temp_la_nina_2024.05.18.18.54.38.nc
-rw-r--r-- 1 tm4888 es60 19343013301 May 27 11:34 BRAN2020_base_stats_temp_neutral_2024.05.22.06.24.27.nc
-rw-r--r-- 1 tm4888 es60 23791916518 May 18 15:41 BRAN2020_base_stats_u_alltime_2024.05.18.10.35.04.nc
-rw-r--r-- 1 tm4888 es60 23966192982 May 20 07:18 BRAN2020_base_stats_u_el_nino_2024.05.20.03.19.04.nc
-rw-r--r-- 1 tm4888 es60 23915150415 May 20 05:12 BRAN2020_base_stats_u_la_nina_2024.05.20.00.30.07.nc
-rw-r--r-- 1 tm4888 es60 23897851585 May 27 11:35 BRAN2020_base_stats_u_neutral_2024.05.21.22.21.38.nc
-rw-r--r-- 1 tm4888 es60 24381911568 May 18 15:47 BRAN2020_base_stats_v_alltime_2024.05.18.10.42.35.nc
-rw-r--r-- 1 tm4888 es60 24334060247 May 20 07:21 BRAN2020_base_stats_v_el_nino_2024.05.20.03.13.51.nc
-rw-r--r-- 1 tm4888 es60 24341457122 May 20 07:11 BRAN2020_base_stats_v_la_nina_2024.05.20.03.08.03.nc
-rw-r--r-- 1 tm4888 es60 24383438225 May 27 11:36 BRAN2020_base_stats_v_neutral_2024.05.22.11.17.59.nc
-rw-r--r-- 1 tm4888 es60   361580012 May 20 14:38 BRAN2020_quantile_stats_eta_t_alltime_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   371776318 May 20 15:08 BRAN2020_quantile_stats_eta_t_el_nino_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   369596738 May 20 14:58 BRAN2020_quantile_stats_eta_t_la_nina_2024.05.20.14.30.28.nc
-rw-r--r-- 1 tm4888 es60   366665725 May 27 11:36 BRAN2020_quantile_stats_eta_t_neutral_2024.05.22.11.07.01.nc
-rw-r--r-- 1 tm4888 es60   371868199 May 20 14:23 BRAN2020_quantile_stats_mld_alltime_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   385676782 May 20 14:53 BRAN2020_quantile_stats_mld_el_nino_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   382437241 May 20 14:43 BRAN2020_quantile_stats_mld_la_nina_2024.05.20.14.15.51.nc
-rw-r--r-- 1 tm4888 es60   378458416 May 27 11:36 BRAN2020_quantile_stats_mld_neutral_2024.05.22.09.55.36.nc
-rw-r--r-- 1 tm4888 es60  9916982735 May 18 18:42 BRAN2020_quantile_stats_salt_alltime_2024.05.18.09.19.14.nc
-rw-r--r-- 1 tm4888 es60 11287376016 May 19 18:02 BRAN2020_quantile_stats_salt_el_nino_2024.05.18.20.03.31.nc
-rw-r--r-- 1 tm4888 es60 11054439201 May 19 10:41 BRAN2020_quantile_stats_salt_la_nina_2024.05.18.20.03.31.nc
-rw-r--r-- 1 tm4888 es60 10825667081 May 27 11:37 BRAN2020_quantile_stats_salt_neutral_2024.05.22.02.52.35.nc
-rw-r--r-- 1 tm4888 es60 11149967176 May 18 07:35 BRAN2020_quantile_stats_temp_alltime_2024.05.18.00.05.08.nc
-rw-r--r-- 1 tm4888 es60 12813681330 May 19 21:36 BRAN2020_quantile_stats_temp_el_nino_2024.05.19.13.12.37.nc
-rw-r--r-- 1 tm4888 es60 12550705334 May 19 16:45 BRAN2020_quantile_stats_temp_la_nina_2024.05.19.12.59.34.nc
-rw-r--r-- 1 tm4888 es60 12275763465 May 27 11:37 BRAN2020_quantile_stats_temp_neutral_2024.05.22.06.24.27.nc
-rw-r--r-- 1 tm4888 es60 15794470250 May 18 19:36 BRAN2020_quantile_stats_u_alltime_2024.05.18.10.35.04.nc
-rw-r--r-- 1 tm4888 es60 16899293259 May 20 01:48 BRAN2020_quantile_stats_u_el_nino_2024.05.19.21.37.51.nc
-rw-r--r-- 1 tm4888 es60 16853389695 May 20 00:53 BRAN2020_quantile_stats_u_la_nina_2024.05.19.21.11.12.nc
-rw-r--r-- 1 tm4888 es60 16674048388 May 27 11:38 BRAN2020_quantile_stats_u_neutral_2024.05.21.22.21.38.nc
-rw-r--r-- 1 tm4888 es60 15682542436 May 18 19:13 BRAN2020_quantile_stats_v_alltime_2024.05.18.10.42.35.nc
-rw-r--r-- 1 tm4888 es60 16905918949 May 20 11:43 BRAN2020_quantile_stats_v_el_nino_2024.05.20.07.57.29.nc
-rw-r--r-- 1 tm4888 es60 16813994768 May 20 12:01 BRAN2020_quantile_stats_v_la_nina_2024.05.20.07.49.04.nc
-rw-r--r-- 1 tm4888 es60 16605999410 May 27 11:39 BRAN2020_quantile_stats_v_neutral_2024.05.22.11.17.59.nc
Thomas-Moore-Creative commented 5 months ago

compression - files are now compressed via zlib level 5 when written at the intermediate stage

Thomas-Moore-Creative commented 5 months ago

Merged datasets dictionary with renamed variables

dict_keys(['temp_alltime_ds', 'temp_neutral_ds', 'temp_la_nina_ds', 'temp_el_nino_ds',
 'salt_alltime_ds','salt_neutral_ds', 'salt_la_nina_ds', 'salt_el_nino_ds', 'u_alltime_ds', 
'u_neutral_ds', 'u_la_nina_ds','u_el_nino_ds', 'v_alltime_ds', 'v_neutral_ds', 
'v_la_nina_ds', 'v_el_nino_ds', 'eta_t_alltime_ds', 'eta_t_neutral_ds','eta_t_la_nina_ds',
 'eta_t_el_nino_ds', 'mld_alltime_ds', 'mld_neutral_ds', 'mld_la_nina_ds', 'mld_el_nino_ds'])
# Calculate the total size of all datasets in the dictionary
total_size_gb = sum(merged_dataset.nbytes / (1024**3) for merged_dataset in merged_datasets.values())
print(f"Total size of all datasets: {total_size_gb} GB")

Total size of all datasets: 1991.0576639771461 GB

`du -hsc *.nc` in `/g/data/es60/users/thomas_moore/clim_demo_results/daily/bran2020_intermediate_results` = 532G total
Thomas-Moore-Creative commented 5 months ago

current chunking

3D chunking: (screenshot attached)
2D chunking: (screenshot attached)

Thomas-Moore-Creative commented 5 months ago

write time for 2D output at chunks (12, 1500, 3600):

(screenshot attached)

Thomas-Moore-Creative commented 5 months ago

@matt-csiro - my first attempt to write the 2D `mld` to netcdf at `(1, 300, 300)` chunks grinds along at a snail's pace. What took 3 minutes before is 2% finished after 10 minutes. I could likely be doing this differently . . .

Would you normally use cdo or NCO for this? I'm guessing those approaches require reading things into memory, but would that speed up writing out to many tiny chunks?

(screenshot attached)

matt-csiro commented 5 months ago

Hmm, that's interesting.
The NCO tools have been my go-to for handling chunking, and they typically haven't been so sensitive to chunk sizes.
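For reference, a chunk-plus-deflate rewrite with the command-line tools might look like this (input/output file names and the `Time` dimension name are assumptions; adjust to the actual files):

```shell
# nccopy (from netcdf-c): rechunk and deflate in one pass
nccopy -d 5 -c Time/1,yt_ocean/300,xt_ocean/300 mld_in.nc mld_out.nc

# ncks (from NCO): equivalent operation
ncks -4 -L 5 --cnk_dmn Time,1 --cnk_dmn yt_ocean,300 --cnk_dmn xt_ocean,300 \
    mld_in.nc mld_out.nc
```

Both tools stream through the file rather than going via dask, which may sidestep the tiny-chunk write slowdown seen above.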

Thomas-Moore-Creative commented 5 months ago

@matt-csiro : as expected, if I do everything IN MEMORY then writing all these tiny chunks isn't much of an issue.

(screenshot attached)

Any response on their requests?

Thomas-Moore-Creative commented 5 months ago

approach for encoding compression and chunking into final netcdf

mld_tiny_chunk = mld_tiny_chunk.compute() # data must be IN MEMORY
encoding = {} #setup encoding dict
chunksizes_tuple = (1, 300, 300) #set chunksizes for netcdf write
for var_name in mld_tiny_chunk.data_vars:
    encoding[var_name] = {'zlib': True, 'complevel': 5, 'dtype': 'float32', 'chunksizes': chunksizes_tuple} # encode only the data variables
# Save to NetCDF with chunking and compression encoding
mld_tiny_chunk.to_netcdf('/g/data/es60/users/thomas_moore/clim_demo_results/daily/bran2020_final_results/mld_01300300.nc',
                         engine='netcdf4',encoding=encoding)
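One way to check that the encoding actually landed in the written file is `ncdump -hs`, which prints the per-variable `_ChunkSizes` and `_DeflateLevel` special attributes:

```shell
ncdump -hs /g/data/es60/users/thomas_moore/clim_demo_results/daily/bran2020_final_results/mld_01300300.nc \
    | grep -E '_ChunkSizes|_DeflateLevel'
```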
Thomas-Moore-Creative commented 5 months ago

need to write batch code to combine the intermediate netcdf files into one per core variable
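A first step for that batch job is grouping the intermediate files by core variable. A sketch under the naming pattern in the listing above (the helper name is hypothetical); the per-variable lists could then be fed to e.g. `xarray.open_mfdataset` for the actual merge:

```python
import re
from collections import defaultdict

# ENSO-phase tokens seen in the intermediate file listing above.
PHASES = ("alltime", "neutral", "la_nina", "el_nino")

def group_by_variable(filenames):
    """Group intermediate NetCDF file names by core variable.

    Assumes the BRAN2020_<kind>_stats_<var>_<phase>_<timestamp>.nc
    pattern from the directory listing. (Hypothetical helper.)
    """
    pattern = re.compile(
        r"BRAN2020_(?:base|quantile)_stats_(.+)_(?:%s)_[\d.]+\.nc$" % "|".join(PHASES)
    )
    groups = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)

files = [
    "BRAN2020_base_stats_eta_t_alltime_2024.05.20.14.30.28.nc",
    "BRAN2020_base_stats_eta_t_el_nino_2024.05.20.14.30.28.nc",
    "BRAN2020_quantile_stats_temp_la_nina_2024.05.19.12.59.34.nc",
]
print(group_by_variable(files))
```

The regex backtracks past multi-word variable names like `eta_t`, so `eta_t_la_nina` splits correctly into variable `eta_t` and phase `la_nina`.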