FESOM / fesom2

Multi-resolution ocean general circulation model.
http://fesom.de/
GNU General Public License v3.0

feat(compression): WIP, start of allowing FESOM to write compressed output #587

Closed pgierz closed 4 months ago

pgierz commented 4 months ago

Tag to @JanStreffing for more work on this.

Basics: roughly 50% less disk space for roughly 10% more wall time, subject to mesh choices, scalability, etc.

JanStreffing commented 4 months ago

@koldunovn The way to get zstd in is just to compile netCDF and HDF5 the correct way, as described here: https://github.com/FESOM/FESOM_compression/blob/main/README.md, correct?

JanStreffing commented 4 months ago

Works and IMO can be merged as is. List of Benchmarks:

CORE2, 128 nodes, monthly 3D output plus some daily 2D:
- 1 month, no compression: 7m25s
- 1 month, compression level 9: 7m36s

@pgierz, can you add the outdata volumes?

More tests at larger number of nodes and with bigger meshes later.
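
To put the two wall times above into a relative number, here is a minimal sketch that converts the reported durations to seconds and computes the overhead. The helper name `parse_duration` is hypothetical, and the result reflects this single CORE2 run only, not a general statement about scaling.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '7m25s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return 60 * minutes + seconds


# Wall times quoted in the benchmark above (1 month, CORE2, 128 nodes).
baseline = parse_duration("7m25s")    # no compression
compressed = parse_duration("7m36s")  # compression level 9

overhead_pct = 100.0 * (compressed - baseline) / baseline
print(f"extra wall time with level 9: {compressed - baseline}s ({overhead_pct:.1f}%)")
```

For this run the difference is 11 seconds, about 2.5% of the uncompressed wall time.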

koldunovn commented 4 months ago

https://github.com/FESOM/FESOM_compression/blob/main/README.md

Not sure it's the right reference, but as I understand it, yes: you have to build a netCDF that supports the new "filters", with zstd support. This might be relevant: https://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2022/msg00032.html
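
One way to check from Python whether a given netCDF build exposes the zstd filter is sketched below. It assumes the netCDF4-python package is installed and recent enough to provide `Dataset.has_zstd_filter()` (to my knowledge, version 1.6.0 or later); the function name `zstd_filter_available` is hypothetical. This probes the Python bindings' netCDF library, which is not necessarily the library FESOM itself is linked against.

```python
def zstd_filter_available():
    """Return True or False if netCDF4-python reports zstd support, None if unknown."""
    try:
        import netCDF4  # assumption: netCDF4-python is installed
    except ImportError:
        return None

    # A diskless dataset lives in memory only, so nothing is written to disk.
    dataset = netCDF4.Dataset("zstd_probe.nc", "w", diskless=True)
    try:
        # Older netCDF4-python releases do not have this method.
        check = getattr(dataset, "has_zstd_filter", None)
        return bool(check()) if callable(check) else None
    finally:
        dataset.close()


print("zstd filter available:", zstd_filter_available())
```

If this reports `False` on a build that should have zstd, a common cause is that the HDF5 plugin directory is not visible at run time, typically via the `HDF5_PLUGIN_PATH` environment variable.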

pgierz commented 4 months ago

@JanStreffing here you go:

For all outputs:

a270077 in 🌐 levante0 in fesom2 on  refactoring-compress [!?] via △ v3.20.2
❯ du -sc result_tmp
5779216 result_tmp
5779216 total

a270077 in 🌐 levante0 in fesom2 on  refactoring-compress [!?] via △ v3.20.2
❯ du -sc result_tmp_no_compress/
5835916 result_tmp_no_compress/
5835916 total

And individually:

a270077 in 🌐 levante0 in fesom2 on  refactoring-compress [!?] via △ v3.20.2 took 59s
❯ ls -ratl result_tmp
total 149160
-rw-r--r--  1 a270077 ab0246 98289786 May 13 16:12 fesom.mesh.diag.nc
drwxr-sr-x  3 a270077 ab0246     4096 May 13 16:19 fesom_raw_restart
drwxr-sr-x  3 a270077 ab0246     4096 May 13 16:19 fesom_bin_restart
drwxr-sr-x  6 a270077 ab0246     4096 May 13 16:19 .
-rw-r--r--  1 a270077 ab0246      102 May 13 16:19 fesom.clock
drwxr-sr-x  2 a270077 ab0246     4096 May 13 16:19 fesom.1958.oce.restart
drwxr-sr-x  2 a270077 ab0246     4096 May 13 16:19 fesom.1958.ice.restart
-rw-r--r--  1 a270077 ab0246  4333826 May 13 16:19 uice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   931056 May 13 16:19 ty_sur.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   930926 May 13 16:19 tx_sur.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   474180 May 13 16:19 MLD3.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   471431 May 13 16:19 MLD2.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   169114 May 13 16:19 m_ice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246  4338172 May 13 16:19 vice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246 13735260 May 13 16:19 temp.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   470354 May 13 16:19 sst.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   426784 May 13 16:19 sss.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   482735 May 13 16:19 ssh.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246 25812750 May 13 16:19 salt.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   173484 May 13 16:19 m_snow.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   468782 May 13 16:19 MLD1.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   493660 May 13 16:19 fw.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   489496 May 13 16:19 fh.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   164197 May 13 16:19 a_ice.fesom.1958.nc
drwxr-sr-x 22 a270077 ab0246     4096 May 13 16:22 ..

a270077 in 🌐 levante0 in fesom2 on  refactoring-compress [!?] via △ v3.20.2
❯ ls -ratl result_tmp_no_compress/
total 205848
drwxr-sr-x 22 a270077 ab0246     4096 May 13 16:22 ..
-rw-r--r--  1 a270077 ab0246 98289786 May 13 16:23 fesom.mesh.diag.nc
drwxr-sr-x  3 a270077 ab0246     4096 May 13 16:30 fesom_raw_restart
-rw-r--r--  1 a270077 ab0246      102 May 13 16:30 fesom.clock
drwxr-sr-x  3 a270077 ab0246     4096 May 13 16:30 fesom_bin_restart
drwxr-sr-x  2 a270077 ab0246     4096 May 13 16:30 fesom.1958.oce.restart
drwxr-sr-x  2 a270077 ab0246     4096 May 13 16:30 fesom.1958.ice.restart
drwxr-sr-x  6 a270077 ab0246     4096 May 13 16:30 .
-rw-r--r--  1 a270077 ab0246 15752685 May 13 16:30 vice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246 15752685 May 13 16:30 uice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   999617 May 13 16:30 ty_sur.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   999607 May 13 16:30 tx_sur.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246 24383368 May 13 16:30 temp.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528361 May 13 16:30 sst.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528357 May 13 16:30 sss.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528357 May 13 16:30 ssh.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246 48742784 May 13 16:30 salt.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528337 May 13 16:30 m_snow.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528349 May 13 16:30 MLD3.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528349 May 13 16:30 MLD2.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528349 May 13 16:30 MLD1.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528335 May 13 16:30 m_ice.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528349 May 13 16:30 fw.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528333 May 13 16:30 fh.fesom.1958.nc
-rw-r--r--  1 a270077 ab0246   528349 May 13 16:30 a_ice.fesom.1958.nc
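
The per-file effect is easier to see as percentages. The sketch below copies the byte sizes from the two listings above and computes the saving for each output variable; the helper name `saving_pct` is hypothetical. Note that `fesom.mesh.diag.nc` has the same size (98289786 bytes) in both directories, so it is left out here.

```python
# Sizes in bytes, copied from the two `ls -ratl` listings above:
# variable -> (compressed, uncompressed)
SIZES = {
    "a_ice": (164197, 528349),
    "fh": (489496, 528333),
    "fw": (493660, 528349),
    "m_ice": (169114, 528335),
    "m_snow": (173484, 528337),
    "MLD1": (468782, 528349),
    "MLD2": (471431, 528349),
    "MLD3": (474180, 528349),
    "salt": (25812750, 48742784),
    "ssh": (482735, 528357),
    "sss": (426784, 528357),
    "sst": (470354, 528361),
    "temp": (13735260, 24383368),
    "tx_sur": (930926, 999607),
    "ty_sur": (931056, 999617),
    "uice": (4333826, 15752685),
    "vice": (4338172, 15752685),
}


def saving_pct(compressed: int, uncompressed: int) -> float:
    """Percentage by which the compressed file is smaller."""
    return 100.0 * (1.0 - compressed / uncompressed)


# Print variables from largest to smallest saving.
for name, (small, large) in sorted(
    SIZES.items(), key=lambda item: saving_pct(*item[1]), reverse=True
):
    print(f"{name:8s} {large:>10d} -> {small:>10d}  {saving_pct(small, large):5.1f}% smaller")

total_small = sum(small for small, _ in SIZES.values())
total_large = sum(large for _, large in SIZES.values())
print(f"all variables: {saving_pct(total_small, total_large):.1f}% smaller")
```

In these listings the sea-ice fields shrink the most, while most 2D surface fields shrink by only around 10%.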

koldunovn commented 4 months ago

@pgierz So, this is just standard netCDF compression (zlib)? @JanStreffing level 9 is too aggressive. I think you can already get good results even with 1, and maybe 3 is optimal, but that needs some experimenting :)
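
The size-versus-time trade-off between deflate levels can be explored with Python's standard `zlib` module. This is only a rough sketch on a synthetic, smooth float32 field (an assumption, not FESOM output), and it ignores netCDF chunking and the shuffle filter, so the numbers will not match real model output. The function names `synthetic_field` and `sweep` are hypothetical.

```python
import math
import time
import zlib
from array import array


def synthetic_field(n: int = 200_000) -> bytes:
    """Smooth float32 signal as raw bytes (a stand-in for model output)."""
    values = array("f", (round(10.0 + 5.0 * math.sin(i / 500.0), 3) for i in range(n)))
    return values.tobytes()


def sweep(raw: bytes, levels=(1, 3, 9)) -> dict:
    """Compress `raw` at each level; return level -> (size in bytes, seconds)."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(raw, level)
        elapsed = time.perf_counter() - start
        # Deflate is lossless, so the round trip must reproduce the input.
        assert zlib.decompress(packed) == raw
        results[level] = (len(packed), elapsed)
    return results


raw = synthetic_field()
for level, (size, seconds) in sweep(raw).items():
    print(f"level {level}: {size / len(raw):6.1%} of original size, {seconds * 1000:7.1f} ms")
```

Running the same sweep on a real output file would be the more meaningful experiment.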

JanStreffing commented 4 months ago

Could you remove the restarts first?

JanStreffing commented 4 months ago

@koldunovn Agreed, I will test with level 1, which is what I use for OpenIFS.

pgierz commented 4 months ago

@JanStreffing Sure, without restart files:

du -sc result_tmp result_tmp_no_compress
149140  result_tmp
205828  result_tmp_no_compress
354968  total
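
To compare the two `du` results in this thread as percentages, here is a short sketch that parses the output and computes the reduction, once with the restart directories included and once without. The helper names `parse_du` and `reduction_pct` are hypothetical; the sizes are whatever block unit `du` reported (1 KiB blocks by default for GNU `du`, which is an assumption about this system), and the percentage does not depend on that unit.

```python
def parse_du(output: str) -> dict:
    """Map each path in `du -sc` output to its reported size."""
    sizes = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            sizes[fields[1].rstrip("/")] = int(fields[0])
    return sizes


def reduction_pct(sizes: dict) -> float:
    """Percentage saved by result_tmp relative to result_tmp_no_compress."""
    compressed = sizes["result_tmp"]
    uncompressed = sizes["result_tmp_no_compress"]
    return 100.0 * (uncompressed - compressed) / uncompressed


# Numbers copied from the du output earlier in the thread.
WITH_RESTARTS = """\
5779216 result_tmp
5835916 result_tmp_no_compress/
"""

WITHOUT_RESTARTS = """\
149140  result_tmp
205828  result_tmp_no_compress
354968  total
"""

print(f"with restarts:    {reduction_pct(parse_du(WITH_RESTARTS)):.2f}% smaller")
print(f"without restarts: {reduction_pct(parse_du(WITHOUT_RESTARTS)):.2f}% smaller")
```

With the restart directories included, the saving is under 1% because they dominate the total; without them it is about 27.5%.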

And once graphically:

[Image not preserved]