WCRP-CORDEX / simulation-status

CORDEX simulation status
https://wcrp-cordex.github.io/simulation-status

CORDEX-CMIP6 storage requirements estimation #40

Open jesusff opened 1 month ago

jesusff commented 1 month ago

To secure the archival of CMIP6-driven simulations in ESGF, we first need to estimate the volume of data that we will produce. An accurate estimate is difficult because compression rates depend on the variable and even on the model, since the effective model resolution affects compression.

A very rough estimate should, in principle, be easily extracted by combining the information in:

jesusff commented 1 month ago

This is a first estimate, only for CORE variables and with an assumed average 60% compression rate with respect to raw binary float precision. This value can be adjusted (e.g. by comparing with real processed output).

ic| simulation_count: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
                      domain
                      AFR-25               0           0       0       0       0       0       0
                      ANT-12               3           4       0       0       0       4       0
                      ARC-12               4           6       0       0       0       6       0
                      AUS-20i              4          34       0      34      20      34       4
                      CAM-12               1           8       0       8       8       0       8
                      CAS-12               0           2       0       2       0       2       2
                      EAS-25               0           5       0       5       5       5       5
                      EUR-12              19          59       9      58      18      55      21
                      MED-12              10           9       0       4       1       6       3
                      MED-25               1           1       0       0       0       0       0
                      MENA-25              1           1       0       1       1       1       1
                      NAM-12               1           8       0       4       4       8       1
                      NAM-25               1          15       0       0       0      15       0
                      SAM-25               1           1       0       0       0       1       0
                      SEA-12               1           3       0       0       0       3       0
                      SEA-25               3          13      10      13       3      13       3
                      WAS-25               2           5       0       1       4       2       4
ic| variable_count: priority   CORE  TIER1  TIER2
                    frequency
                    1hr          13     30      7
                    6hr           0     71     51
                    day          15    105     63
                    fx            2      0      7
                    mon          15    105     64
ic| variable_records_per_yr: priority     CORE   TIER1  TIER2
                             frequency
                             1hr        113880  262800  61320
                             6hr             0  103660  74460
                             day          5475   38325  22995
                             fx              0       0      0
                             mon           180    1260    768
ic| size_TB: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
             domain
             AFR-25             0.0         0.0     0.0     0.0     0.0     0.0     0.0
             ANT-12             6.8        12.0     0.0     0.0     0.0    19.1     0.0
             ARC-12            11.6        22.9     0.0     0.0     0.0    36.5     0.0
             AUS-20i            4.9        54.4     0.0    86.6    50.9    86.6    10.2
             CAM-12             4.5        47.1     0.0    74.9    74.9     0.0    74.9
             CAS-12             0.0         7.6     0.0    12.1     0.0    12.1    12.1
             EAS-25             0.0         7.7     0.0    12.3    12.3    12.3    12.3
             EUR-12            39.0       159.7    38.8   250.0    77.6   237.0    90.5
             MED-12            11.6        13.8     0.0     9.7     2.4    14.6     7.3
             MED-25             0.3         0.4     0.0     0.0     0.0     0.0     0.0
             MENA-25            1.3         1.7     0.0     2.7     2.7     2.7     2.7
             NAM-12             3.8        40.0     0.0    31.8    31.8    63.6     8.0
             NAM-25             0.9        18.7     0.0     0.0     0.0    29.8     0.0
             SAM-25             1.1         1.5     0.0     0.0     0.0     2.4     0.0
             SEA-12             2.4         9.5     0.0     0.0     0.0    15.2     0.0
             SEA-25             1.8        10.3    12.6    16.4     3.8    16.4     3.8
             WAS-25             2.4         7.8     0.0     2.5     9.9     5.0     9.9
ic| size_TB.T.sum(): domain
                     AFR-25       0.0
                     ANT-12      37.9
                     ARC-12      71.0
                     AUS-20i    293.6
                     CAM-12     276.3
                     CAS-12      43.9
                     EAS-25      56.9
                     EUR-12     892.6
                     MED-12      59.4
                     MED-25       0.7
                     MENA-25     13.8
                     NAM-12     179.0
                     NAM-25      49.4
                     SAM-25       5.0
                     SEA-12      27.1
                     SEA-25      65.1
                     WAS-25      37.5
                     dtype: float64
/!\ Considering just ['CORE'] vars.)
Total CORDEX-CMIP6 estimated size is: 2109 TB
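
For reference, the arithmetic behind an estimate of this kind can be sketched as follows. This is not the code in `storage_estimate.py`; it is a minimal sketch assuming 4-byte floats, a compressed size equal to 60% of the raw size, decimal terabytes, a 424 x 412 grid for EUR-12 and 86 simulated years per scenario run. The grid size and the run length are assumptions that do not appear in this thread, and `estimate_size_tb` is a hypothetical helper name. The record count per year is the CORE column of `variable_records_per_yr`, and the number of simulations comes from `simulation_count`.

```python
# Sketch of the per-domain storage arithmetic (hypothetical helper, not storage_estimate.py).

BYTES_PER_VALUE = 4         # assumption: raw single-precision floats
COMPRESSED_FRACTION = 0.6   # assumption: compressed size is 60% of the raw size


def estimate_size_tb(records_per_year, grid_points, years, simulations,
                     compressed_fraction=COMPRESSED_FRACTION):
    """Return the estimated archive size in decimal terabytes (1 TB = 1e12 bytes)."""
    raw_bytes = records_per_year * grid_points * BYTES_PER_VALUE * years * simulations
    return raw_bytes * compressed_fraction / 1e12


# CORE records per year from the variable_records_per_yr table above:
# 1hr + day + mon (fx contributes 0 records per year in that table)
core_records_per_year = 113880 + 5475 + 180

# Assumed values, NOT taken from this thread
eur12_grid_points = 424 * 412   # assumed EUR-12 grid size
scenario_years = 86             # assumed length of a scenario run

eur12_ssp126_tb = estimate_size_tb(core_records_per_year, eur12_grid_points,
                                   scenario_years, simulations=58)
print(f"EUR-12 ssp126, CORE only: {eur12_ssp126_tb:.1f} TB")
```

With these assumed inputs the sketch gives about 250 TB, which is consistent with the EUR-12 ssp126 entry in the table above. Since the compressed fraction enters as a plain multiplier, adjusting it rescales every entry linearly.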

gnikulin commented 1 month ago

I think such an approach should work as a basic estimate. RCMs with non-rotated grids usually provide larger domains, but some RCM groups may provide only a subset of the Tier1 and Tier2 variables.

Are all simulations for EUR-12 only 892.6 TB?

jesusff commented 1 month ago

Yes, but this covers only CORE variables and only the simulations planned so far in this repo. Here is the summary including all tiers:

ic| simulation_count: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
                      domain                                                                    
                      AFR-25               0           0       0       0       0       0       0
                      ANT-12               3           4       0       0       0       4       0
                      ARC-12               4           6       0       0       0       6       0
                      AUS-20i              4          34       0      34      20      34       4
                      CAM-12               1           8       0       8       8       0       8
                      CAS-12               0           2       0       2       0       2       2
                      EAS-25               0           5       0       5       5       5       5
                      EUR-12              19          59       9      58      18      55      21
                      MED-12              10           9       0       4       1       6       3
                      MED-25               1           1       0       0       0       0       0
                      MENA-25              1           1       0       1       1       1       1
                      NAM-12               1           8       0       4       4       8       1
                      NAM-25               1          15       0       0       0      15       0
                      SAM-25               1           1       0       0       0       1       0
                      SEA-12               1           3       0       0       0       3       0
                      SEA-25               3          13      10      13       3      13       3
                      WAS-25               2           5       0       1       4       2       4
ic| variable_count: priority   CORE  TIER1  TIER2
                    frequency                    
                    1hr          13     30      7
                    6hr           0     71     51
                    day          15    105     63
                    fx            2      0      7
                    mon          15    105     64
ic| variable_records_per_yr: priority     CORE   TIER1  TIER2
                             frequency                       
                             1hr        113880  262800  61320
                             6hr             0  103660  74460
                             day          5475   38325  22995
                             fx              0       0      0
                             mon           180    1260    768
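
The `variable_records_per_yr` table above is consistent with multiplying each `variable_count` entry by the number of time steps per year at that frequency. A minimal sketch, assuming a 365-day calendar (8760 hourly, 1460 six-hourly, 365 daily and 12 monthly steps) and treating time-invariant `fx` fields as contributing no yearly records; the dictionary names are illustrative only:

```python
# Time steps per year for each output frequency (assumption: 365-day calendar)
STEPS_PER_YEAR = {"1hr": 24 * 365, "6hr": 4 * 365, "day": 365, "fx": 0, "mon": 12}

# variable_count table from the output above: frequency -> (CORE, TIER1, TIER2)
VARIABLE_COUNT = {
    "1hr": (13, 30, 7),
    "6hr": (0, 71, 51),
    "day": (15, 105, 63),
    "fx": (2, 0, 7),
    "mon": (15, 105, 64),
}

# Records per year = number of variables x time steps per year
records_per_year = {
    freq: tuple(n * STEPS_PER_YEAR[freq] for n in counts)
    for freq, counts in VARIABLE_COUNT.items()
}

for freq, (core, tier1, tier2) in records_per_year.items():
    print(f"{freq:>3}  {core:>7} {tier1:>7} {tier2:>7}")

# Total records per year when all three priorities are included
total_all_tiers = sum(sum(row) for row in records_per_year.values())
print("all tiers:", total_all_tiers)
```

The time-invariant `fx` fields are stored once per simulation, so leaving them out of the yearly record count has a negligible effect on the total.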

/!\ Considering just ['CORE', 'TIER1', 'TIER2'] vars.)

ic| size_TB: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
             domain                                                                    
             AFR-25             0.0         0.0     0.0     0.0     0.0     0.0     0.0
             ANT-12            39.2        68.9     0.0     0.0     0.0   109.7     0.0
             ARC-12            66.6       131.5     0.0     0.0     0.0   209.4     0.0
             AUS-20i           27.8       311.6     0.0   496.2   291.9   496.2    58.4
             CAM-12            25.6       269.7     0.0   429.5   429.5     0.0   429.5
             CAS-12             0.0        43.5     0.0    69.2     0.0    69.2    69.2
             EAS-25             0.0        44.1     0.0    70.3    70.3    70.3    70.3
             EUR-12           223.8       915.1   222.3  1432.7   444.6  1358.6   518.8
             MED-12            66.6        78.9     0.0    55.9    14.0    83.8    41.9
             MED-25             1.7         2.2     0.0     0.0     0.0     0.0     0.0
             MENA-25            7.4         9.7     0.0    15.5    15.5    15.5    15.5
             NAM-12            21.7       229.0     0.0   182.4   182.4   364.7    45.6
             NAM-25             5.4       107.3     0.0     0.0     0.0   171.0     0.0
             SAM-25             6.6         8.7     0.0     0.0     0.0    13.8     0.0
             SEA-12            13.8        54.6     0.0     0.0     0.0    86.9     0.0
             SEA-25            10.4        59.1    72.4    94.2    21.7    94.2    21.7
             WAS-25            13.5        44.6     0.0    14.2    56.8    28.4    56.8
ic| size_TB.T.sum(): domain
                     AFR-25        0.0
                     ANT-12      217.8
                     ARC-12      407.5
                     AUS-20i    1682.1
                     CAM-12     1583.8
                     CAS-12      251.1
                     EAS-25      325.3
                     EUR-12     5115.9
                     MED-12      341.1
                     MED-25        3.9
                     MENA-25      79.1
                     NAM-12     1025.8
                     NAM-25      283.7
                     SAM-25       29.1
                     SEA-12      155.3
                     SEA-25      373.7
                     WAS-25      214.3
                     dtype: float64

Total CORDEX-CMIP6 estimated size is: 12090 TB

gnikulin commented 1 month ago

Now it looks reasonable :-) as we need an estimate for all CORE, Tier1 and Tier2 variables.

gnikulin commented 1 month ago

Although I don't expect many variables from Tier2.

jesusff commented 1 month ago

Ok, this is the remaining one 😉 : CORE + Tier1

ic| simulation_count: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
                      domain                                                                    
                      AFR-25               0           0       0       0       0       0       0
                      ANT-12               3           4       0       0       0       4       0
                      ARC-12               4           6       0       0       0       6       0
                      AUS-20i              4          34       0      34      20      34       4
                      CAM-12               1           8       0       8       8       0       8
                      CAS-12               0           2       0       2       0       2       2
                      EAS-25               0           5       0       5       5       5       5
                      EUR-12              19          59       9      58      18      55      21
                      MED-12              10           9       0       4       1       6       3
                      MED-25               1           1       0       0       0       0       0
                      MENA-25              1           1       0       1       1       1       1
                      NAM-12               1           8       0       4       4       8       1
                      NAM-25               1          15       0       0       0      15       0
                      SAM-25               1           1       0       0       0       1       0
                      SEA-12               1           3       0       0       0       3       0
                      SEA-25               3          13      10      13       3      13       3
                      WAS-25               2           5       0       1       4       2       4
ic| variable_count: priority   CORE  TIER1  TIER2
                    frequency                    
                    1hr          13     30      7
                    6hr           0     71     51
                    day          15    105     63
                    fx            2      0      7
                    mon          15    105     64
ic| variable_records_per_yr: priority     CORE   TIER1  TIER2
                             frequency                       
                             1hr        113880  262800  61320
                             6hr             0  103660  74460
                             day          5475   38325  22995
                             fx              0       0      0
                             mon           180    1260    768

/!\ Considering just ['CORE', 'TIER1'] vars.

ic| size_TB: experiment  evaluation  historical  ssp119  ssp126  ssp245  ssp370  ssp585
             domain                                                                    
             AFR-25             0.0         0.0     0.0     0.0     0.0     0.0     0.0
             ANT-12            30.1        52.9     0.0     0.0     0.0    84.2     0.0
             ARC-12            51.1       100.9     0.0     0.0     0.0   160.7     0.0
             AUS-20i           21.3       239.0     0.0   380.6   223.9   380.6    44.8
             CAM-12            19.6       206.9     0.0   329.5   329.5     0.0   329.5
             CAS-12             0.0        33.3     0.0    53.1     0.0    53.1    53.1
             EAS-25             0.0        33.9     0.0    53.9    53.9    53.9    53.9
             EUR-12           171.7       702.0   170.6  1099.1   341.1  1042.3   398.0
             MED-12            51.1        60.6     0.0    42.9    10.7    64.3    32.1
             MED-25             1.3         1.7     0.0     0.0     0.0     0.0     0.0
             MENA-25            5.7         7.5     0.0    11.9    11.9    11.9    11.9
             NAM-12            16.7       175.7     0.0   139.9   139.9   279.8    35.0
             NAM-25             4.2        82.4     0.0     0.0     0.0   131.2     0.0
             SAM-25             5.0         6.6     0.0     0.0     0.0    10.6     0.0
             SEA-12            10.6        41.9     0.0     0.0     0.0    66.7     0.0
             SEA-25             7.9        45.4    55.6    72.2    16.7    72.2    16.7
             WAS-25            10.4        34.2     0.0    10.9    43.5    21.8    43.5
ic| size_TB.T.sum(): domain
                     AFR-25        0.0
                     ANT-12      167.2
                     ARC-12      312.7
                     AUS-20i    1290.2
                     CAM-12     1215.0
                     CAS-12      192.6
                     EAS-25      249.5
                     EUR-12     3924.8
                     MED-12      261.7
                     MED-25        3.0
                     MENA-25      60.8
                     NAM-12      787.0
                     NAM-25      217.8
                     SAM-25       22.2
                     SEA-12      119.2
                     SEA-25      286.7
                     WAS-25      164.3
                     dtype: float64
Total CORDEX-CMIP6 estimated size is: 9275 TB

jesusff commented 1 month ago

Just for the record: some reasonable approximations are in place (see https://github.com/WCRP-CORDEX/simulation-status/blob/bc8c5a3d6c6ae0af006ccde7cb422aa07749bf4a/storage_estimate.py#L30-L34) until all domains are defined in https://github.com/WCRP-CORDEX/domain-tables/blob/main/CORDEX-CMIP5_rotated_grids.csv

gnikulin commented 1 month ago

I think 10 PB sounds like a reasonable estimate. In any case, it's impossible to provide an exact size :-)
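
To put the three totals posted in this thread side by side, the snippet below converts them to petabytes and shows how much each priority level adds. It uses only the totals quoted above (2109, 9275 and 12090 TB) and assumes decimal units (1 PB = 1000 TB); the variable names are illustrative.

```python
# Totals quoted in this thread, in TB
totals_tb = {
    "CORE": 2109,
    "CORE + TIER1": 9275,
    "CORE + TIER1 + TIER2": 12090,
}

TB_PER_PB = 1000  # assumption: decimal units

for label, tb in totals_tb.items():
    print(f"{label:<22} {tb:>6} TB = {tb / TB_PER_PB:.1f} PB")

# Incremental contribution of each priority level
tier1_tb = totals_tb["CORE + TIER1"] - totals_tb["CORE"]
tier2_tb = totals_tb["CORE + TIER1 + TIER2"] - totals_tb["CORE + TIER1"]
print(f"TIER1 adds {tier1_tb} TB, TIER2 adds {tier2_tb} TB")
```

The CORE + TIER1 total (about 9.3 PB) is the one closest to the 10 PB figure, while adding every TIER2 variable would bring the estimate to about 12.1 PB.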