jesusff opened this issue 1 month ago (status: Open)
This is a first estimate, only for CORE variables, assuming an average compression rate of 60% with respect to the raw binary float size. This value can be adjusted (e.g. by comparing with real processed output).
ic| simulation_count: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0 0 0 0 0 0 0
ANT-12 3 4 0 0 0 4 0
ARC-12 4 6 0 0 0 6 0
AUS-20i 4 34 0 34 20 34 4
CAM-12 1 8 0 8 8 0 8
CAS-12 0 2 0 2 0 2 2
EAS-25 0 5 0 5 5 5 5
EUR-12 19 59 9 58 18 55 21
MED-12 10 9 0 4 1 6 3
MED-25 1 1 0 0 0 0 0
MENA-25 1 1 0 1 1 1 1
NAM-12 1 8 0 4 4 8 1
NAM-25 1 15 0 0 0 15 0
SAM-25 1 1 0 0 0 1 0
SEA-12 1 3 0 0 0 3 0
SEA-25 3 13 10 13 3 13 3
WAS-25 2 5 0 1 4 2 4
ic| variable_count: priority CORE TIER1 TIER2
frequency
1hr 13 30 7
6hr 0 71 51
day 15 105 63
fx 2 0 7
mon 15 105 64
ic| variable_records_per_yr: priority CORE TIER1 TIER2
frequency
1hr 113880 262800 61320
6hr 0 103660 74460
day 5475 38325 22995
fx 0 0 0
mon 180 1260 768
ic| size_TB: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ANT-12 6.8 12.0 0.0 0.0 0.0 19.1 0.0
ARC-12 11.6 22.9 0.0 0.0 0.0 36.5 0.0
AUS-20i 4.9 54.4 0.0 86.6 50.9 86.6 10.2
CAM-12 4.5 47.1 0.0 74.9 74.9 0.0 74.9
CAS-12 0.0 7.6 0.0 12.1 0.0 12.1 12.1
EAS-25 0.0 7.7 0.0 12.3 12.3 12.3 12.3
EUR-12 39.0 159.7 38.8 250.0 77.6 237.0 90.5
MED-12 11.6 13.8 0.0 9.7 2.4 14.6 7.3
MED-25 0.3 0.4 0.0 0.0 0.0 0.0 0.0
MENA-25 1.3 1.7 0.0 2.7 2.7 2.7 2.7
NAM-12 3.8 40.0 0.0 31.8 31.8 63.6 8.0
NAM-25 0.9 18.7 0.0 0.0 0.0 29.8 0.0
SAM-25 1.1 1.5 0.0 0.0 0.0 2.4 0.0
SEA-12 2.4 9.5 0.0 0.0 0.0 15.2 0.0
SEA-25 1.8 10.3 12.6 16.4 3.8 16.4 3.8
WAS-25 2.4 7.8 0.0 2.5 9.9 5.0 9.9
ic| size_TB.T.sum(): domain
AFR-25 0.0
ANT-12 37.9
ARC-12 71.0
AUS-20i 293.6
CAM-12 276.3
CAS-12 43.9
EAS-25 56.9
EUR-12 892.6
MED-12 59.4
MED-25 0.7
MENA-25 13.8
NAM-12 179.0
NAM-25 49.4
SAM-25 5.0
SEA-12 27.1
SEA-25 65.1
WAS-25 37.5
dtype: float64
/!\ Considering just ['CORE'] vars.)
Total CORDEX-CMIP6 estimated size is: 2109 TB
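The `variable_records_per_yr` table above appears to be the `variable_count` table multiplied by the number of time steps per year at each frequency. A minimal sketch of that arithmetic, assuming a 365-day calendar and that `fx` (time-invariant) fields add no per-year records; the names `STEPS_PER_YEAR` and `records_per_year` are illustrative and not taken from the actual script:

```python
# Time steps per year for each output frequency (assumption: 365-day calendar).
# "fx" fields are time-invariant, so they contribute no per-year records.
STEPS_PER_YEAR = {"1hr": 24 * 365, "6hr": 4 * 365, "day": 365, "fx": 0, "mon": 12}

# Variable counts copied from the variable_count table in the output above.
VARIABLE_COUNT = {
    "1hr": {"CORE": 13, "TIER1": 30, "TIER2": 7},
    "6hr": {"CORE": 0, "TIER1": 71, "TIER2": 51},
    "day": {"CORE": 15, "TIER1": 105, "TIER2": 63},
    "fx": {"CORE": 2, "TIER1": 0, "TIER2": 7},
    "mon": {"CORE": 15, "TIER1": 105, "TIER2": 64},
}


def records_per_year(variable_count, steps_per_year):
    """Return 2D-field records written per simulated year, by frequency and tier."""
    return {
        freq: {tier: n_vars * steps_per_year[freq] for tier, n_vars in tiers.items()}
        for freq, tiers in variable_count.items()
    }


records = records_per_year(VARIABLE_COUNT, STEPS_PER_YEAR)
for freq, tiers in records.items():
    print(freq, tiers)
```

With these inputs the sketch gives the same numbers as the table, e.g. 13 hourly CORE variables times 8760 hours is 113880 records per year.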
I think such an approach should work as a basic estimate. RCMs with non-rotated grids usually provide larger domains, but some RCM groups may provide only a subset of the Tier1 and Tier2 variables.
Is the total for all EUR-12 simulations only 892.6 TB?
Yes, but this covers only CORE variables and only those simulations planned so far in this repo. Here is the summary including all tiers:
ic| simulation_count: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0 0 0 0 0 0 0
ANT-12 3 4 0 0 0 4 0
ARC-12 4 6 0 0 0 6 0
AUS-20i 4 34 0 34 20 34 4
CAM-12 1 8 0 8 8 0 8
CAS-12 0 2 0 2 0 2 2
EAS-25 0 5 0 5 5 5 5
EUR-12 19 59 9 58 18 55 21
MED-12 10 9 0 4 1 6 3
MED-25 1 1 0 0 0 0 0
MENA-25 1 1 0 1 1 1 1
NAM-12 1 8 0 4 4 8 1
NAM-25 1 15 0 0 0 15 0
SAM-25 1 1 0 0 0 1 0
SEA-12 1 3 0 0 0 3 0
SEA-25 3 13 10 13 3 13 3
WAS-25 2 5 0 1 4 2 4
ic| variable_count: priority CORE TIER1 TIER2
frequency
1hr 13 30 7
6hr 0 71 51
day 15 105 63
fx 2 0 7
mon 15 105 64
ic| variable_records_per_yr: priority CORE TIER1 TIER2
frequency
1hr 113880 262800 61320
6hr 0 103660 74460
day 5475 38325 22995
fx 0 0 0
mon 180 1260 768
/!\ Considering just ['CORE', 'TIER1', 'TIER2'] vars.)
ic| size_TB: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ANT-12 39.2 68.9 0.0 0.0 0.0 109.7 0.0
ARC-12 66.6 131.5 0.0 0.0 0.0 209.4 0.0
AUS-20i 27.8 311.6 0.0 496.2 291.9 496.2 58.4
CAM-12 25.6 269.7 0.0 429.5 429.5 0.0 429.5
CAS-12 0.0 43.5 0.0 69.2 0.0 69.2 69.2
EAS-25 0.0 44.1 0.0 70.3 70.3 70.3 70.3
EUR-12 223.8 915.1 222.3 1432.7 444.6 1358.6 518.8
MED-12 66.6 78.9 0.0 55.9 14.0 83.8 41.9
MED-25 1.7 2.2 0.0 0.0 0.0 0.0 0.0
MENA-25 7.4 9.7 0.0 15.5 15.5 15.5 15.5
NAM-12 21.7 229.0 0.0 182.4 182.4 364.7 45.6
NAM-25 5.4 107.3 0.0 0.0 0.0 171.0 0.0
SAM-25 6.6 8.7 0.0 0.0 0.0 13.8 0.0
SEA-12 13.8 54.6 0.0 0.0 0.0 86.9 0.0
SEA-25 10.4 59.1 72.4 94.2 21.7 94.2 21.7
WAS-25 13.5 44.6 0.0 14.2 56.8 28.4 56.8
ic| size_TB.T.sum(): domain
AFR-25 0.0
ANT-12 217.8
ARC-12 407.5
AUS-20i 1682.1
CAM-12 1583.8
CAS-12 251.1
EAS-25 325.3
EUR-12 5115.9
MED-12 341.1
MED-25 3.9
MENA-25 79.1
NAM-12 1025.8
NAM-25 283.7
SAM-25 29.1
SEA-12 155.3
SEA-25 373.7
WAS-25 214.3
dtype: float64
Total CORDEX-CMIP6 estimated size is: 12090 TB
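For readers who want to check a single cell of the `size_TB` tables, here is a rough sketch of how one entry could be reproduced. It is not the actual `storage_estimate.py`; it assumes 4-byte floats, that "60% compression" means the stored size is 0.6 times the raw size, decimal terabytes (1e12 bytes), a 424 x 412 grid for EUR-12, and 86-year scenario runs. The grid size and run length are assumptions, picked because they match the posted EUR-12 ssp126 values.

```python
BYTES_PER_VALUE = 4       # assumption: single-precision floats
COMPRESSION_RATIO = 0.6   # assumption: stored size = 0.6 * raw size

# Records per simulated year, summed over frequencies from the
# variable_records_per_yr table posted in this thread.
RECORDS_PER_YEAR = {
    "CORE": 113880 + 0 + 5475 + 180,
    "TIER1": 262800 + 103660 + 38325 + 1260,
    "TIER2": 61320 + 74460 + 22995 + 768,
}


def size_tb(n_sims, n_years, grid_points, tiers):
    """Estimated archive size in TB (1e12 bytes) for a set of simulations."""
    records = sum(RECORDS_PER_YEAR[tier] for tier in tiers)
    raw_bytes = n_sims * n_years * records * grid_points * BYTES_PER_VALUE
    return raw_bytes * COMPRESSION_RATIO / 1e12


EUR12_GRID_POINTS = 424 * 412  # assumption: not stated in this thread
SSP_YEARS = 86                 # assumption: scenario runs span 86 years

# EUR-12 has 58 ssp126 simulations in the simulation_count table.
core_only = size_tb(58, SSP_YEARS, EUR12_GRID_POINTS, ["CORE"])
all_tiers = size_tb(58, SSP_YEARS, EUR12_GRID_POINTS, ["CORE", "TIER1", "TIER2"])
print(round(core_only, 1), round(all_tiers, 1))
```

Under these assumptions the two values round to 250.0 and 1432.7, matching the EUR-12 ssp126 entries in the CORE-only and all-tier tables. Other domains would need their own grid sizes.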
Now it looks reasonable :-) as we need an estimate for all CORE, Tier1 and 2 variables.
Although I don't expect many variables from Tier2.
Ok, this is the remaining one 😉 : CORE + Tier1
ic| simulation_count: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0 0 0 0 0 0 0
ANT-12 3 4 0 0 0 4 0
ARC-12 4 6 0 0 0 6 0
AUS-20i 4 34 0 34 20 34 4
CAM-12 1 8 0 8 8 0 8
CAS-12 0 2 0 2 0 2 2
EAS-25 0 5 0 5 5 5 5
EUR-12 19 59 9 58 18 55 21
MED-12 10 9 0 4 1 6 3
MED-25 1 1 0 0 0 0 0
MENA-25 1 1 0 1 1 1 1
NAM-12 1 8 0 4 4 8 1
NAM-25 1 15 0 0 0 15 0
SAM-25 1 1 0 0 0 1 0
SEA-12 1 3 0 0 0 3 0
SEA-25 3 13 10 13 3 13 3
WAS-25 2 5 0 1 4 2 4
ic| variable_count: priority CORE TIER1 TIER2
frequency
1hr 13 30 7
6hr 0 71 51
day 15 105 63
fx 2 0 7
mon 15 105 64
ic| variable_records_per_yr: priority CORE TIER1 TIER2
frequency
1hr 113880 262800 61320
6hr 0 103660 74460
day 5475 38325 22995
fx 0 0 0
mon 180 1260 768
/!\ Considering just ['CORE', 'TIER1'] vars.
ic| size_TB: experiment evaluation historical ssp119 ssp126 ssp245 ssp370 ssp585
domain
AFR-25 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ANT-12 30.1 52.9 0.0 0.0 0.0 84.2 0.0
ARC-12 51.1 100.9 0.0 0.0 0.0 160.7 0.0
AUS-20i 21.3 239.0 0.0 380.6 223.9 380.6 44.8
CAM-12 19.6 206.9 0.0 329.5 329.5 0.0 329.5
CAS-12 0.0 33.3 0.0 53.1 0.0 53.1 53.1
EAS-25 0.0 33.9 0.0 53.9 53.9 53.9 53.9
EUR-12 171.7 702.0 170.6 1099.1 341.1 1042.3 398.0
MED-12 51.1 60.6 0.0 42.9 10.7 64.3 32.1
MED-25 1.3 1.7 0.0 0.0 0.0 0.0 0.0
MENA-25 5.7 7.5 0.0 11.9 11.9 11.9 11.9
NAM-12 16.7 175.7 0.0 139.9 139.9 279.8 35.0
NAM-25 4.2 82.4 0.0 0.0 0.0 131.2 0.0
SAM-25 5.0 6.6 0.0 0.0 0.0 10.6 0.0
SEA-12 10.6 41.9 0.0 0.0 0.0 66.7 0.0
SEA-25 7.9 45.4 55.6 72.2 16.7 72.2 16.7
WAS-25 10.4 34.2 0.0 10.9 43.5 21.8 43.5
ic| size_TB.T.sum(): domain
AFR-25 0.0
ANT-12 167.2
ARC-12 312.7
AUS-20i 1290.2
CAM-12 1215.0
CAS-12 192.6
EAS-25 249.5
EUR-12 3924.8
MED-12 261.7
MED-25 3.0
MENA-25 60.8
NAM-12 787.0
NAM-25 217.8
SAM-25 22.2
SEA-12 119.2
SEA-25 286.7
WAS-25 164.3
dtype: float64
Total CORDEX-CMIP6 estimated size is: 9275 TB
Just for the record: some reasonable approximations are in place until all domains are defined in https://github.com/WCRP-CORDEX/domain-tables/blob/main/CORDEX-CMIP5_rotated_grids.csv (see https://github.com/WCRP-CORDEX/simulation-status/blob/bc8c5a3d6c6ae0af006ccde7cb422aa07749bf4a/storage_estimate.py#L30-L34).
I think 10 PB sounds like a reasonable estimate. In any case, it's impossible to provide an exact size :-)
In order to secure the archival of CMIP6-driven simulations in ESGF, we first need to estimate the volume of data that we will be producing. An accurate estimate is difficult because compression rates depend on the variable and even on the model, since the effective model resolution affects compression.
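Since the compression rate is the most uncertain input, it can help to see how the totals reported in this thread would move with it. The sketch below assumes a single uniform compression ratio is applied to the whole archive, so the total scales linearly with it, and that the posted totals used a ratio of 0.6. `rescale_total` is an illustrative helper, not part of any existing script.

```python
def rescale_total(total_tb, assumed_ratio, new_ratio):
    """Rescale a total computed with one uniform compression ratio to another."""
    return total_tb * new_ratio / assumed_ratio


# Totals (TB) posted in this thread, all computed with an assumed ratio of 0.6.
TOTALS_TB = {"CORE": 2109, "CORE+TIER1": 9275, "CORE+TIER1+TIER2": 12090}

for ratio in (0.4, 0.5, 0.6, 0.7):
    scaled = {name: round(rescale_total(tb, 0.6, ratio)) for name, tb in TOTALS_TB.items()}
    print(ratio, scaled)
```

Once real processed output is available, the measured ratio can be plugged in the same way.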
A very rough estimate should, in principle, be easily extracted by combining the information in: