rabernat opened this issue 4 years ago
OK, that sounds like good advice. I'm assuming that removal of these variables is also something that can be done retroactively. Be sure to let me know if this is not the case; otherwise I will go ahead with the same procedure we've been using for now (since we have to go back anyway to fix the metadata for our other Zarr stores).
> I'm assuming that removal of these variables is also something that can be done retroactively.
Should be as simple as deleting the directories for those variables and re-consolidating metadata.
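In a Zarr v2 directory store, each variable lives in its own subdirectory, and the consolidated `.zmetadata` file is just a JSON aggregation of the per-key metadata files, so deleting a variable's directory and re-consolidating is exactly that simple. Here is a stdlib-only sketch of the idea, with a toy `consolidate` helper standing in for the real `zarr.consolidate_metadata` call (variable names are illustrative):

```python
import json
import os
import shutil
import tempfile

def consolidate(store):
    """Toy stand-in for zarr.consolidate_metadata: gather every
    .zarray/.zattrs/.zgroup file under the store into one .zmetadata JSON."""
    meta = {}
    for root, _dirs, files in os.walk(store):
        for name in files:
            if name in (".zarray", ".zattrs", ".zgroup"):
                key = os.path.relpath(os.path.join(root, name), store)
                with open(os.path.join(root, name)) as f:
                    meta[key.replace(os.sep, "/")] = json.load(f)
    with open(os.path.join(store, ".zmetadata"), "w") as f:
        json.dump({"metadata": meta, "zarr_consolidated_format": 1}, f)
    return meta

store = tempfile.mkdtemp()
# Mock two variables, each with its own array-metadata file.
for var in ("VVEL", "TAREA"):
    os.makedirs(os.path.join(store, var))
    with open(os.path.join(store, var, ".zarray"), "w") as f:
        json.dump({"shape": [384, 320]}, f)
consolidate(store)

# Drop the duplicated grid variable, then re-consolidate.
shutil.rmtree(os.path.join(store, "TAREA"))
meta = consolidate(store)
print(sorted(meta))  # ['VVEL/.zarray']
```

In practice you would call `zarr.consolidate_metadata(store)` and reopen with `xr.open_zarr(..., consolidated=True)`; the toy helper only illustrates why deleting the directory plus re-consolidating is sufficient.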
You're currently wasting a non-negligible amount of space by storing all of these duplicate TAREA etc. variables in each of the ocean datasets.
It turns out that these variables consume ~20 MB per zarr store.
An even better option is to just drop all of the non-dimension coordinates before writing the zarr data, and then saving them to a standalone grid dataset, which can be brought in as needed for geometric calculations.
👍. Would this grid dataset include static variables only? It appears that the LLC4320_grid includes time as well. By static variables, I am referring to scalars and time-independent variables:
>>> print(grid_vars)
['hflux_factor', 'nsurface_u', 'DXU', 'latent_heat_vapor', 'salt_to_Svppt', 'DYT', 'TLONG', 'DYU', 'HTE', 'rho_air', 'HU', 'ULONG', 'DXT', 'rho_sw', 'HUS', 'HUW', 'moc_components', 'TAREA', 'ULAT', 'REGION_MASK', 'grav', 'transport_regions', 'KMU', 'sound', 'omega', 'ANGLET', 'HT', 'UAREA', 'heat_to_PW', 'days_in_norm_year', 'salt_to_ppt', 'dzw', 'sea_ice_salinity', 'cp_air', 'salt_to_mmday', 'dz', 'fwflux_factor', 'TLAT', 'HTN', 'mass_to_Sv', 'radius', 'latent_heat_fusion', 'T0_Kelvin', 'salinity_factor', 'sflux_factor', 'transport_components', 'KMT', 'rho_fw', 'cp_sw', 'ocn_ref_salinity', 'vonkar', 'nsurface_t', 'ANGLE', 'stefan_boltzmann', 'ppt_to_salt', 'momentum_factor']
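A list like this can be derived mechanically: a "static" grid variable is one whose dimensions do not include time and that is not itself a dimension coordinate. A sketch of that selection over a hypothetical name-to-dims mapping (the same filter would apply to `{v: ds[v].dims for v in ds.variables}` in xarray; the entries below are illustrative):

```python
# Hypothetical name -> dims mapping, standing in for an xarray Dataset's variables.
dims = {
    "VVEL": ("member_id", "time", "z_t", "nlat", "nlon"),  # time-dependent data
    "time_bound": ("time", "d2"),
    "TAREA": ("nlat", "nlon"),   # time-independent 2D grid field
    "grav": (),                  # scalar constant
    "z_t": ("z_t",),             # dimension coordinate: stays with the data
    "time": ("time",),
}

# Static grid variables: no time dependence, and not a dimension coordinate.
grid_vars = [v for v, d in dims.items() if "time" not in d and v not in d]
print(sorted(grid_vars))  # ['TAREA', 'grav']
```

With xarray, one would then split the dataset along the same lines, e.g. `ds.drop_vars(grid_vars)` for the data store and `ds[grid_vars]` written out as the standalone grid store.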
Removing these grid variables produces a clean xarray dataset:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, member_id: 40, moc_z: 61, nlat: 384, nlon: 320, time: 1872, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* z_t (z_t) float32 500.0 1500.0 2500.0 ... 512502.8 537500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
* moc_z (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* time (time) object 1850-02-01 00:00:00 ... 2006-01-01 00:00:00
* member_id (member_id) int64 1 2 3 4 5 6 7 ... 34 35 101 102 103 104 105
Dimensions without coordinates: d2, nlat, nlon
Data variables:
time_bound (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
VVEL (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
Attributes:
nsteps_total: 750
nco_openmp_thread_number: 1
cell_methods: cell_methods = time: mean ==> the variable val...
tavg_sum: 2592000.0
tavg_sum_qflux: 2592000.0
source: CCSM POP2, the CCSM Ocean Component
contents: Diagnostic and Prognostic Variables
@rabernat wrote:
> The chunk choice on WVEL (and presumably other 3D variables) is, in my view, less than ideal... First, the chunks are on the large side (235.93 MB). Second, each vertical level is in a separate chunk, while 20 years of time are stored contiguously.
FYI, for the 3D atmospheric data (at least monthly Q), each chunk contains all ensemble members, 12 months of data, and 2 levels:
<xarray.DataArray 'Q' (member_id: 40, time: 1032, lev: 30, lat: 192, lon: 288)> dask.array<zarr, shape=(40, 1032, 30, 192, 288), dtype=float32, chunksize=(40, 12, 2, 192, 288), chunktype=numpy.ndarray>
If we were to put all 30 levels in one chunk then we'd need to divide something else by a factor of ~15. Perhaps the x-y dimension should be 4x4 chunks instead of global?
I know Anderson was striving for 100 MB chunks, but I haven't checked the size of these. The ocean data have, I think, 60 levels instead of 30, so the problem is even worse.
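As a sanity check on the numbers above: a float32 chunk's footprint is just the product of its chunk dimensions times 4 bytes. For the shapes quoted in this thread (decimal MB, as xarray reports them):

```python
from math import prod

def chunk_mb(chunks, itemsize=4):          # float32 -> 4 bytes per element
    return prod(chunks) * itemsize / 1e6   # decimal megabytes

# Atmospheric Q: all 40 members, 12 months, 2 of 30 levels, global x-y
print(round(chunk_mb((40, 12, 2, 192, 288)), 1))   # 212.3

# Ocean VVEL after re-chunking: 1 member, 6 months, all 60 levels, global x-y
print(round(chunk_mb((1, 6, 60, 384, 320)), 1))    # 176.9

# All 30 atmospheric levels in one chunk, with the x-y plane split 4x4
print(round(chunk_mb((40, 12, 30, 192 // 4, 288 // 4)), 1))  # 199.1
```

So the proposed 4x4 x-y split with all 30 levels would land in roughly the same ~200 MB range as the current atmospheric chunks: multiplying levels by 15 while dividing the horizontal plane by 16 nearly cancels out.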
Also, @jhamman stated at the start of this project that it is possible to re-chunk under the hood if we don't like the arrangement, but I'm curious about how you do that in practice given the immutability of objects in an object store.
I also just opened the data and had a look. I agree with Ryan that rechunking so that each chunk contains all vertical levels would be very helpful: oceanographers like to plot sections! I don't object to chunking more in time in order to achieve this. I also think that it's sensible to continue chunking by memberID, because I will want to write and test my code for one member and then operate on all the members only once or twice. I'll probably hold off doing anything more until this is a bit more sorted out. Thanks to everyone for putting in this effort!
As an update, I have re-chunked the data accordingly for all ocean variables:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, member_id: 40, moc_z: 61, nlat: 384, nlon: 320, time: 1872, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* z_t (z_t) float32 500.0 1500.0 2500.0 ... 512502.8 537500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
* moc_z (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* time (time) object 1850-02-01 00:00:00 ... 2006-01-01 00:00:00
* member_id (member_id) int64 1 2 3 4 5 6 7 ... 34 35 101 102 103 104 105
Dimensions without coordinates: d2, nlat, nlon
Data variables:
time_bound (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
VVEL (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
As you can see, I removed the grid variables. I could use some feedback on my comment above in https://github.com/NCAR/cesm-lens-aws/issues/34#issuecomment-612556759 regarding what needs to go into a standalone grid dataset. The re-chunked data are residing on GLADE for now, and I am ready to transfer them to S3 once the grid dataset has been sorted out.
> The re-chunked data are residing on GLADE for now, and am ready to transfer them to S3 once the grid dataset has been sorted out.
Does @jhamman have a strategy for re-chunking in place directly on AWS S3? I suspect this would require reading data from the old objects, creating the new objects in a separate bucket as scratch space, deleting the old objects, copying the new objects to the main bucket, and then deleting them from the scratch bucket. I can create a scratch bucket under our AWS account if desired.
This is a minor nit, but I personally prefer time_bound to also be in coords, not data_vars. Then you will just have one data variable per dataset, which has a nice, clean feel.
Also, there appear to be quite a few coordinates that are not used by the data variables. These could probably be removed as well.
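Both suggestions are one-liners in xarray. A minimal sketch with a toy in-memory dataset (the `moc_z` coordinate below stands in for the unused coordinates; this is an assumption for illustration, not a statement about which coordinates in the published stores are unused):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "VVEL": (("time", "nlat"), np.zeros((2, 3), dtype="float32")),
        "time_bound": (("time", "d2"), np.zeros((2, 2))),
    },
    coords={"time": [0, 1], "moc_z": ("moc_z", [0.0, 1000.0])},
)

# Promote time_bound from data_vars to coords: one clean data variable left.
ds = ds.set_coords("time_bound")
print(list(ds.data_vars))  # ['VVEL']

# Drop coordinates that share no dimensions with the remaining data variable.
unused = [c for c in ds.coords if set(ds[c].dims).isdisjoint(ds["VVEL"].dims)]
ds = ds.drop_vars(unused)
print(sorted(ds.coords))  # ['time', 'time_bound']
```

Note that `time_bound` survives the second step because it shares the `time` dimension with the data variable, while the dangling `moc_z` coordinate is removed.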
I have created the Zarr files for TAUX and TAUY, but I chose to place all members in a single chunk because the chunks are so much smaller (these are 2D variables, so each chunk would be 1/60 the size of a 3D variable chunk).
But because I didn't perform the same metadata operations as @andersy005, and because they are fast to recreate, I will let Anderson make these also.
As an update: I have updated the chunking scheme for all existing ocean variables on AWS S3, removed the grid variables from the zarr stores, and created a standalone grid zarr store:
In [2]: import intake
...: url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
...: col = intake.open_esm_datastore(url)
...: subset = col.search(component='ocn')
In [3]: subset.unique(columns=['variable', 'experiment', 'frequency'])
Out[3]:
{'variable': {'count': 11,
'values': ['SALT',
'SFWF',
'SHF',
'SSH',
'SST',
'TEMP',
'UVEL',
'VNS',
'VNT',
'VVEL',
'WVEL']},
'experiment': {'count': 3, 'values': ['20C', 'CTRL', 'RCP85']},
'frequency': {'count': 1, 'values': ['monthly']}}
In [1]: import s3fs
...: import xarray as xr
...:
...: fs = s3fs.S3FileSystem(anon=True)
...: s3_path = 's3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'
...: ds = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
...: ds
Out[1]:
<xarray.Dataset>
Dimensions: (d2: 2, member_id: 1, nlat: 384, nlon: 320, time: 21612, z_w_top: 60)
Coordinates:
* member_id (member_id) int64 1
* time (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
time_bound (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
* z_w_top (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
Dimensions without coordinates: d2, nlat, nlon
Data variables:
WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
Attributes:
Conventions: CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
NCO: 4.3.4
calendar: All years have exactly 365 days.
cell_methods: cell_methods = time: mean ==> the variable val...
contents: Diagnostic and Prognostic Variables
nco_openmp_thread_number: 1
revision: $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
source: CCSM POP2, the CCSM Ocean Component
tavg_sum: 2678400.0
tavg_sum_qflux: 2678400.0
title: b.e11.B1850C5CN.f09_g16.005
In [2]: s3_path = 's3://ncar-cesm-lens/ocn/grid.zarr'
In [3]: grid = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
In [6]: xr.merge([ds, grid])
Out[6]:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, member_id: 1, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 21612, transport_comp: 5, transport_reg: 2, z_t: 1, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
* member_id (member_id) int64 1
* time (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
time_bound (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
* z_w_top (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
ANGLE (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
ANGLET (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
DXT (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
DXU (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
DYT (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
DYU (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HT (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HTE (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HTN (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HU (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HUS (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
HUW (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
KMT (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
KMU (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
REGION_MASK (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
T0_Kelvin float64 ...
TAREA (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
TLAT (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
UAREA (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
ULONG (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
cp_air float64 ...
cp_sw float64 ...
days_in_norm_year timedelta64[ns] ...
dz (z_t) float32 dask.array<chunksize=(1,), meta=np.ndarray>
dzw (z_w) float32 dask.array<chunksize=(60,), meta=np.ndarray>
fwflux_factor float64 ...
grav float64 ...
heat_to_PW float64 ...
hflux_factor float64 ...
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
latent_heat_fusion float64 ...
latent_heat_vapor float64 ...
mass_to_Sv float64 ...
moc_components (moc_comp) |S256 dask.array<chunksize=(3,), meta=np.ndarray>
* moc_z (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
momentum_factor float64 ...
nsurface_t float64 ...
nsurface_u float64 ...
ocn_ref_salinity float64 ...
omega float64 ...
ppt_to_salt float64 ...
radius float64 ...
rho_air float64 ...
rho_fw float64 ...
rho_sw float64 ...
salinity_factor float64 ...
salt_to_Svppt float64 ...
salt_to_mmday float64 ...
salt_to_ppt float64 ...
sea_ice_salinity float64 ...
sflux_factor float64 ...
sound float64 ...
stefan_boltzmann float64 ...
transport_components (transport_comp) |S256 dask.array<chunksize=(5,), meta=np.ndarray>
transport_regions (transport_reg) |S256 dask.array<chunksize=(2,), meta=np.ndarray>
vonkar float64 ...
* z_t (z_t) float32 500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 ... 549999.06
Dimensions without coordinates: d2, moc_comp, nlat, nlon, transport_comp, transport_reg
Data variables:
WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
> As an update, I updated the chunking scheme for all existing ocean variables on AWS-S3, removed the grid variables from the zarr stores, and created a standalone grid zarr store
@andersy005 Did you have to create the new Zarr on GLADE and then delete/upload/replace the Zarr stores on S3, or was it possible to re-chunk in place on AWS?
I am updating the dataset landing page to include the new variables.
QUESTION: We added VNS & VNT (salt and heat fluxes in y-direction). Shouldn't we also include UES & UET (salt and heat fluxes in x-direction), and maybe WTS & WTT (fluxes across top face)? I don't see how only one component of the flux vectors can be useful.
> I am updating the dataset landing page to include the new variables.
Hi Jeff, those variables are actually in transit now. I was going to announce their availability for performance testing after the transfer was completed. Once they have been transferred, I will update the catalog for AWS users. The variables in transit are:
3D variables: DIC, DOC, UES, UET, WTS, WTT, PD
2D variables: TAUX, TAUY, TAUX2, TAUY2, QFLUX, FW, HMXL, QSW_HTP, QSW_HBL, SHF_QSW, SFWF_WRST, RESID_S, RESID_T
It has been an uphill climb to understand the difficulties of creating very large Zarr stores; the Dask workers were bogging down and crashing at first, but I eventually worked out which configurations lead to successful Zarr writes.
@bonnland Excellent! Thank you very much. I will update the landing page to include those (but not publish until you are ready).
FYI the draft unpublished landing page with recent updates is temporarily at CESM_LENS_on_AWS.20200428.htm
@cspencerjones @rabernat @jbusecke Transfer of new ocean data is complete and available on Amazon AWS. It would be very helpful if someone could try a nontrivial computation with the data to make sure performance based on our chunking scheme is adequate.
I've confirmed that the Binder notebook on Amazon works (see the README.md for the link), and the variables are visible in the catalog. Here is what I got:
import intake
intakeEsmUrl = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(intakeEsmUrl)
subset = col.search(component='ocn')
subset.unique(columns=['variable', 'experiment', 'frequency'])
{'variable': {'count': 32,
'values': ['DIC',
'DOC',
'FW',
'HMXL',
'O2',
'PD',
'QFLUX',
'QSW_HBL',
'QSW_HTP',
'RESID_S',
'RESID_T',
'SALT',
'SFWF',
'SFWF_WRST',
'SHF',
'SHF_QSW',
'SSH',
'SST',
'TAUX',
'TAUX2',
'TAUY',
'TAUY2',
'TEMP',
'UES',
'UET',
'UVEL',
'VNS',
'VNT',
'VVEL',
'WTS',
'WTT',
'WVEL']},
'experiment': {'count': 3, 'values': ['20C', 'CTRL', 'RCP85']},
'frequency': {'count': 1, 'values': ['monthly']}}
I tried a few things with the data this morning, including calculating density from temperature and salinity and plotting sections, transforming some variables to density coordinates and plotting time means, etc. I tried using multiple workers as well. This worked OK, and I think the performance is adequate.
That's great to hear; we can tentatively move forward with the remaining variables requested so far. They are all 3D variables:
UVEL2, VVEL2, HDIFB_SALT, HDIFB_TEMP, HDIFE_SALT, HDIFE_TEMP, HDIFN_SALT, HDIFN_TEMP, KAPPA_ISOP, KAPPA_THIC, KPP_SRC_SALT, KPP_SRC_TEMP, VNT_ISOP, VNT_SUBM, HOR_DIFF
I've spent some time looking at MOC, which has a different parameterization than the other variables. Any thoughts on chunking are appreciated. At first glance, it seems we want to chunk in time, and leave all other dimensions unchunked, aiming for a chunk size between 100 and 200 MB.
netcdf b.e11.B20TRLENS_RCP85.f09_g16.xbmb.010.pop.h.MOC.192001-202912 {
dimensions:
d2 = 2 ;
time = UNLIMITED ; // (1320 currently)
moc_comp = 3 ;
transport_comp = 5 ;
transport_reg = 2 ;
lat_aux_grid = 395 ;
moc_z = 61 ;
nlon = 320 ;
nlat = 384 ;
float MOC(time, transport_reg, moc_comp, moc_z, lat_aux_grid) ;
MOC:_FillValue = 9.96921e+36f ;
MOC:long_name = "Meridional Overturning Circulation" ;
MOC:units = "Sverdrups" ;
MOC:coordinates = "lat_aux_grid moc_z moc_components transport_region time" ;
MOC:cell_methods = "time: mean" ;
MOC:missing_value = 9.96921e+36f ;
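Assuming float32 and the dimension sizes in the CDL header above, each MOC time step is well under a megabyte, so time is indeed the only dimension worth chunking. A quick sizing sketch (the 150 MB target is an illustrative midpoint of the 100-200 MB window):

```python
from math import prod

# Per-time-step size of MOC(time, transport_reg, moc_comp, moc_z, lat_aux_grid),
# float32, with all non-time dimensions unchunked.
per_step = prod((2, 3, 61, 395)) * 4   # bytes
print(per_step / 1e6)                  # 0.57828  (~0.58 MB per time step)

# Time-chunk length that lands near the middle of the 100-200 MB window.
target = 150e6
print(int(target // per_step))         # 259 time steps per chunk
```

So a time chunk of roughly 250-260 steps (about 21 years of monthly data) would hit the target, and any value in that neighborhood keeps chunks comfortably inside the 100-200 MB range.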
> FYI the draft unpublished landing page with recent updates is temporarily at CESM_LENS_on_AWS.20200428.htm
Now that the new data have been uploaded, I believe I can publish this draft as the new landing page. QUESTION: Does the page need to say anything about the new approach to repeated grid variables, or is that completely transparent to the user?
> QUESTION: Does the page need to say anything about the new approach to repeated grid variables, or is that completely transparent to the user?
There are still small inconsistencies to work out, AFAIK. Unless I am mistaken, Anderson republished all the ocean data with the grid variables removed, but grid variables still coexist in the atmospheric data, and those grid variables are probably distinct from the ocean ones.
The separate grid variables have been pushed to AWS, but they don't quite fit yet into our catalog framework, which is not yet general enough to handle variables that extend across experiments (CTRL, 20C, RCP85, etc). So the user can't load the grid variables until we generalize the catalog logic to make them available.
And I'm not yet clear on whether transparent loading of these variables is a simple matter. A simpler approach, from a data-provider engineering perspective, would be to modify the Kay notebook to show how grid variables are loaded for area-based computations, which would require republishing the atmosphere variables. So there are still some kinks to work out.
I started to look at the LENS AWS data and discovered there is very little available: only 3 variables, SALT (3D), SSH (2D), and SST (2D).
At minimum, I would also like to have THETA (3D), UVEL (3D), VVEL (3D), and WVEL (3D), and all the surface fluxes of heat and freshwater. Beyond that, it would be ideal to also have the necessary variables to reconstruct the tracer and momentum budgets.
Are there plans to add more data?