NCAR / cesm-lens-aws

Examples of analysis of CESM LENS data publicly available on Amazon S3 (us-west-2 region) using xarray and dask
https://doi.org/10.26024/wt24-5j82
BSD 3-Clause "New" or "Revised" License

Where are all the ocean variables? #34

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I started to look at the LENS AWS data and discovered there is very little available:

import intake
import intake_esm  # noqa: F401 -- registers the esm_datastore driver with intake

# Open the CESM1 LENS catalog on AWS and list the available ocean assets
url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(url)
col.search(component='ocn').df
    component   frequency   experiment  variable    path
0   ocn monthly 20C SALT    s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SAL...
1   ocn monthly 20C SSH s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SSH...
2   ocn monthly 20C SST s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SST...
3   ocn monthly CTRL    SALT    s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SA...
4   ocn monthly CTRL    SSH s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SS...
5   ocn monthly CTRL    SST s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SS...
6   ocn monthly RCP85   SALT    s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...
7   ocn monthly RCP85   SSH s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...
8   ocn monthly RCP85   SST s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...

There are only 3 variables: SALT (3D), SSH (2D), and SST (2D).

At minimum, I would also like to have THETA (3D), UVEL (3D), VVEL (3D), and WVEL (3D), and all the surface fluxes of heat and freshwater. Beyond that, it would be ideal to also have the necessary variables to reconstruct the tracer and momentum budgets.

Are there plans to add more data?
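
(For completeness, loading what is already there into xarray via intake-esm looks roughly like the sketch below; the `to_dataset_dict` keywords are the ones commonly used with this catalog, but they may differ across intake-esm versions.)

```
import intake

url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(url)

# Load the monthly ocean SST stores into a dict of xarray Datasets, keyed by experiment
dsets = col.search(component='ocn', frequency='monthly', variable='SST').to_dataset_dict(
    zarr_kwargs={'consolidated': True}, storage_options={'anon': True}
)
```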

jeffdlb commented 4 years ago

Ryan-

You are correct that there is not much ocean data there. The plan was to add additional data on request, so thanks for yours. We do have room in our AWS allocation to add more, and I would be glad for us to do so. I will discuss this with some team members at our meeting on Thursday. Besides the THETA, UVEL, VVEL, and WVEL fields you mentioned, can you indicate which other specific variables you would like? Gary Strand's list of variable names is at http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html

rabernat commented 4 years ago

Thanks @jeffdlb! I will review the list of fields and get back to you. Our general interest is ocean heat and salt budgets.

jeffdlb commented 4 years ago

P.S. We currently have only monthly ocean data, whereas the other realms also have some daily or 6-hourly data. Is monthly sufficient for your use case?

rabernat commented 4 years ago

For coarse-resolution (non-eddy-resolving) models, the ocean tends not to have much sub-monthly variability. If we did need daily data, it would just be the surface fluxes; monthly should be fine for everything else.

rabernat commented 4 years ago

OK, here is my best guess at the variables we would need for the heat and salt budgets. It would be good for someone with more POP experience (e.g. @matt-long) to verify:

THETA
UVEL
UVEL2
VVEL
VVEL2
WVEL
FW
HDIFB_SALT
HDIFB_TEMP
HDIFE_SALT
HDIFE_TEMP
HDIFN_SALT
HDIFN_TEMP
HMXL
HOR_DIFF
KAPPA_ISOP
KAPPA_THIC
KPP_SRC_SALT
KPP_SRC_TEMP
RESID_S
RESID_T
QFLUX
SHF
SHF_QSW
SFWF
SFWF_WRST
SSH
TAUX
TAUX2
TAUY
TAUY2
VNT_ISOP
VNT_SUBM
UES
UET
VNS
VNT
WTS
WTT

jeffdlb commented 4 years ago

Ryan-

Many of these variables do not seem to be present in the monthly ocean data on GLADE:

1310> pwd
/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries
1311> while read ln; do ls -d "monthly/$ln"; done < ~/oceanVars.txt
ls: cannot access monthly/THETA: No such file or directory
monthly/UVEL
ls: cannot access monthly/UVEL2: No such file or directory
monthly/VVEL
ls: cannot access monthly/VVEL2: No such file or directory
monthly/WVEL
ls: cannot access monthly/FW: No such file or directory
ls: cannot access monthly/HDIFB_SALT: No such file or directory
ls: cannot access monthly/HDIFB_TEMP: No such file or directory
ls: cannot access monthly/HDIFE_SALT: No such file or directory
ls: cannot access monthly/HDIFE_TEMP: No such file or directory
ls: cannot access monthly/HDIFN_SALT: No such file or directory
ls: cannot access monthly/HDIFN_TEMP: No such file or directory
monthly/HMXL
ls: cannot access monthly/HOR_DIFF: No such file or directory
ls: cannot access monthly/KAPPA_ISOP: No such file or directory
ls: cannot access monthly/KAPPA_THIC: No such file or directory
ls: cannot access monthly/KPP_SRC_SALT: No such file or directory
ls: cannot access monthly/KPP_SRC_TEMP: No such file or directory
ls: cannot access monthly/RESID_S: No such file or directory
ls: cannot access monthly/RESID_T: No such file or directory
monthly/QFLUX
monthly/SHF
monthly/SHF_QSW
monthly/SFWF
ls: cannot access monthly/SFWF_WRST: No such file or directory
monthly/SSH
monthly/TAUX
monthly/TAUX2
monthly/TAUY
monthly/TAUY2
ls: cannot access monthly/VNT_ISOP: No such file or directory
ls: cannot access monthly/VNT_SUBM: No such file or directory
monthly/UES
ls: cannot access monthly/UET: No such file or directory
monthly/VNS
ls: cannot access monthly/VNT: No such file or directory
monthly/WTS
ls: cannot access monthly/WTT: No such file or directory


matt-long commented 4 years ago

TEMP is the POP variable name for potential temperature (not THETA).

Many of the data not available on GLADE are available on HPSS.

They may also be on NCAR Campaign storage at /glade/campaign/cesm/collections/cesmLE (accessible on Casper).

Note that to close a tracer budget you need:

Advection:
UE{S,T}
VN{S,T}
WT{S,T}

Lateral diffusion (GM, submeso):
HDIF{E,N,B}_{S,T}

Vertical mixing:
KPP_SRC_{SALT,TEMP}
DIA_IMPVF_{SALT,TEMP}  # implicit diabatic vertical mixing; I think you missed this one

Surface fluxes (some choices):
SHF_QSW  # I don't think we save fully 3D QSW, unfortunately
QSW_HTP  # top layer
QSW_HBL  # boundary layer
SHF
QFLUX
SFWF

Inventory:
TEMP
SALT
SSH  # free-surface deviations impact the tracer mass in the top layer, where dz = dz + SSH
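
(Schematically, and glossing over grid metric factors and sign conventions, the temperature budget these diagnostics close is the one sketched below; the analogous *_SALT / *S terms close the salinity budget.)

```
\frac{\partial \theta}{\partial t}
  \approx \underbrace{-\nabla \cdot (\mathbf{u}\,\theta)}_{\text{UET, VNT, WTT}}
        + \underbrace{D_h(\theta)}_{\text{HDIF\{E,N,B\}\_TEMP}}
        + \underbrace{D_v(\theta)}_{\text{KPP\_SRC\_TEMP, DIA\_IMPVF\_TEMP}}
        + \underbrace{F_{\text{surf}}}_{\text{SHF, SHF\_QSW, QFLUX}}
```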

bonnland commented 4 years ago

According to Gary Strand, all of the LENS data is available on Glade here:

/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE

When I look at what is available for the ocean, I find this:

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls annual/
DIA_IMPVF_DIC/      HDIFB_DOC/      HDIFE_O2/       J_Fe/             KPP_SRC_Fe/      VN_DIC/      WT_DOC/
DIA_IMPVF_DIC_ALT_CO2/  HDIFB_Fe/       HDIFN_DIC/      J_NH4/            KPP_SRC_O2/      VN_DIC_ALT_CO2/  WT_Fe/
DIA_IMPVF_DOC/      HDIFB_O2/       HDIFN_DIC_ALT_CO2/  J_NO3/            UE_DIC/          VN_DOC/      WT_O2/
DIA_IMPVF_Fe/       HDIFE_DIC/      HDIFN_DOC/      J_PO4/            UE_DIC_ALT_CO2/  VN_Fe/
DIA_IMPVF_O2/       HDIFE_DIC_ALT_CO2/  HDIFN_Fe/       J_SiO3/           UE_DOC/          VN_O2/
HDIFB_DIC/      HDIFE_DOC/      HDIFN_O2/       KPP_SRC_DIC/          UE_Fe/           WT_DIC/
HDIFB_DIC_ALT_CO2/  HDIFE_Fe/       J_ALK/      KPP_SRC_DIC_ALT_CO2/  UE_O2/           WT_DIC_ALT_CO2/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls monthly
ALK/           DIC_ALT_CO2/       get         MELTH_F/         photoC_diat/  SENH_F/     TAUY/                VISOP/
ATM_CO2/       DOC/           HBLT/       MOC/         photoC_diaz/  SFWF/   TAUY2/               VNS/
BSF/           DpCO2/         HMXL/       N_HEAT/          photoC_sp/    SHF/    TBLT/                VVEL/
CFC11/         DpCO2_ALT_CO2/     IAGE/       N_SALT/          PREC_F/       SHF_QSW/    TEMP/                WISOP/
CFC_ATM_PRESS/     ECOSYS_ATM_PRESS/  IFRAC/          O2/          QFLUX/        SNOW_F/     tend_zint_100m_ALK/          WTS/
CFC_IFRAC/     ECOSYS_IFRAC/      INT_DEPTH/      O2_CONSUMPTION/  QSW_HBL/      spCaCO3/    tend_zint_100m_DIC/          WVEL/
CFC_XKW/       ECOSYS_XKW/        IOFF_F/         O2_PRODUCTION/   QSW_HTP/      spChl/  tend_zint_100m_DIC_ALT_CO2/  XBLT/
CO2STAR/       EVAP_F/        Jint_100m_ALK/  O2SAT/           RHO/      SSH/    tend_zint_100m_DOC/          XMXL/
DCO2STAR/      FG_ALT_CO2/        Jint_100m_DIC/  O2_ZMIN/         RHO_VINT/     SSH2/   tend_zint_100m_O2/       zsatarag/
DCO2STAR_ALT_CO2/  FG_CO2/        Jint_100m_DOC/  O2_ZMIN_DEPTH/   ROFF_F/       SST/    TLT/                 zsatcalc/
DIA_DEPTH/     FvICE_ALK/         Jint_100m_O2/   pCO2SURF/        SALT/         STF_CFC11/  TMXL/
diatChl/       FvICE_DIC/         LWDN_F/         PD/          SALT_F/       STF_O2/     UES/
diazChl/       FvPER_ALK/         LWUP_F/         PH/          SCHMIDT_CO2/  TAUX/   UISOP/
DIC/           FvPER_DIC/         MELT_F/         PH_ALT_CO2/      SCHMIDT_O2/   TAUX2/  UVEL/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls daily
CaCO3_form_zint/  diazC_zint_100m/  ECOSYS_XKW_2/  nday1/         spCaCO3_zint_100m/  SST/       TAUY_2/    zooC_zint_100m/
diatChl_SURF/     DpCO2_2/      FG_CO2_2/      photoC_diat_zint/  spChl_SURF/     SST2/      WVEL_50m/
diatC_zint_100m/  ecosys/       HBLT_2/    photoC_diaz_zint/  spC_zint_100m/      STF_O2_2/  XBLT_2/
diazChl_SURF/     ECOSYS_IFRAC_2/   HMXL_2/    photoC_sp_zint/    SSH_2/          TAUX_2/    XMXL_2/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ 

If someone can help decipher these variables and determine if they are worth publishing, I would be happy to work on getting them onto AWS.

bonnland commented 4 years ago

@jeffdlb @rabernat On second glance, it appears that my directory listing above shows that some variables are missing. I'm waiting to hear from Gary to get some clarification.

rabernat commented 4 years ago

@bonnland -- were you the one who produced the original S3 LENS datasets? If so, it would be nice to build on that effort. My impression from @jeffdlb is that they have a pipeline set up; they just need to find the data! Maybe you were part of that... sorry for my ignorance.

As for the missing variables, I guess I would just request that you take the intersection between my requested list and what is actually available. I think that the list of monthly and daily variables you showed above is a great start. I would use nearly all of it.

bonnland commented 4 years ago

@rabernat I just got word from Gary; I was originally in a slightly different folder. I have the correct folder path now, and all 273 monthly ocean variables appear to be present.

I was part of the original data publishing, so I know parts of the workflow. The most time-consuming part is creating the CSV file that describes an intake-esm catalog, which I did not originally take part in. The catalog is used to load the data into xarray, and the datasets are then written out to Zarr. We have the file paths now; I just need to work out how to construct the remaining fields of the CSV file.
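
(For anyone curious, the catalog CSV has one row per Zarr store, with the same columns shown in the listing at the top of this issue. A hypothetical sketch of generating rows for a newly published variable, following the existing cesmLE-<experiment>-<variable>.zarr naming pattern:)

```
import pandas as pd

# Hypothetical sketch: one catalog row per experiment for a newly published variable
rows = [
    {
        'component': 'ocn',
        'frequency': 'monthly',
        'experiment': exp,
        'variable': 'TEMP',
        'path': f's3://ncar-cesm-lens/ocn/monthly/cesmLE-{exp}-TEMP.zarr',
    }
    for exp in ['20C', 'CTRL', 'RCP85']
]
pd.DataFrame(rows).to_csv('aws-cesm1-le.csv', index=False)  # filename is illustrative
```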

bonnland commented 4 years ago

@rabernat I've loaded some variables, and the datasets are big. A single variable will take over 2TB. Here are some stats for five of the variables:

[Screenshot: uncompressed size statistics for five of the requested variables]

Note that these are uncompressed sizes; they will be smaller on disk.

Is there a priority ordering that makes sense if we can initially publish just a subset? Anderson believes that if the available space on AWS has not changed, we have around 30 TB available.

@jeffdlb Do you know more exactly how much space is left on S3, and when we might get more?

jeffdlb commented 4 years ago

Brian-

We are currently using 61.5 TB for 905023 objects in ncar-cesm-lens bucket, the vast majority of which is for atmosphere data. Ocean data only use 1.2 TB at present.

I don't think there is an automatically enforced limit on the allocation, so nothing will prevent writing objects beyond 100 TB. However, as a courtesy we should notify AWS if we plan to go over. They have already said we can use more if needed, within reason.

Is 20.95TB the total for all the new ocean variables, or only for a subset? If subset, can you estimate the total (uncompressed) for all the new vars?

-Jeff


andersy005 commented 4 years ago

Is 20.95TB the total for all the new ocean variables, or only for a subset? If subset, can you estimate the total (uncompressed) for all the new vars?

The 20.95 TB is for only 5 variables.

Of the 39 variables listed in https://github.com/NCAR/cesm-lens-aws/issues/34#issuecomment-585409543, we found 38. A back-of-the-envelope calculation shows that their total uncompressed size would be ~170 TB.

rabernat commented 4 years ago

Do we have any idea what typical zarr + zlib compression ratios are for these datasets? I would not be surprised to see a factor of 2 or more.

jeffdlb commented 4 years ago

@rabernat The one data point I have so far is for atm/monthly/cesmLE-RCP85-TREFHT.zarr: 5.5 GB stored vs. 10.1 GB uncompressed, a compression ratio of roughly 1.8.

jeffdlb commented 4 years ago

I will ask Joe & Ana whether we can have up to ~150 TB more. If not, we may need to prioritize.

@rabernat Do you know of any other expected users of these ocean variables? We might need to have some good justification for this >2x allocation increase.

rabernat commented 4 years ago

TEMP, UVEL, VVEL, WVEL, SHF, and SFWF would be the bare minimum I think.

Will try to get a sense of other potential users.

dbonan commented 4 years ago

I've been using some of the ocean output from the CESM-LE. I've mainly been looking at overturning, heat transport, and surface forcing (i.e., MOC, SHF, UVEL, VVEL, TEMP, SST, SALT). I know there would be a lot of interest in biogeochemical variables, too. I agree it would be nice to have this on AWS data storage!

cspencerjones commented 4 years ago

I would definitely use it if available! SHF, SFWF, UVEL, VVEL, VNS, VNT, TEMP and SALT at least would be helpful. But also TAUX, TAUY, UES, UET and PD would be good too.

jbusecke commented 4 years ago

I would definitely be keen to look at some biogeochemical variables, like DIC, DOC, and O2. The full O2 budget would be dope, but I presume that is a lot of data (I'm not exactly sure which terms are needed, but it seems they are usually the ones with '_O2' appended, e.g. VN_O2, UE_O2, etc.). Thanks for pinging me.

bonnland commented 4 years ago

@rabernat I'm in the process of creating the Zarr files for TEMP, UVEL, VVEL, WVEL, SHF, and SFWF, just as an initial test. I've discovered in the process that the coordinate dimension describing vertical levels has different names depending on the variable. For example:

 UVEL                  (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 12, 30, 384, 320), meta=np.ndarray>

 WVEL                  (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 12, 60, 384, 320), meta=np.ndarray>

The chunk size for UVEL is 30 because we were originally thinking of splitting the 60 vertical levels into two chunks. We could do the same for WVEL; we just need to be careful about using the different coordinate dimension names when we specify chunk sizes.

Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct? Also, is there perhaps a better chunking strategy than what we are considering here?
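
(One way to handle the varying vertical dimension names when specifying chunks; this is a sketch with a hypothetical helper, not something from the current workflow.)

```
# Hypothetical helper: build a chunk spec that adapts to whichever vertical
# dimension (z_t, z_w_top, z_w_bot, z_t_150m, ...) a given variable uses.
VERTICAL_DIMS = {'z_t', 'z_w', 'z_w_top', 'z_w_bot', 'z_t_150m'}

def chunk_spec(da, nlev=30, ntime=12):
    chunks = {'member_id': 1, 'time': ntime, 'nlat': -1, 'nlon': -1}
    zdim = next((d for d in da.dims if d in VERTICAL_DIMS), None)
    if zdim is not None:
        chunks[zdim] = nlev
    # keep only the dimensions the variable actually has
    return {d: c for d, c in chunks.items() if d in da.dims}

# e.g. ds['WVEL'].chunk(chunk_spec(ds['WVEL']))
```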

mnlevy1981 commented 4 years ago

@bonnland the different vertical coordinates signify different locations in the level: z_t is the center, while z_w_top is the top of the level and z_w_bot is the bottom of the level. Most variables will be at cell centers, i.e. z_t, though some of those are only saved in the top 150m (z_t_150m). Note that this last dimension is only 15 levels, rather than the 60 levels comprising the other dimensions.

That's a long-winded way of saying

Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct?

Keep them distinct, please

rabernat commented 4 years ago

Keep them distinct, please

👍. When producing analysis-ready data, we should always think very carefully before changing any of the metadata.

z_t is the center, while z_w_top is the top of the level and z_w_bot is the bottom of the level.

Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use nlon and nlat as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?

mnlevy1981 commented 4 years ago

Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use nlon and nlat as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?

I don't know for sure, but I suspect that POP (or an ancestor of POP) originally had z_w = z_t + 1 and then someone realized that all the variables output on the interface could be sorted into either the "0 at the surface" bucket or the "0 at the ocean floor" bucket so there was a chance to save some memory in output files by splitting the z_w coordinate into z_w_top and z_w_bot (and at that point, z_t and z_w were already ingrained in the code so it wasn't worth condensing to nz). Meanwhile, the two horizontal coordinate spaces (TLAT, TLONG and ULAT, ULONG)* always had the same dimensions because of the periodic nature of the horizontal grid. That's pure speculation, though.

* Going even further off topic, the inconsistency that trips me up is trying to remember when I need the "g" in "lon"... going off memory, I'm 80% sure it's nlon but TLONG and ULONG. I see the last two get shortened to tlon and ulon in random scripts often enough that I need to stop and think about it.

bonnland commented 4 years ago

@rabernat I'm finishing up the code for processing and publishing the ocean variables. I'd like to see what difference zlib compression makes. Are there any special parameters needed, or should I just use the defaults for the compression? Do you have an example of specifying this compression choice?

andersy005 commented 4 years ago

We may want to stick with the default compressor because it appears to be providing a pretty good compression ratio:

In [1]: import zarr

In [2]: zstore = "/glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr"

In [3]: ds = zarr.open_consolidated(zstore)

In [5]: ds["SFWF"].info
Out[5]:
Name               : /SFWF
Type               : zarr.core.Array
Data type          : float32
Shape              : (40, 1872, 384, 320)
Chunk shape        : (40, 12, 384, 320)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.ConsolidatedMetadataStore
Chunk store type   : zarr.storage.DirectoryStore
No. bytes          : 36805017600 (34.3G)
Chunks initialized : 156/156
In [7]: !du -h {zstore}
2.5K    /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr/time_bound
....
13G /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr
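
(For anyone who does want to try zlib instead of the default Blosc codec shown above, a minimal sketch of specifying it through xarray's `to_zarr` encoding follows; the input file, variable name, and output path are placeholders.)

```
import xarray as xr
import zarr

# Placeholder input: any dataset opened with dask chunks
ds = xr.open_dataset('input_timeseries.nc', chunks={'time': 12})

# Per-variable encoding; zarr.Zlib is the zlib codec. Omitting 'compressor'
# keeps zarr's default (the Blosc/lz4 settings shown in the output above).
encoding = {'SFWF': {'compressor': zarr.Zlib(level=5)}}
ds.to_zarr('cesmLE-20C-SFWF-zlib.zarr', encoding=encoding, consolidated=True)
```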

bonnland commented 4 years ago

@rabernat I should be transferring the following to AWS sometime today and tomorrow: TEMP, UVEL, VVEL, WVEL, VNS, VNT, SHF, SFWF. All will cover the CTRL, RCP85, and 20C experiments. @andersy005 should be updating the AWS intake catalog when the transfer is complete.

bonnland commented 4 years ago

Actually, it looks like we inadvertently wrote out the Zarr files with incorrect metadata. It is going to take a few more days to re-write and then transfer to AWS.

rabernat commented 4 years ago

@bonnland no worries, obviously everything is going slow these days.

Could you explain a bit about how this incorrect metadata arose? Just for the sake of those watching this thread, it would be good to understand the potential pitfalls in producing Zarr datasets.


jeffdlb commented 4 years ago

Thanks for working on this, @bonnland and @andersy005 .

Regarding the metadata: Given that it is all in separate text/JSON files, can those files be replaced with correct versions without regenerating all the binary objects? I ask because (a) although it might seem simpler just to regenerate while the data are still on GLADE, it would be good to have a repair-in-place capability once the data have been uploaded, and (b) the zarr metadata is shockingly limited in terms of scientific and geospatial information (not even a link to a separate metadata record), and we should be able to augment that metadata in the future.

rabernat commented 4 years ago

Given that it is all in separate text/json files, can those files be replaced with correct versions without regenerating all the binary objects?

Exactly why I was asking. This is precisely one of the main advantages of zarr. You can fix metadata by editing a simple text file.
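
(As a concrete sketch, equivalent to hand-editing the .zattrs / .zmetadata JSON; the attribute names and values below are purely illustrative.)

```
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=False)  # write access to the bucket is required
store = fs.get_mapper('s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SFWF.zarr')

grp = zarr.open_group(store, mode='r+')
grp.attrs['institution'] = 'NCAR'          # illustrative group-level edit
grp['SFWF'].attrs['units'] = 'kg/m^2/s'    # illustrative variable-level edit

# Rewrite the consolidated .zmetadata so open_zarr(..., consolidated=True) sees the changes
zarr.consolidate_metadata(store)
```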

the zarr metadata is shockingly limited in terms of scientific and geospatial information (not even a link to a separate metadata record) and we should be able to augment that metadata in the future.

This can be controlled by the data producer. Here is an example of a zarr record with lots of metadata preserved from the original NetCDF files: https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

jeffdlb commented 4 years ago

@rabernat wrote:

This can be controlled by the data producer. Here is an example of a zarr record with lots of metadata preserved from the original NetCDF files: https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

Now that is exemplary! We should be doing the same with our NCAR data.

jeffdlb commented 4 years ago

@bonnland Do you have an updated estimate of the compressed size of the new data? I asked our AWS friends about going above our limit but have not heard back, and I want to ping them again with more accurate numbers if available.

bonnland commented 4 years ago

@rabernat @jeffdlb I will try to relay what Anderson has told me; @andersy005 may need to verify that I have it right. The original CESM output files were not very careful about which variables were placed in the coordinates section. The concatenation behavior of xarray has been changing over time, so some expert care is needed in examining these coordinate sections to make sure that xarray datasets can be combined when appropriate. This last time, we tried a manual approach to moving variables into the appropriate section, but we didn't catch all of them. Here is what our failed attempt looked like for TEMP:


In [5]: temp
Out[5]:
<xarray.Dataset>
Dimensions:               (d2: 2, lat_aux_grid: 395, member_id: 40, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 1140, transport_comp: 5, transport_reg: 2, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
    TLAT                  (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    TLONG                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULAT                  (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULONG                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
  * lat_aux_grid          (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
  * member_id             (member_id) int64 1 2 3 4 5 6 ... 101 102 103 104 105
  * moc_z                 (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
  * time                  (time) object 2006-02-01 00:00:00 ... 2101-01-01 00:00:00
  * z_t                   (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
  * z_t_150m              (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot               (z_w_bot) float32 1000.0 2000.0 ... 549999.06
  * z_w_top               (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94

Here is what the header for the prior-published SALT variable looks like:

In [2]: salt = xr.open_zarr("/glade/campaign/cisl/iowa/lens-aws/ocn/monthly/cesmLE-20C-SALT.zarr", consolidated=True)

In [3]: salt
Out[3]:
<xarray.Dataset>
Dimensions:               (d2: 2, member_id: 40, moc_comp: 3, nlat: 384, nlon: 320, time: 1032, transport_comp: 5, transport_reg: 2, z_t: 60, z_w: 60)
Coordinates:
  * member_id             (member_id) int64 1 2 3 4 5 6 ... 101 102 103 104 105
  * time                  (time) object 1920-02-01 00:00:00 ... 2006-01-01 00:00:00
  * z_t                   (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94

Apparently the capitalized variables in the Coordinates section (TLONG, etc.) should not be there, and it is not a simple matter of just rewriting the metadata section.
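
(For illustration only, and assuming the goal is simply to match the header of the earlier SALT store, the kind of adjustment involved looks roughly like the sketch below; whether demoting or dropping these variables is appropriate is exactly the open question here.)

```
# Illustrative only: demote the 2D grid variables from coordinates to plain data
# variables (or drop them) before concatenating/writing, so the header matches
# the previously published stores. Names are taken from the TEMP example above.
grid_coords = ['TLAT', 'TLONG', 'ULAT', 'ULONG']
temp_fixed = temp.reset_coords(grid_coords)   # back to data variables
# temp_fixed = temp.drop_vars(grid_coords)    # ...or remove them entirely
```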

bonnland commented 4 years ago

@jeffdlb At first glance, there are discrepancies in the compression rate between the Zarr files that I produced and those that Anderson produced. I am seeing a compression savings of around 50-70% (600 GB on disk for a 1.2 TB Zarr store), and @andersy005 is seeing 600GB on disk for a 2.1 TB Zarr store. We should have a better understanding and answer in the next day or so.

bonnland commented 4 years ago

@jeffdlb Each 2D variable (SST, etc) is taking about 35 +/- 5GB of disk space for the three experiments (CTRL, 20C, RCP85). Each 3D variable (TEMP, UVEL, etc) is taking about 1800 +/- 100 GB of disk space for the same three experiments.

jeffdlb commented 4 years ago

Thanks for that estimate. I don't know exactly what list of variables we settled on, however -- how many 2D and 3D?

bonnland commented 4 years ago

@jeffdlb We decided on adding six 3D variables (TEMP, UVEL, VVEL, WVEL, VNS, VNT) and two 2D variables (SHF, SFWF). The total disk space needed for these new variables was around 10TB total.

@rabernat The Zarr files have been created and were transferred to AWS over the weekend. When @andersy005 merges #37, you will have the updated catalog. Could you let us know whether you are successful accessing the data, and whether you want to prioritize any other variables in the short term?

bonnland commented 4 years ago

@rabernat If you are aware of any need or desire for more variables right now, just reply here. I've tried loading and using the variables recently added on AWS, and it appears to work for me.

jeffdlb commented 4 years ago

@bonnland @andersy005 Thank you for creating and copying Zarr and updating the intake-esm catalog. My documentation page needs to catch up now.

Regarding more data: I thought @rabernat and others had a considerably longer list than just the six 3D variables (TEMP, UVEL, VVEL, WVEL, VNS, VNT) and two 2D variables (SHF, SFWF) mentioned above. AWS has said we can go up to 200 TB if necessary.

bonnland commented 4 years ago

@jeffdlb OK, it sounds like we have no space shortages after all. I thought maybe we were pushing up against a 100TB limit for all data, but 200TB gives us plenty to work with. I can go about publishing the remaining variables that have been mentioned, and these should fit within the ~120 TB of available space on AWS.

It feels slightly risky to do this work before verifying that computational performance is acceptable with the chunking strategy we have used so far, but our experience with the Kay et al. notebook suggests that performance is good enough.

rabernat commented 4 years ago

Hi everyone! Thanks for pushing this forward!

Like many others, I have fallen behind on my projects during the current situation. I am still enthusiastic about working with this data, but it has been hard to find time recently. My apologies.

jeffdlb commented 4 years ago

So should we hold off on copying additional data to AWS?

rabernat commented 4 years ago

I don't think I can make that decision for you. I'm simply stating that, due to the coronavirus pandemic and its impacts on my time (enormous new child-care responsibilities, remote teaching, etc.), I personally won't be able to do much on this until May (after the spring semester).

cspencerjones commented 4 years ago

Thanks very much for doing this! I will try to make a start with what's there sometime next week. If it is easy to upload TAUX, TAUY, those would also be helpful to have (though I can start without them if you'd prefer to wait until I've tried it).

bonnland commented 4 years ago

@cspencerjones Thanks for offering to check things. It takes a good chunk of CPU hours to produce these files, so I'd feel better knowing there isn't some glitch in what we have so far that makes these data difficult to use.

I will create and upload TAUX and TAUY, hopefully by Tuesday, and I'll respond here when they are ready. It would be great to see if you can use them successfully before creating more Zarr files.

rabernat commented 4 years ago

I did find some time to simply open up some data. Overall it looks good! Thanks for making this happen. Based on this quick look, I do have some feedback.

Let's consider, for example, WVEL, the vertical velocity:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
s3_path = 's3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'
ds = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
ds

Which gives the following long output:

```
Dimensions:               (d2: 2, lat_aux_grid: 395, member_id: 1, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320,
                           time: 21612, transport_comp: 5, transport_reg: 2, z_t: 60, z_t_150m: 15, z_w: 60,
                           z_w_bot: 60, z_w_top: 60)
Coordinates:
  * lat_aux_grid          (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
  * member_id             (member_id) int64 1
  * moc_z                 (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
  * time                  (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
  * z_t                   (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
  * z_t_150m              (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot               (z_w_bot) float32 1000.0 2000.0 ... 549999.06
  * z_w_top               (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
Dimensions without coordinates: d2, moc_comp, nlat, nlon, transport_comp, transport_reg
Data variables:
    ANGLE                 (nlat, nlon) float64 dask.array
    ANGLET                (nlat, nlon) float64 dask.array
    DXT                   (nlat, nlon) float64 dask.array
    DXU                   (nlat, nlon) float64 dask.array
    DYT                   (nlat, nlon) float64 dask.array
    DYU                   (nlat, nlon) float64 dask.array
    HT                    (nlat, nlon) float64 dask.array
    HTE                   (nlat, nlon) float64 dask.array
    HTN                   (nlat, nlon) float64 dask.array
    HU                    (nlat, nlon) float64 dask.array
    HUS                   (nlat, nlon) float64 dask.array
    HUW                   (nlat, nlon) float64 dask.array
    KMT                   (nlat, nlon) float64 dask.array
    KMU                   (nlat, nlon) float64 dask.array
    REGION_MASK           (nlat, nlon) float64 dask.array
    T0_Kelvin             float64 ...
    TAREA                 (nlat, nlon) float64 dask.array
    TLAT                  (nlat, nlon) float64 dask.array
    TLONG                 (nlat, nlon) float64 dask.array
    UAREA                 (nlat, nlon) float64 dask.array
    ULAT                  (nlat, nlon) float64 dask.array
    ULONG                 (nlat, nlon) float64 dask.array
    WVEL                  (member_id, time, z_w_top, nlat, nlon) float32 dask.array
    cp_air                float64 ...
    cp_sw                 float64 ...
    days_in_norm_year     timedelta64[ns] ...
    dz                    (z_t) float32 dask.array
    dzw                   (z_w) float32 dask.array
    fwflux_factor         float64 ...
    grav                  float64 ...
    heat_to_PW            float64 ...
    hflux_factor          float64 ...
    latent_heat_fusion    float64 ...
    latent_heat_vapor     float64 ...
    mass_to_Sv            float64 ...
    moc_components        (moc_comp) |S256 dask.array
    momentum_factor       float64 ...
    nsurface_t            float64 ...
    nsurface_u            float64 ...
    ocn_ref_salinity      float64 ...
    omega                 float64 ...
    ppt_to_salt           float64 ...
    radius                float64 ...
    rho_air               float64 ...
    rho_fw                float64 ...
    rho_sw                float64 ...
    salinity_factor       float64 ...
    salt_to_Svppt         float64 ...
    salt_to_mmday         float64 ...
    salt_to_ppt           float64 ...
    sea_ice_salinity      float64 ...
    sflux_factor          float64 ...
    sound                 float64 ...
    stefan_boltzmann      float64 ...
    time_bound            (time, d2) object dask.array
    transport_components  (transport_comp) |S256 dask.array
    transport_regions     (transport_reg) |S256 dask.array
    vonkar                float64 ...
Attributes:
    Conventions:              CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
    NCO:                      4.3.4
    calendar:                 All years have exactly 365 days.
    cell_methods:             cell_methods = time: mean ==> the variable val...
    contents:                 Diagnostic and Prognostic Variables
    history:                  Thu Oct 10 08:38:35 2013: /glade/apps/opt/nco/...
    intake_esm_varname:       WVEL
    nco_openmp_thread_number: 1
    revision:                 $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
    source:                   CCSM POP2, the CCSM Ocean Component
    tavg_sum:                 2678400.0
    tavg_sum_qflux:           2678400.0
    title:                    b.e11.B1850C5CN.f09_g16.005
```

Based on this, I have two suggestions.

  1. All variables but WVEL should be coordinates, not data variables. This is easily accomplished with the following code:
    coord_vars = [vname for vname in ds.data_vars if 'time' not in ds[vname].dims]
    ds_fixed = ds.set_coords(coord_vars)

    It should be possible to fix this issue just by rewriting the zarr metadata, rather than re-outputting the whole dataset.

  2. The chunk choice on WVEL (and presumably other 3D variables) is, in my view, less than ideal:
    WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 480, 1, 384, 320), meta=np.ndarray>

    First, the chunks are on the large side (235.93 MB). Second, each vertical level is in a separate chunk, while 20 years of time are stored contiguously. If I want to get a complete 3D field for a single timestep, I therefore have to download over 14 GB of data. I recognize that the choice of chunks is subjective and depends on the use case. However, based on my experience working with ocean model output, I think the most common use case is to access all vertical levels in a single contiguous chunk. (This corresponds to how netCDF files are commonly output and is what people are used to.) I would recommend instead using chunks ds.WVEL.chunk({'time': 6, 'z_w_top': -1, 'nlon': -1, 'nlat': -1}), which would produce ~175 MB chunks.

I hope this feedback is useful.
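
A minimal sketch of the rechunk-and-rewrite suggested in point 2, assuming a local rewrite (the output path and the encoding cleanup are illustrative):

```
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(fs.get_mapper('s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'),
                  consolidated=True)

# All vertical levels and the full horizontal grid in one chunk, 6 time steps per chunk
ds = ds.chunk({'time': 6, 'z_w_top': -1, 'nlat': -1, 'nlon': -1})

# Drop the chunk encoding inherited from the source store so to_zarr uses the new chunks
for v in ds.variables:
    ds[v].encoding.pop('chunks', None)

ds.to_zarr('cesmLE-CTRL-WVEL-rechunked.zarr', consolidated=True)  # output path is hypothetical
```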

bonnland commented 4 years ago

@rabernat That is helpful feedback, and worth talking about IMHO, thank you. I will move forward with the chunking you suggest if I don't hear any objections in the next day or so.

This issue of which variables should be coordinates has come up before in discussions with @andersy005. In the original NetCDF files, these extra variables differ across variables, and possibly across ensemble members (for example, ULAT, ULONG, TLAT, and TLONG are missing in some cases). The differences can apparently prevent concatenation into xarray objects from working properly; I'm not as clear as Anderson on the potential problems. At any rate, it's good that we can address the metadata later if needed, which means I can move forward with creating these variables now.

rabernat commented 4 years ago

The differences can apparently prevent concatenation into Xarray objects from working properly. I'm not as clear as Anderson on the potential problems

I can see how this could cause problems. However, I personally prefer to have all that stuff as coordinates. It's easy enough to just .reset_coords(drop=True) before any merge / alignment operations.

An even better option is to just drop all of the non-dimension coordinates before writing the zarr data, and then saving them to a standalone grid dataset, which can be brought in as needed for geometric calculations. That's what we did, for example, with the MITgcm LLC4320 dataset. You're currently wasting a non-negligible amount of space by storing all of these duplicate TAREA etc. variables in each of the ocean datasets.
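
To make the grid-dataset idea concrete, a rough sketch (store names are hypothetical, and this assumes the grid variables are still data variables, as in the WVEL listing above):

```
import xarray as xr

# Any one of the ocean stores carries the full set of grid/constant variables
ds = xr.open_zarr('cesmLE-CTRL-WVEL.zarr', consolidated=True)

# Time-invariant variables: TAREA, TLAT, dz, scalar constants, etc.
grid_vars = [v for v in ds.data_vars if 'time' not in ds[v].dims]

# Write them once to a standalone grid store and drop them from the per-variable stores
ds[grid_vars].to_zarr('pop-grid.zarr', mode='w', consolidated=True)
ds_slim = ds.drop_vars(grid_vars)
```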