rabernat opened this issue 4 years ago (status: Open)
Ryan-
You are correct there is not much ocean data there. The plan was to add additional data on request, so thanks for your request. We do have room in our AWS allocation to add more, and I would be glad for us to do so. I will discuss this with some team members at our meeting on Thursday. Besides the THETA, UVEL, VVEL, WVEL fields you mentioned, can you indicate which specific other variables you desire? Gary Strand's list of variable names is at http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html
Thanks @jeffdlb! I will review the list of fields and get back to you. Our general interest is ocean heat and salt budgets.
PS. We currently only have monthly ocean data, whereas the other realms also have some daily or 6-hour data. Is monthly sufficient for your use case?
For coarse-resolution (non eddy-resolving) models, the oceans tend not to have too much sub-monthly variability. If we did need daily data, it would just be surface fluxes. Monthly should be fine for the other stuff.
Ok, here is my best guess at identifying the variables we would need for the heat and salt budgets. Would be good for someone with more POP experience (e.g. @matt-long) to verify.
THETA
UVEL
UVEL2
VVEL
VVEL2
WVEL
FW
HDIFB_SALT
HDIFB_TEMP
HDIFE_SALT
HDIFE_TEMP
HDIFN_SALT
HDIFN_TEMP
HMXL
HOR_DIFF
KAPPA_ISOP
KAPPA_THIC
KPP_SRC_SALT
KPP_SRC_TEMP
RESID_S
RESID_T
QFLUX
SHF
SHF_QSW
SFWF
SFWF_WRST
SSH
TAUX
TAUX2
TAUY
TAUY2
VNT_ISOP
VNT_SUBM
UES
UET
VNS
VNT
WTS
WTT
Ryan-
Many of these variables do not seem present in the monthly ocean data on GLADE:
1310> pwd
/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries
1311> while read ln; do ls -d "monthly/$ln"; done < ~/oceanVars.txt
ls: cannot access monthly/THETA: No such file or directory
monthly/UVEL
ls: cannot access monthly/UVEL2: No such file or directory
monthly/VVEL
ls: cannot access monthly/VVEL2: No such file or directory
monthly/WVEL
ls: cannot access monthly/FW: No such file or directory
ls: cannot access monthly/HDIFB_SALT: No such file or directory
ls: cannot access monthly/HDIFB_TEMP: No such file or directory
ls: cannot access monthly/HDIFE_SALT: No such file or directory
ls: cannot access monthly/HDIFE_TEMP: No such file or directory
ls: cannot access monthly/HDIFN_SALT: No such file or directory
ls: cannot access monthly/HDIFN_TEMP: No such file or directory
monthly/HMXL
ls: cannot access monthly/HOR_DIFF: No such file or directory
ls: cannot access monthly/KAPPA_ISOP: No such file or directory
ls: cannot access monthly/KAPPA_THIC: No such file or directory
ls: cannot access monthly/KPP_SRC_SALT: No such file or directory
ls: cannot access monthly/KPP_SRC_TEMP: No such file or directory
ls: cannot access monthly/RESID_S: No such file or directory
ls: cannot access monthly/RESID_T: No such file or directory
monthly/QFLUX
monthly/SHF
monthly/SHF_QSW
monthly/SFWF
ls: cannot access monthly/SFWF_WRST: No such file or directory
monthly/SSH
monthly/TAUX
monthly/TAUX2
monthly/TAUY
monthly/TAUY2
ls: cannot access monthly/VNT_ISOP: No such file or directory
ls: cannot access monthly/VNT_SUBM: No such file or directory
monthly/UES
ls: cannot access monthly/UET: No such file or directory
monthly/VNS
ls: cannot access monthly/VNT: No such file or directory
monthly/WTS
ls: cannot access monthly/WTT: No such file or directory
Jeff de La Beaujardiere, PhD Director, NCAR/CISL Information Systems Division https://staff.ucar.edu/users/jeffdlb https://orcid.org/0000-0002-1001-9210
TEMP is the POP variable name for potential temperature (not THETA).
Many of the data not available on glade are available on HPSS
And (possibly) on NCAR Campaign storage here /glade/campaign/cesm/collections/cesmLE (accessible on Casper)
Note that to close a tracer budget you need:

Advection:
    UE{S,T}, VN{S,T}, WT{S,T}
Lateral diffusion (GM, submeso):
    HDIF{E,N,B}_{S,T}
Vertical mixing:
    KPP_SRC_{SALT,TEMP}
    DIA_IMPVF_{SALT,TEMP}  # implicit diabatic vertical mixing, I think you missed this one
Surface fluxes:  # some choices
    SHF_QSW  # I don't think we save fully 3D QSW, unfortunately
    QSW_HTP  # top layer
    QSW_HBL  # boundary layer
    SHF
    QFLUX
    SFWF
Inventory:
    TEMP
    SALT
    SSH  # free-surface deviations impact the tracer mass in the top layer, where dz = dz + SSH
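As a rough illustration, the budget terms could be combined along these lines once loaded into an xarray Dataset. This is only a sketch on synthetic data: the variable names follow the list above, but the sign conventions and the fact that some terms (e.g. DIA_IMPVF) live on interface levels are glossed over and should be checked against the POP reference manual.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the POP heat-budget terms on a tiny grid;
# in practice these would be loaded from the CESM-LENS stores.
dims = ("time", "z_t", "nlat", "nlon")
shape = (2, 3, 4, 5)
rng = np.random.default_rng(0)
terms = ["UET", "VNT", "WTT", "HDIFE_TEMP", "HDIFN_TEMP", "HDIFB_TEMP",
         "KPP_SRC_TEMP", "DIA_IMPVF_TEMP"]
ds = xr.Dataset({t: (dims, rng.normal(size=shape).astype("float32")) for t in terms})

# Tendency of TEMP as advection + lateral diffusion + vertical mixing
# (signs and interface-level differencing omitted; check the POP manual).
advection = -(ds.UET + ds.VNT + ds.WTT)
lateral_diffusion = ds.HDIFE_TEMP + ds.HDIFN_TEMP + ds.HDIFB_TEMP
vertical_mixing = ds.KPP_SRC_TEMP + ds.DIA_IMPVF_TEMP
budget = advection + lateral_diffusion + vertical_mixing
print(budget.dims, budget.shape)
```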
According to Gary Strand, all of the LENS data is available on Glade here:
/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
When I look at what is available for the ocean, I find this:
casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls annual/
DIA_IMPVF_DIC/ HDIFB_DOC/ HDIFE_O2/ J_Fe/ KPP_SRC_Fe/ VN_DIC/ WT_DOC/
DIA_IMPVF_DIC_ALT_CO2/ HDIFB_Fe/ HDIFN_DIC/ J_NH4/ KPP_SRC_O2/ VN_DIC_ALT_CO2/ WT_Fe/
DIA_IMPVF_DOC/ HDIFB_O2/ HDIFN_DIC_ALT_CO2/ J_NO3/ UE_DIC/ VN_DOC/ WT_O2/
DIA_IMPVF_Fe/ HDIFE_DIC/ HDIFN_DOC/ J_PO4/ UE_DIC_ALT_CO2/ VN_Fe/
DIA_IMPVF_O2/ HDIFE_DIC_ALT_CO2/ HDIFN_Fe/ J_SiO3/ UE_DOC/ VN_O2/
HDIFB_DIC/ HDIFE_DOC/ HDIFN_O2/ KPP_SRC_DIC/ UE_Fe/ WT_DIC/
HDIFB_DIC_ALT_CO2/ HDIFE_Fe/ J_ALK/ KPP_SRC_DIC_ALT_CO2/ UE_O2/ WT_DIC_ALT_CO2/
casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls monthly
ALK/ DIC_ALT_CO2/ get MELTH_F/ photoC_diat/ SENH_F/ TAUY/ VISOP/
ATM_CO2/ DOC/ HBLT/ MOC/ photoC_diaz/ SFWF/ TAUY2/ VNS/
BSF/ DpCO2/ HMXL/ N_HEAT/ photoC_sp/ SHF/ TBLT/ VVEL/
CFC11/ DpCO2_ALT_CO2/ IAGE/ N_SALT/ PREC_F/ SHF_QSW/ TEMP/ WISOP/
CFC_ATM_PRESS/ ECOSYS_ATM_PRESS/ IFRAC/ O2/ QFLUX/ SNOW_F/ tend_zint_100m_ALK/ WTS/
CFC_IFRAC/ ECOSYS_IFRAC/ INT_DEPTH/ O2_CONSUMPTION/ QSW_HBL/ spCaCO3/ tend_zint_100m_DIC/ WVEL/
CFC_XKW/ ECOSYS_XKW/ IOFF_F/ O2_PRODUCTION/ QSW_HTP/ spChl/ tend_zint_100m_DIC_ALT_CO2/ XBLT/
CO2STAR/ EVAP_F/ Jint_100m_ALK/ O2SAT/ RHO/ SSH/ tend_zint_100m_DOC/ XMXL/
DCO2STAR/ FG_ALT_CO2/ Jint_100m_DIC/ O2_ZMIN/ RHO_VINT/ SSH2/ tend_zint_100m_O2/ zsatarag/
DCO2STAR_ALT_CO2/ FG_CO2/ Jint_100m_DOC/ O2_ZMIN_DEPTH/ ROFF_F/ SST/ TLT/ zsatcalc/
DIA_DEPTH/ FvICE_ALK/ Jint_100m_O2/ pCO2SURF/ SALT/ STF_CFC11/ TMXL/
diatChl/ FvICE_DIC/ LWDN_F/ PD/ SALT_F/ STF_O2/ UES/
diazChl/ FvPER_ALK/ LWUP_F/ PH/ SCHMIDT_CO2/ TAUX/ UISOP/
DIC/ FvPER_DIC/ MELT_F/ PH_ALT_CO2/ SCHMIDT_O2/ TAUX2/ UVEL/
casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls daily
CaCO3_form_zint/ diazC_zint_100m/ ECOSYS_XKW_2/ nday1/ spCaCO3_zint_100m/ SST/ TAUY_2/ zooC_zint_100m/
diatChl_SURF/ DpCO2_2/ FG_CO2_2/ photoC_diat_zint/ spChl_SURF/ SST2/ WVEL_50m/
diatC_zint_100m/ ecosys/ HBLT_2/ photoC_diaz_zint/ spC_zint_100m/ STF_O2_2/ XBLT_2/
diazChl_SURF/ ECOSYS_IFRAC_2/ HMXL_2/ photoC_sp_zint/ SSH_2/ TAUX_2/ XMXL_2/
casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$
If someone can help decipher these variables and determine if they are worth publishing, I would be happy to work on getting them onto AWS.
@jeffdlb @rabernat On second glance, it appears that my directory listing above shows that some variables are missing. I'm waiting to hear from Gary to get some clarification.
@bonnland -- were you the one who produced the original S3 LENS datasets? If so, it would be nice to build on that effort. My impression from @jeffdlb is that they have a pipeline set up, they just need to find the data! Maybe you were part of that...sorry for my ignorance.
As for the missing variables, I guess I would just request that you take the intersection between my requested list and what is actually available. I think that the list of monthly and daily variables you showed above is a great start. I would use nearly all of it.
@rabernat I just got word from Gary; I was originally in a slightly different folder. I have the correct folder path now, and all 273 monthly ocean variables appear to be present.
I was part of the original data publishing, so I know parts of the workflow. The most time consuming part is creating the CSV file describing an intake-esm catalog, which I did not originally take part in. The catalog is used to load data into xarray and then write out to Zarr. We have the file paths now; I just need to research how to construct the remaining fields for the CSV file.
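For anyone curious what that CSV-building step involves, it might look roughly like the sketch below. The file paths are hypothetical examples following the CESM time-series naming scheme, and the column names are assumptions; the real intake-esm catalog schema should be taken from the existing atmosphere catalog.

```python
import csv
import io
import re

# Hypothetical file paths following the CESM-LENS time-series naming scheme;
# the real catalog would be built by globbing the GLADE directory tree.
paths = [
    "ocn/proc/tseries/monthly/TEMP/b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.TEMP.192001-200512.nc",
    "ocn/proc/tseries/monthly/SHF/b.e11.BRCP85C5CNBDRD.f09_g16.002.pop.h.SHF.200601-208012.nc",
]

# Parse case, ensemble member, variable, and date range out of the filename.
pattern = re.compile(
    r"b\.e11\.(?P<case>\w+)\.f09_g16\.(?P<member>\d+)\.pop\.h\."
    r"(?P<variable>\w+)\.(?P<daterange>[\d-]+)\.nc"
)

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["component", "frequency", "experiment", "member_id", "variable", "path"])
writer.writeheader()
for p in paths:
    m = pattern.search(p)
    experiment = "20C" if "B20TR" in m.group("case") else "RCP85"
    writer.writerow({"component": "ocn", "frequency": "monthly",
                     "experiment": experiment,
                     "member_id": int(m.group("member")),
                     "variable": m.group("variable"), "path": p})
print(buf.getvalue())
```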
@rabernat I've loaded some variables, and the datasets are big. A single variable will take over 2TB. Here are some stats for five of the variables:
Note that these sizes are uncompressed sizes, and they will be smaller on disk.
Is there a priority ordering that makes sense if we can initially publish just a subset? Anderson believes that if the available space on AWS has not changed, we have around 30 TB available.
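For reference, the uncompressed size can be estimated from the array shape alone, without reading any data. A stand-in with LENS-like dimensions (40 members, 1872 months, 60 levels, 384x320 grid, float32) comes out just over 2 TB per 3D variable, consistent with the numbers above:

```python
import numpy as np
import xarray as xr

# A zero-cost stand-in (broadcast view) with CESM-LENS-like dimensions;
# the real data would come lazily from the intake-esm catalog.
shape = (40, 1872, 60, 384, 320)
temp = xr.DataArray(
    np.broadcast_to(np.float32(0.0), shape),
    dims=("member_id", "time", "z_t", "nlat", "nlon"), name="TEMP")

# 4 bytes per float32 value; nbytes is the in-memory (uncompressed) size.
print(f"{temp.nbytes / 1e12:.2f} TB")  # → 2.21 TB
```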
@jeffdlb Do you know more exactly how much space is left on S3, and when we might get more?
Brian-
We are currently using 61.5 TB for 905023 objects in ncar-cesm-lens bucket, the vast majority of which is for atmosphere data. Ocean data only use 1.2 TB at present.
I don't think there is an automatically-enforced limit to the allocation, so nothing will prevent writing objects after 100 TB. However, we should as a courtesy notify AWS if we plan to go over. They have already said we can use more if needed, within reason.
Is 20.95TB the total for all the new ocean variables, or only for a subset? If subset, can you estimate the total (uncompressed) for all the new vars?
-Jeff
> Is 20.95TB the total for all the new ocean variables, or only for a subset? If subset, can you estimate the total (uncompressed) for all the new vars?
The 20.95 TB is for only 5 variables.
Of the 39 variables listed in https://github.com/NCAR/cesm-lens-aws/issues/34#issuecomment-585409543, we found 38. A back-of-the-envelope calculation shows that their total uncompressed size would be ~170 TB.
Do we have any idea what typical zarr + zlib compression rates are for these datasets? I would not be surprised to see a factor of 2 or more.
@rabernat The one data point I have so far is for atm/monthly/cesmLE-RCP85-TREFHT.zarr: 5.5 GB storage 10.1 GB uncompressed
I will ask Joe & Ana whether we can have up to ~150 TB more. If not, we may need to prioritize.
@rabernat Do you know of any other expected users of these ocean variables? We might need to have some good justification for this >2x allocation increase.
TEMP, UVEL, VVEL, WVEL, SHF, and SFWF would be the bare minimum I think.
Will try to get a sense of other potential users.
I've been using some of the ocean output from the CESM-LE. I've mainly been looking at overturning, heat transport, and surface forcing (i.e., MOC, SHF, UVEL, VVEL, TEMP, SST, SALT). I know there would be a lot of interest in biogeochemical variables, too. I agree it would be nice to have this on AWS data storage!
I would definitely use it if available! SHF, SFWF, UVEL, VVEL, VNS, VNT, TEMP and SALT at least would be helpful. But also TAUX, TAUY, UES, UET and PD would be good too.
I would definitely be keen to look at some biogeochemical variables, like DIC, DOC and O2. The full O2 budget would be dope but I presume that is a lot of data (not exactly sure which terms are needed, but it seems they are usually the ones with ‘_O2’ appended, e.g. VN_O2, UE_O2, etc.). Thanks for pinging me.
@rabernat I'm in the process of creating the Zarr files for TEMP, UVEL, VVEL, WVEL, SHF, and SFWF, just as an initial test. I've discovered in the process that the coordinate dimension describing vertical levels has different names depending on the variable. For example:
UVEL (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 12, 30, 384, 320), meta=np.ndarray>
WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 12, 60, 384, 320), meta=np.ndarray>
The chunk size for UVEL is 30 because we were originally thinking of splitting the 60 vertical levels into two chunks. We could do the same for WVEL; we just need to be careful about using the different coordinate dimension names when we specify chunk sizes.
Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct? Also, is there perhaps a better chunking strategy than what we are considering here?
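One way to cope with the varying vertical dimension name is to detect it programmatically rather than hard-coding it per variable. A sketch with synthetic arrays (the helper name is made up):

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins: UVEL is on cell centers (z_t), WVEL on interfaces (z_w_top).
ds = xr.Dataset({
    "UVEL": (("member_id", "time", "z_t", "nlat", "nlon"),
             np.zeros((1, 2, 60, 4, 5), "float32")),
    "WVEL": (("member_id", "time", "z_w_top", "nlat", "nlon"),
             np.zeros((1, 2, 60, 4, 5), "float32")),
})

def vertical_chunks(da, nz=30):
    """Build a chunk spec that works whatever the vertical dim is called."""
    chunks = {"member_id": 1, "time": 12, "nlat": -1, "nlon": -1}
    vdim = next((d for d in da.dims if d.startswith("z_")), None)
    if vdim is not None:
        chunks[vdim] = nz
    return {d: chunks[d] for d in da.dims}

print(vertical_chunks(ds.UVEL))  # {'member_id': 1, 'time': 12, 'z_t': 30, 'nlat': -1, 'nlon': -1}
print(vertical_chunks(ds.WVEL))  # {'member_id': 1, 'time': 12, 'z_w_top': 30, 'nlat': -1, 'nlon': -1}
```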
@bonnland the different vertical coordinates signify different locations in the level: `z_t` is the center, while `z_w_top` is the top of the level and `z_w_bot` is the bottom of the level. Most variables will be at cell centers, i.e. `z_t`, though some of those are only saved in the top 150m (`z_t_150m`). Note that this last dimension is only 15 levels, rather than the 60 levels comprising the other dimensions.
That's a long-winded way of saying:

> Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct?

Keep them distinct, please.
> Keep them distinct, please
👍. When producing analysis-ready data, we should always think very carefully before changing any of the metadata.
> `z_t` is the center, while `z_w_top` is the top of the level and `z_w_bot` is the bottom of the level.
Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use `nlon` and `nlat` as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?
> Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use `nlon` and `nlat` as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?
I don't know for sure, but I suspect that POP (or an ancestor of POP) originally had `z_w = z_t + 1` and then someone realized that all the variables output on the interface could be sorted into either the "0 at the surface" bucket or the "0 at the ocean floor" bucket, so there was a chance to save some memory in output files by splitting the `z_w` coordinate into `z_w_top` and `z_w_bot` (and at that point, `z_t` and `z_w` were already ingrained in the code so it wasn't worth condensing to `nz`). Meanwhile, the two horizontal coordinate spaces (`TLAT, TLONG` and `ULAT, ULONG`)* always had the same dimensions because of the periodic nature of the horizontal grid. That's pure speculation, though.

* Going even further off topic, the inconsistency that trips me up is trying to remember when I need the "g" in "lon"... going off memory, I'm 80% sure it's `nlon` but `TLONG` and `ULONG`. I see the last two get shortened to `tlon` and `ulon` in random scripts often enough that I need to stop and think about it.
@rabernat I'm finishing up code for processing and publishing the ocean variables. I'd like to see what difference zlib compression makes. Are there any special parameters needed, or just use all defaults for the compression? Do you have an example of specifying this compression choice?
We may want to stick with the default compressor because it appears to be providing a pretty good compression ratio:
In [1]: import zarr
In [2]: zstore = "/glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr"
In [3]: ds = zarr.open_consolidated(zstore)
In [5]: ds["SFWF"].info
Out[5]:
Name : /SFWF
Type : zarr.core.Array
Data type : float32
Shape : (40, 1872, 384, 320)
Chunk shape : (40, 12, 384, 320)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.ConsolidatedMetadataStore
Chunk store type : zarr.storage.DirectoryStore
No. bytes : 36805017600 (34.3G)
Chunks initialized : 156/156
In [7]: !du -h {zstore}
2.5K /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr/time_bound
....
13G /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr
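To answer the zlib question concretely: with xarray, a different compressor can be passed per variable through the `encoding` argument of `to_zarr` (a numcodecs codec under the `"compressor"` key). The achievable ratio is very data-dependent; the stdlib-only sketch below just illustrates the mechanics on a smooth synthetic field, which is artificially repetitive and compresses far better than real SFWF fields would.

```python
import zlib

import numpy as np

# A smooth, repetitive synthetic field; pure noise would barely compress,
# and real geophysical fields sit somewhere in between.
x = np.linspace(0, 4 * np.pi, 320, dtype="float32")
field = np.broadcast_to(np.sin(x), (12, 384, 320)).astype("float32")
raw = field.tobytes()

# Compression ratio at a few zlib levels.
for level in (1, 5, 9):
    compressed = zlib.compress(raw, level)
    print(f"zlib level {level}: {len(raw) / len(compressed):.1f}x")
```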
@rabernat I should be transferring the following to AWS sometime today and tomorrow: TEMP, UVEL, VVEL, WVEL, VNS, VNT, SHF, SFWF. All will cover the CTRL, RCP85, and 20C experiments. @andersy005 should be updating the AWS intake catalog when the transfer is complete.
Actually, it looks like we inadvertently wrote out the Zarr files with incorrect metadata. It is going to take a few more days to re-write and then transfer to AWS.
@bonnland no worries, obviously everything is going slow these days.
Could you explain a bit about how this incorrect metadata arose? Just for the sake of those watching this thread, it would be good to understand the potential pitfalls in producing zarr datasets.
Thanks for working on this, @bonnland and @andersy005 .
Regarding the metadata: Given that it is all in separate text/json files, can those files be replaced with correct versions without regenerating all the binary objects? I ask because (a) although it might seem simpler just to regenerate if the data are still on GLADE, it would be good to have a repair-in-place ability once the data were uploaded, and (b) the zarr metadata is shockingly limited in terms of scientific and geospatial information (not even a link to a separate metadata record) and we should be able to augment that metadata in the future.
> Given that it is all in separate text/json files, can those files be replaced with correct versions without regenerating all the binary objects?
Exactly why I was asking. This is precisely one of the main advantages of zarr. You can fix metadata by editing a simple text file.
> the zarr metadata is shockingly limited in terms of scientific and geospatial information (not even a link to a separate metadata record) and we should be able to augment that metadata in the future.
This can be controlled by the data producer. Here is an example of a zarr record with lots of metadata preserved from the original NetCDF files: https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
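To make the repair-in-place idea concrete: in a zarr v2 directory store, attributes live in small plain-JSON files (`.zattrs`) separate from the binary chunks, so they can be corrected without rewriting any data. A sketch with a hand-built store (for consolidated stores one would also re-run `zarr.consolidate_metadata` afterwards so `.zmetadata` stays in sync):

```python
import json
import os
import tempfile

# A minimal zarr-v2-style layout: metadata is plain JSON on disk,
# separate from the binary chunk objects.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "WVEL"))
attrs_path = os.path.join(root, "WVEL", ".zattrs")
with open(attrs_path, "w") as f:
    json.dump({"units": "centimeter/s"}, f)

# Repair in place: rewrite the small JSON file, leaving chunks untouched.
with open(attrs_path) as f:
    attrs = json.load(f)
attrs["units"] = "cm/s"
attrs["long_name"] = "Vertical Velocity"
with open(attrs_path, "w") as f:
    json.dump(attrs, f)

with open(attrs_path) as f:
    print(json.load(f))
```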
@rabernat wrote:
> This can be controlled by the data producer. Here is an example of a zarr record with lots of metadata preserved from the original NetCDF files: https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
Now that is exemplary! We should be doing the same with our NCAR data.
@bonnland Do you have updated estimate of compressed size of new data? I asked our AWS friends about going above our limit but have not heard back, and want to ping them again with more accurate info if available.
@rabernat @jeffdlb I will try to relay what Anderson has told me; @andersy005 may be needed to verify I am right. The CESM output files were not very careful about which variables were placed in the coordinates section. The concatenation behavior for xarray has been changing over time, so some expert care is needed in examining these coordinate sections to make sure that xarray datasets can be combined when appropriate. This last time, we tried a manual approach to move variables to the appropriate section, but we didn't catch all of the variables. Here is what our failed attempt looked like for TEMP:
In [5]: temp
Out[5]:
<xarray.Dataset>
Dimensions: (d2: 2, lat_aux_grid: 395, member_id: 40, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 1140, transport_comp: 5, transport_reg: 2, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
TLAT (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
TLONG (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
ULAT (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
ULONG (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
* lat_aux_grid (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
* member_id (member_id) int64 1 2 3 4 5 6 ... 101 102 103 104 105
* moc_z (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
* time (time) object 2006-02-01 00:00:00 ... 2101-01-01 00:00:00
* z_t (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
* z_t_150m (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
* z_w_bot (z_w_bot) float32 1000.0 2000.0 ... 549999.06
* z_w_top (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
Here is what the header for the prior-published SALT variable looks like:
In [2]: salt = xr.open_zarr("/glade/campaign/cisl/iowa/lens-aws/ocn/monthly/cesmLE-20C-SALT.zarr", consolidated=True)
In [3]: salt
Out[3]:
<xarray.Dataset>
Dimensions: (d2: 2, member_id: 40, moc_comp: 3, nlat: 384, nlon: 320, time: 1032, transport_comp: 5, transport_reg: 2, z_t: 60, z_w: 60)
Coordinates:
* member_id (member_id) int64 1 2 3 4 5 6 ... 101 102 103 104 105
* time (time) object 1920-02-01 00:00:00 ... 2006-01-01 00:00:00
* z_t (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
* z_w (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
Apparently the capitalized variables in the Coordinates section (TLONG, etc.) should not be there, and it is not a simple matter of just rewriting the metadata section.
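A sketch of the workaround on synthetic data: when members disagree about which grid variables they carry as coordinates, demoting or dropping the non-dimension coordinates before concatenation makes the combine well defined.

```python
import numpy as np
import xarray as xr

# Two synthetic "members": one carries TLAT/TLONG as coordinates, one does not,
# mimicking the inconsistency described above.
def member(i, with_grid):
    ds = xr.Dataset(
        {"TEMP": (("time", "nlat", "nlon"),
                  np.full((2, 3, 4), float(i), dtype="float32"))})
    if with_grid:
        ds = ds.assign_coords(
            TLAT=(("nlat", "nlon"), np.zeros((3, 4))),
            TLONG=(("nlat", "nlon"), np.zeros((3, 4))))
    return ds

a, b = member(1, True), member(2, False)

# Drop the inconsistent non-dimension coordinates, then concatenate.
clean = [ds.reset_coords(drop=True) for ds in (a, b)]
combined = xr.concat(clean, dim="member_id")
print(combined.TEMP.shape)  # (2, 2, 3, 4)
```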
@jeffdlb At first glance, there are discrepancies in the compression rate between the Zarr files that I produced and those that Anderson produced. I am seeing a compression savings of around 50-70% (600 GB on disk for a 1.2 TB Zarr store), and @andersy005 is seeing 600GB on disk for a 2.1 TB Zarr store. We should have a better understanding and answer in the next day or so.
@jeffdlb Each 2D variable (SST, etc) is taking about 35 +/- 5GB of disk space for the three experiments (CTRL, 20C, RCP85). Each 3D variable (TEMP, UVEL, etc) is taking about 1800 +/- 100 GB of disk space for the same three experiments.
Thanks for that estimate. I don't know exactly what list of variables we settled on, however -- how many 2D and 3D?
@jeffdlb We decided on adding six 3D variables (TEMP, UVEL, VVEL, WVEL, VNS, VNT) and two 2D variables (SHF, SFWF). The total disk space needed for these new variables was around 10TB total.
@rabernat The Zarr files have been created and were transferred to AWS over the weekend. When @andersy005 merges #37, you will have the updated catalog. Could you let us know whether you are successful accessing the data, and whether you want to prioritize any other variables in the short term?
@rabernat If you are aware of any need or desire for more variables right now, just reply here. I've tried loading and using the variables recently added on AWS, and it appears to work for me.
@bonnland @andersy005 Thank you for creating and copying Zarr and updating the intake-esm catalog. My documentation page needs to catch up now.
Regarding more data: I thought @rabernat and others had a considerably longer list than just the six 3D variables (TEMP, UVEL, VVEL, WVEL, VNS, VNT) and two 2D variables (SHF, SFWF) mentioned above. AWS has said we can go up to 200TB if necessary.
@jeffdlb OK, it sounds like we have no space shortages after all. I thought maybe we were pushing up against a 100TB limit for all data, but 200TB gives us plenty to work with. I can go about publishing the remaining variables that have been mentioned, and these should fit within the ~120 TB of available space on AWS.
While it feels slightly risky to do this work before verifying the chunking strategy we have used so far, our experience with the Kay et al. notebook suggests that computational performance is good enough.
Hi everyone! Thanks for pushing this forward!
Like many others, I have fallen behind on my projects during the current situation. I am still enthusiastic about working with this data, but it has been hard to find time recently. My apologies.
So should we hold off on copying additional data to AWS?
I don't think I can make that decision for you. Simply stating that, due to the coronavirus pandemic and associated impacts on my time (enormous new child care responsibilities, remote teaching, etc.), I personally won't be able to do much on this until May (post spring semester).
Thanks very much for doing this! I will try to make a start with what's there sometime next week. If it is easy to upload TAUX, TAUY, those would also be helpful to have (though I can start without them if you'd prefer to wait until I've tried it).
@cspencerjones Thanks for offering to check things. It takes a good chunk of CPU hours to produce these files, so I'd feel better knowing there isn't some glitch in what we have so far that makes these data difficult to use.
I will create and upload TAUX and TAUY, hopefully by Tuesday, and I'll respond here when they are ready. It would be great to see if you can use them successfully before creating more Zarr files.
I did find some time to simply open up some data. Overall it looks good! Thanks for making this happen. Based on this quick look, I do have some feedback.
Let's consider, for example, `WVEL`, the vertical velocity:
import s3fs
import xarray as xr
fs = s3fs.S3FileSystem(anon=True)
s3_path = 's3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'
ds = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
ds
Which gives the following long output:
Based on this, I have two suggestions.
First, all of the variables other than `WVEL` itself (i.e. those without a time dimension) should be coordinates, not data variables. This is easily accomplished with the following code:
coord_vars = [vname for vname in ds.data_vars if 'time' not in ds[vname].dims]
ds_fixed = ds.set_coords(coord_vars)
It should be possible to fix this issue just by rewriting the zarr metadata, rather than re-outputting the whole dataset.
Second, the chunk structure of `WVEL` (and presumably other 3D variables) is, in my view, less than ideal:
WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 480, 1, 384, 320), meta=np.ndarray>
First, the chunks are on the large side (235.93 MB). Second, each vertical level is in a separate chunk, while 20 years of time are stored contiguously. If I want to get a complete 3D field for a single timestep, I therefore have to download over 14 GB of data. I recognize that the choice of chunks is subjective and depends on the use case. However, based on my experience working with ocean model output, I think the most common use case is to access all vertical levels in a single contiguous chunk. (This corresponds to how netCDF files are commonly output and is what people are used to.) I would recommend instead using chunks `ds.WVEL.chunk({'time': 6, 'z_w_top': -1, 'nlon': -1, 'nlat': -1})`, which would produce ~175 MB chunks.
I hope this feedback is useful.
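The suggested layout can be sketched on a synthetic WVEL (horizontal and time sizes are shrunk here, so the resulting chunk shapes are only illustrative):

```python
import numpy as np
import xarray as xr

# Synthetic WVEL with the CESM-LENS dimension order, sizes shrunk
# except the 60-level vertical.
ds = xr.Dataset({"WVEL": (("member_id", "time", "z_w_top", "nlat", "nlon"),
                          np.zeros((1, 24, 60, 4, 5), "float32"))})

# One contiguous vertical column per chunk, six months of time per chunk.
rechunked = ds.chunk(
    {"member_id": 1, "time": 6, "z_w_top": -1, "nlat": -1, "nlon": -1})
print(rechunked.WVEL.data.chunksize)  # (1, 6, 60, 4, 5)
# rechunked.to_zarr("cesmLE-CTRL-WVEL.zarr", mode="w")  # then copy to S3
```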
@rabernat That is helpful feedback, and worth talking about IMHO, thank you. I will move forward with the chunking you suggest if I don't hear any objections in the next day or so.
This issue of which variables should be coordinates has come up before in discussions with @andersy005. Across variables, and possibly across ensemble members, these extra variables differ in the original NetCDF files (for example, ULAT, ULONG, TLAT, TLONG are missing in some cases). The differences can apparently prevent concatenation into Xarray objects from working properly. I'm not as clear as Anderson on the potential problems. At any rate, it's good that we can address the metadata later if needed. It means I can move forward with creating these variables now.
> The differences can apparently prevent concatenation into Xarray objects from working properly. I'm not as clear as Anderson on the potential problems
I can see how this could cause problems. However, I personally prefer to have all that stuff as coordinates. It's easy enough to just `.reset_coords(drop=True)` before any merge / alignment operations.

An even better option is to just drop all of the non-dimension coordinates before writing the zarr data, and then save them to a standalone `grid` dataset, which can be brought in as needed for geometric calculations. That's what we did, for example, with the MITgcm LLC4320 dataset. You're currently wasting a non-negligible amount of space by storing all of these duplicate `TAREA` etc. variables in each of the ocean datasets.
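The grid-splitting idea might look like this on synthetic data (the store names in the comments are made up):

```python
import numpy as np
import xarray as xr

# A dataset carrying grid variables (TAREA, TLAT, TLONG) alongside the field.
ds = xr.Dataset(
    {"SFWF": (("time", "nlat", "nlon"), np.zeros((2, 3, 4), "float32"))},
    coords={"TAREA": (("nlat", "nlon"), np.ones((3, 4))),
            "TLAT": (("nlat", "nlon"), np.zeros((3, 4))),
            "TLONG": (("nlat", "nlon"), np.zeros((3, 4)))})

# Split: the grid is written once to its own store; each variable store stays lean.
grid = ds.reset_coords()[["TAREA", "TLAT", "TLONG"]]
lean = ds.reset_coords(drop=True)
print(list(grid.data_vars), list(lean.data_vars))
# grid.to_zarr("pop_grid.zarr"); lean.to_zarr("cesmLE-20C-SFWF.zarr")
```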
I started to look at the LENS AWS data and discovered there is very little available: only 3 variables, SALT (3D), SSH (2D), and SST (2D).
At minimum, I would also like to have THETA (3D), UVEL (3D), VVEL (3D), and WVEL (3D), and all the surface fluxes of heat and freshwater. Beyond that, it would be ideal to also have the necessary variables to reconstruct the tracer and momentum budgets.
Are there plans to add more data?