NCAR / ADF

A unified collection of python scripts used to generate standard plots from CAM outputs.
Creative Commons Attribution 4.0 International

Wishlist of obs datasets #105

Open cecilehannay opened 2 years ago

cecilehannay commented 2 years ago

Adding observations to the ADF.

What is this new feature?

Jesse is adding model-to-obs comparisons. In this issue, we are collecting the list of datasets and variables we want to prioritize so that we can evaluate the model simulations.

- SWCF
- LWCF
- FLNT
- FSNT
- PSL
  - Dataset 1
  - Dataset 2
- SST
  - We compare at different periods:
    - pre-industrial
    - present day
- PRECT
  - Global
  - Tropical
- PREH2O
- TGCLDLWP
- SURF STRESS (TAUX and TAUY)

Next Steps

cecilehannay commented 2 years ago

Next Steps

Would love feedback on next steps (@JulioTBacmeister, @swrneale, @andrewgettelman)

andrewgettelman commented 2 years ago

The list looks good.

For cloud 'microphysics' (CLOUD, TGCLDLWP, TAU, REFF) we probably want to be using the COSP MODIS simulator and associated observations, especially for cloud fraction. We do have this data from the existing diagnostics; we just have to compare to different fields.

We could also add U, V, T, and probably compare to ERA-Interim (or ERA5). We should have this data already.

brianpm commented 2 years ago

I think this list is a good starting point. I agree with Andrew about using COSP diagnostics for some of this, including cloud cover (e.g., we should not compare CLDLOW to satellite "low cloud" products). Maybe these can be added incrementally?

While we are in transition to handling a larger and more diverse group of observations with intake-esm (https://github.com/NCAR/ADF/issues/102 , https://github.com/NCAR/ADF/issues/25), a stop-gap solution might be to make a combined, homogenized dataset out of this list. By this I just mean: take this handful of datasets, remap them to the FV1° grid (or whatever grid we want to use for evaluation for the rest of 2022), rename them to the corresponding CAM variables, put a comment or similar into each one's metadata to indicate the source and time span, and put them all into climo files that match the ADF convention. That might make it easy to handle them much the same way we handle CAM cases within the ADF scripts. Maybe @nusbaume would disagree, though?
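The homogenization step described here (remap, rename to the CAM variable name, record provenance, write an ADF-style climo file) might look roughly like this in xarray. All names, attributes, and the obs variable below are illustrative stand-ins, not an actual ADF convention:

```python
# Sketch of the homogenization step: rename an obs variable to its CAM
# counterpart, note the source and time span in the metadata, and write a
# climo-style file. All names here are illustrative, not an ADF convention.
import numpy as np
import xarray as xr

# Stand-in for a remapped 1-degree monthly climatology from some obs product
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(0.5, 360.0, 1.0)
ds = xr.Dataset(
    {
        "toa_sw_cre": (  # hypothetical source variable name
            ("time", "lat", "lon"),
            np.zeros((12, lat.size, lon.size), dtype=np.float32),
        )
    },
    coords={"time": np.arange(1, 13), "lat": lat, "lon": lon},
)

# Rename to the corresponding CAM variable and record provenance
ds = ds.rename({"toa_sw_cre": "SWCF"})
ds["SWCF"].attrs["units"] = "W/m2"
ds["SWCF"].attrs["source"] = "CERES EBAF Ed4.1 (example provenance note)"
ds["SWCF"].attrs["time_span"] = "2000-01 to 2020-12"
# ds.to_netcdf("SWCF_climo.nc")  # write out in the ADF climo-file convention
```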

In the medium-term, I would suggest that we need to update a lot of these datasets. In some cases because there are new (hopefully improved) products (e.g., CERES Ed2.x -> CERES Ed4.1, ERAI -> ERA5). In other cases, just because the observations now cover a longer time period. And I totally agree with next step numbers 2 & 3, finding a permanent place and getting good metadata is crucial.

bitterbark commented 2 years ago

Should one of the highest priorities be making it easy to swap in a new data set? In the short term, having an easy-to-find list of the files to use, one that a user could change and that the ADF reads to determine which file to load, should make that a lot easier. Are we thinking of putting this in the variable attributes information? If so, that would work too.

That unfortunately argues against a combined data set, which would have to be remade every time any one of its components changes. Although I think having an already-regridded version of each file would be valuable.

andrewgettelman commented 2 years ago

I second Dani's comment: better if it is incremental.

I think all that would be needed is variable attributes for `obs_file_name`, `obs_var_name`, and `obs_scl`.

Then as long as there were lat, lon, and pressure (if needed) coordinates plus a reasonable monthly time coordinate, I think the plotting codes could figure out how to load the observations for any variable and get them into the right units. Flexible, easily changed, some manual intervention required, but I think that's fine. Also iterative.
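The three proposed attributes could drive a small generic loader along these lines; `load_obs` and its argument handling are hypothetical, a sketch of the idea rather than real ADF code:

```python
# Hypothetical sketch: each variable carries attributes naming the obs file,
# the variable inside it, and a scale factor to reach model units. The
# attribute names come from the comment above; load_obs is not real ADF code.
import xarray as xr

def load_obs(var_attrs, open_dataset=xr.open_dataset):
    """Load one observational variable from its per-variable attributes.

    open_dataset is injectable so the sketch can be exercised without a
    real file on disk.
    """
    ds = open_dataset(var_attrs["obs_file_name"])
    da = ds[var_attrs["obs_var_name"]]
    # Rescale to the model's units (default: no scaling)
    return da * var_attrs.get("obs_scl", 1.0)
```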

swrneale commented 2 years ago

@cecilehannay I presume this is for ANN, DJF→SON, and JAN→DEC climo datasets? I think I agree with Brian that we should grab updates to the datasets and put them mostly in the existing format, so that Cecile could reasonably transition to the ADF for CAM5 dev. simulations.

Do we want to make any attempt to overlap observational periods, or should we just grab the longest periods we can each time?

In terms of observational fields, we should consult https://climatedataguide.ucar.edu/ before we do some heavy lifting ourselves. A few things are missing: surface latent and sensible heat fluxes (not analysis-based) and precipitable water (NVAP, available 1988-2009: https://asdc.larc.nasa.gov/project/NVAP-M/NVAP_CLIMATE_Total-Precipitable-Water_1)

Rich

nusbaume commented 2 years ago

In case this helps, my general long-term plan for dealing with observations currently looks something like this:

  1. There is an "official" Intake-ESM catalog that contains all of the (gridded) observational datasets that we want to compare the model against by default. I believe that, if set up properly, this catalog can manage any sort of temporal resolution and spatial grid, but there will likely need to be fairly strict meta-data requirements for those observational files in order for the ADF to properly search for them in the catalog. In general, too, I imagine a dataset wouldn't be added to this catalog until it was "blessed" by someone at AMP.

  2. The ADF will also support the ability for the user to specify their own, non-official observational datasets, which can be done via the current variable meta-data YAML file. I am happy if we want to require certain observational file features and meta-data, but I was planning to have the ADF basically accept almost anything and try its best to match the model data to that observational file. This would allow a user to easily add their own observations while protecting the "official" observational data most users will want.

For the short-term I am basically going to implement option 2. Then once that is working I can bring in the infrastructure needed for Intake-ESM, while at the same time we collectively agree on and update the observational datasets we want. At that point I can create the "official" catalog, and we should then have both options available.

andrewgettelman commented 2 years ago

Good plan. I like starting with option 2 and keeping it simple for now, so we can get going with existing data sets, and allow easy extensibility and minimal overhead. When option 1 is online, we can start to migrate obs over. But that should happen later. Thanks!

nusbaume commented 2 years ago

Hi All,

I just wanted to notify everyone possibly watching this thread that I have recently implemented model vs obs comparisons using the variable defaults file to specify what observational data set to use. You can currently specify the observational file to use (either as just a file name or as a full path if it is located somewhere unique), the name of the observational data set (which will eventually be plugged into plot titles, webpages, etc.), and the name of the variable on the observational file that you want to use (so multiple observational variables can be located in a single file).

Currently the observations can be on any structured lat/lon grid you want, and the only new meta-data requirement is that the observations variable must have a units attribute (which is probably a good idea in general).

In terms of missing features, there is currently no way to deal with 3-D observational data, and the ADF assumes that the observations themselves are monthly climatologies with a time dimension of length 12 (one for each month). Example files can be found on Cheyenne/Casper here (with credit going to @brianpm for the data files themselves):

/glade/work/nusbaume/SE_projects/model_diagnostics/ADF_obs

Of course I am hoping to remove all of these restrictions eventually, so if you have an observational data set you want to use that has a vertical dimension, or that has a different time dimension (e.g. seasonal or daily values) please let me know and I'll help with adding the necessary ADF functionality.
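The two requirements stated above (a `units` attribute, and a 12-month time dimension) can be checked in a few lines of xarray; `check_obs_climo` is a hypothetical helper for illustration, not part of the ADF:

```python
# Minimal checks for the two stated requirements: the obs variable carries a
# "units" attribute, and the data form a 12-month climatology. The helper
# check_obs_climo is hypothetical, not part of the ADF.
import numpy as np
import xarray as xr

def check_obs_climo(da):
    """Return a list of problems with an obs climatology variable."""
    problems = []
    if "units" not in da.attrs:
        problems.append("missing 'units' attribute")
    if "time" not in da.dims or da.sizes["time"] != 12:
        problems.append("expected a 'time' dimension of length 12")
    return problems

# A conforming variable and a non-conforming one, for demonstration
good = xr.DataArray(
    np.zeros((12, 4, 8)), dims=("time", "lat", "lon"), attrs={"units": "K"}
)
bad = xr.DataArray(np.zeros((6, 4, 8)), dims=("time", "lat", "lon"))
```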

Thanks, and have a good weekend!

cecilehannay commented 2 years ago

@nusbaume: I am a bit confused how to run versus obs.

cecilehannay commented 2 years ago

From @nusbaume: set `compare_obs` to `true` in your config file.
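As a minimal sketch, the change in the ADF run-time config YAML would be the single flag below; only `compare_obs` itself comes from this thread, and the rest of your config file stays as it is:

```yaml
# ADF run-time config: switch from model-vs-model to model-vs-obs comparisons.
compare_obs: true
```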

andrewgettelman commented 2 years ago

I've done the radiation fields from CERES and some from ERAI. It's pretty simple to do this using `lib/adf_variable_defaults.yaml`.

I have also added the ability to scale and change units for the data sets (observations and variable independently).

Should be easy to finish this off, and would be a good easy hackathon project....
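The exact keys in `lib/adf_variable_defaults.yaml` are not quoted in this thread, so the entry below is illustrative only; it sketches the kind of per-variable obs information discussed here (obs file, obs variable name, and independent scaling for obs and model), with hypothetical key and file names:

```yaml
# Illustrative entry; the real key names in lib/adf_variable_defaults.yaml
# may differ. This only sketches the obs file / obs variable / scaling idea.
SWCF:
  obs_file: "CERES_EBAF_Ed4.1_climo.nc"  # hypothetical obs file name
  obs_name: "toa_cre_sw_mon"             # variable name inside the obs file
  obs_scale_factor: 1.0                  # rescale obs to the model's units
  scale_factor: 1.0                      # model-side scaling, if any
```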

andrewgettelman commented 2 years ago

Adding some notes from @chengzhuzhang (Jill) at LLNL on the E3SM diagnostics and observations:

Thank you for your feedback! I looked a bit into the provenance of the AODVIS dataset: this observational composite was derived from the MACv1 (Max-Planck-Institute Aerosol Climatology) dataset from MPI. https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jame.20035

In a recent effort to bring in more aerosol diagnostics, I found that a more recent version, MACv2, is available (https://www.tandfonline.com/doi/full/10.1080/16000889.2019.1623639), but I haven't had a chance to integrate the new dataset yet.

All of our processed datasets are publicly available at https://web.lcrc.anl.gov/public/e3sm/diagnostics/observations/Atm/, which contains four subdirectories that supply data to E3SM diags:

/climatology # seasonal and annual mean climatologies

/time-series # time-series datasets, where available

/arm-diags-data # time series derived from ARM facilities

/tc-analysis # tropical cyclone track datasets

The AOD dataset is here: https://web.lcrc.anl.gov/public/e3sm/diagnostics/observations/Atm/climatology/AOD_550/

And as you pointed out, some datasets were grabbed from AMWG, including the CRU, SSMI, and COSP-OBS datasets. I only assembled the original time series from the individual simulators, so it would be great if we could get some help or work together on the COSP-OBS!

Thanks,

Jill

brianpm commented 2 years ago

FWIW, I have processed the latest MODIS data into a first draft of a climo file for use in ADF. This is based on initial processing by @jshaw35. The file is here: /glade/work/brianpm/observations/MODIS/climo/MCD06COSP_M3_MODIS.climo.200301-202012.nc

The variables are: CLMODIS (the histogram), CLTMODIS, CLHMODIS, CLMMODIS, CLLMODIS, CLDTHCK_MODIS (high, optically thick clouds), and cloud_mask (which won't be used in general).

Averaging interval is 2003-2020.

UNTESTED, so if you see anything weird, just let me know and I can re-process.

chengzhuzhang commented 2 years ago

@brianpm This is great! You beat me to it. I'd be happy to test the datasets. My NCAR computer account expired about two years ago, so I'm submitting a new account request to get the data from GLADE.

brianpm commented 2 years ago

Okay, another first draft climatology, this time from CALIPSO GOCCP.

Original data from https://climserv.ipsl.polytechnique.fr/cfmip-obs/Calipso_goccp.html

I took the monthly data for cloud cover maps, cloud phase maps, and cloud fraction profiles and processed them to the monthly climatology. Simple averaging with nothing else going on. It is worth noting that no correction/adjustment is made for the South Atlantic Anomaly, so caution is advised. Global averages should probably mask out the affected region.

Files on glade:

/glade/work/brianpm/observations/clcalipso/climo/3D_CloudFraction330m_200606-202012_climo_CFMIP2_sat_3.1.2.nc
/glade/work/brianpm/observations/clcalipso/climo/MapLowMidHigh330m_200606-202012_climo_CFMIP2_sat_3.1.2.nc
/glade/work/brianpm/observations/clcalipso/climo/MapLowMidHigh_Phase330m_200606-202012_climo_CFMIP2_sat_3.1.2.nc
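The "simple averaging with nothing else going on" step can be sketched in xarray; the data below are a synthetic stand-in for the monthly GOCCP files, and no South Atlantic Anomaly masking is applied:

```python
# Sketch of building a monthly climatology from a monthly time series by
# simple averaging, as described above. The data are synthetic stand-ins.
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly time series covering June 2006 through May 2008
time = pd.date_range("2006-06-01", "2008-05-01", freq="MS")  # 24 months
da = xr.DataArray(
    np.arange(time.size * 4, dtype=float).reshape(time.size, 2, 2),
    coords={"time": time},
    dims=("time", "lat", "lon"),
)

# Mean over all years, month by month, giving a 12-entry climatology
climo = da.groupby("time.month").mean("time")
```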

chengzhuzhang commented 2 years ago

MACv2 (Max-Planck-Institute Aerosol Climatology) is now processed. ref: https://www.tandfonline.com/doi/full/10.1080/16000889.2019.1623639

/glade/work/chengzhu/analysis_data_e3sm_diags/MACv2/climatology

I extracted AOD at 550 nm; more aerosol optical properties are available if anyone is interested, under MACv2/original_full_set. Limited testing has been done; for AOD at 550 nm, the global mean matches the values in the paper.

I am not sure if the AMWG-diags-style monthly and seasonal mean climo files are still useful for the ADF, but I think they should be easy to convert to the 12-month climo format. My processing script is under MACv2/scripts.

brianpm commented 2 years ago

ISCCP climatology. This is updated to use the ISCCP-H series data. The CTP-TAU histograms are not available in the monthly "basic" files. @Isaaciwd produced monthly mean files from the 3-hourly files. We validated that the values are close to the available monthly means; there is a slight discrepancy that appears to be attributable to a difference in the order of operations, i.e., when the remapping from the equal-area to the equal-angle grid occurs. I took the derived monthly files and made this climo file:

/glade/work/brianpm/observations/isccp/climo/ISCCP-Basic.HGG.GLOBAL.10KM.climo.198307-201706.nc

This file contains:

These were renamed to match CAM's COSP outputs, but not much else was checked for consistency with CAM.

chengzhuzhang commented 2 years ago

@brianpm thank you for the update. It is great that the coordinates/variable names are reformatted to match CAM. I'm wondering if you could share the script; I'm thinking of generating processing scripts that can write out two versions of the data (the new ADF version and the AMWG seasonal-mean version). Thank you!

brianpm commented 2 years ago

@chengzhuzhang -- Yes, I can definitely share the script. I noticed that the CLDTOT_ISCCP values are higher than I expected. I'm going to try to check on that today (mainly I can't remember if I am supposed to skip the lowest optical depth bin). I'll confirm and remake the file if needed. I can put my script somewhere (maybe just in one of my github repos). I will try to get Isaac's script for doing the 3hr-to-monthly calculation, too.