Update package data - Githubissues

rplzzz commented 5 years ago

We need to make sure that all of the package data has been updated with the latest ESM data that we acquired. It looks like cmip_individual and esm_comparison were both updated early in September, but I'm not sure whether that was before or after the last new data we got, so it wouldn't hurt to check that everything made its way in there. The other data sets appear not to have been updated recently.

The data sets we need to check on are:

[x] cmip_individual
[x] esm_comparison
[ ] cmip_conc_pcproj
[ ] cmip_emiss_pcproj
[ ] conc_pc_comparison
[ ] emiss_pc_comparison

rplzzz commented 5 years ago

@kdorheim What is the story with the units column in the newest version of cmip_individual? It looks bogus. Any reason not to drop it before saving into the package data?

kdorheim commented 5 years ago

@rplzzz Hi I am in the Denver air port at the moment, I'll take a look at this when I get home. Can we meet tomorrow to talk about how to divide up the tasks you need for the mcmc?

rplzzz commented 4 years ago

Ok, the package data for cmip_individual and esm_comparison doesn't match what is produced by data-raw/ESM_processing_code/C.esm-comparison-data.R. The most puzzling thing about it is that one of the discrepancies is that cmip_individual has a "units" column:

> names(hectorcal::cmip_individual)
[1] "year"         "units"        "model"        "variable"     "experiment"   "ensemble"    
[7] "value"        "esmbaseline"  "concbaseline"

However, there has never been a "units" column in the input data (the units string is read in in a column called "unit"), and in any case we've been dropping the unit column since at least January: https://github.com/kdorheim/hectorcal/blame/master/data-raw/ESM_processing_code/C.esm-comparison-data.R#L173-L177

At this point I think I'm going to rerun the data scripts and update the package data to use whatever comes out. @kdorheim, if you have time, could you compare the output of the data scripts for these two datasets to what you have on your local system and see if they are identical? If so, then we can write this off as a git mixup.

kdorheim commented 4 years ago

@rplzzz Can do what is your time frame for this so I can triage this task accordingly?

rplzzz commented 4 years ago

At your convenience. I'm going to proceed under the assumption that the scripts in data-raw are producing the correct data. If you find that's not the case, let me know.

kdorheim commented 4 years ago

Can do and will do! I should have time to look into this the second half of this week.

rplzzz commented 4 years ago

Ok, I guess I'm confused here. What I thought I remembered was that when we went out and got the additional CMIP data, we found that some of the new models we added were missing CO₂ data toward the end of the historical period. However, when I take the intersection of all of the available years by scenario and variable, I'm not seeing any missing data.

Update: it was one model. Found the discussion here: https://github.com/kdorheim/hectorcal/pull/26#issuecomment-530354455 The MRI-ESM1 model doesn't appear to have made it into the new package data:

> filter(cmip_individual, experiment == 'esmHistorical', model=='MRI-ESM1')
# A tibble: 0 x 8
# … with 8 variables: year <dbl>, model <chr>, ensemble <chr>, variable <chr>, experiment <chr>,
#   value <dbl>, esmbaseline <dbl>, concbaseline <dbl>

@kdorheim If you have a chance, could you make sure that the package data contains all of the new models you downloaded? You can add it to the draft PR that I have open.

kdorheim commented 4 years ago

@rplzzz can do! Can you share the version of the the files that you have on your local? And the date they are from?

rplzzz commented 4 years ago

I'm working from commit 18916e1.

The last change to the ESM data appears to have been on 18 Oct:

commit 29d4cca
Author: Kalyn <kalyn.dorheim@pnnl.gov>
Date:   Fri Oct 18 20:32:58 2019 -0400

    clean up

M       data-raw/CMIP5_annual_global_average.csv

If there are some particular files you need a status for, I can give you the lowdown on those as well. I'm working from the versions checked into the repository, so you should also be able to find some information in the logs.

kdorheim commented 4 years ago

UPDATE: after looking into this I have uncovered some errors in my ESM_processing_code 🙄

kdorheim commented 4 years ago

This is turning out to be quite the nightmare, I think whenever I do this sort of this I am going to implement drake

rplzzz commented 4 years ago

Sorry to hear that; I've been there more than a few times myself. Let me know if there is anything I can do to help.

kdorheim commented 4 years ago

Okay we have some how both lost and gained data 🤔

Here is a kable of the data we "lost" in some cases we only lost data in a few years.

model	variable	experiment	years_missing
CCSM4	heatflux	historical	7
CESM1-BGC	tas	rcp45	95
GFDL-CM3	tas	rcp60	90
GFDL-ESM2G	co2	rcp60	90
GFDL-ESM2G	tas	rcp60	90
GISS-E2-H	tas	rcp60	50
GISS-E2-R	tas	rcp60	75
HadCM3	tas	rcp45	30
HadGEM2-AO	tas	historical	146
HadGEM2-AO	tas	rcp26	95
HadGEM2-AO	tas	rcp45	95
HadGEM2-AO	tas	rcp60	95
HadGEM2-AO	tas	rcp85	95
MIROC-ESM-CHEM	heatflux	rcp45	1
MRI-CGCM3	heatflux	historical	5
NorESM1-ME	tas	rcp60	50

kdorheim commented 4 years ago

While here is a kable of the data we gained.

model	variable	experiment
CanESM2	heatflux	esmHistorical
CanESM2	heatflux	esmrcp85
CESM1-BGC	heatflux	esmHistorical
CESM1-BGC	heatflux	esmrcp85
CNRM-CM5	tas	historical
CNRM-CM5	tas	rcp85
CSIRO-Mk3-6-0	tas	historical
CSIRO-Mk3-6-0	tas	rcp26
CSIRO-Mk3-6-0	tas	rcp45
CSIRO-Mk3-6-0	tas	rcp60
CSIRO-Mk3-6-0	tas	rcp85
CSIRO-Mk3L-1-2	heatflux	historical
EC-EARTH	tas	rcp26
EC-EARTH	tas	rcp45
EC-EARTH	tas	rcp85
MIROC-ESM	co2	esmHistorical
MIROC-ESM	co2	esmrcp85
MIROC-ESM	heatflux	esmHistorical
MIROC-ESM	heatflux	esmrcp85
MIROC-ESM	tas	esmHistorical
MIROC-ESM	tas	esmrcp85
MPI-ESM-LR	co2	esmHistorical
MPI-ESM-LR	co2	esmrcp85
MPI-ESM-LR	tas	esmHistorical
MPI-ESM-LR	tas	esmrcp85
MRI-ESM1	co2	esmHistorical
MRI-ESM1	co2	esmrcp85
MRI-ESM1	heatflux	esmHistorical
MRI-ESM1	heatflux	esmrcp85
MRI-ESM1	tas	esmHistorical
MRI-ESM1	tas	esmrcp85
NorESM1-ME	co2	esmHistorical
NorESM1-ME	co2	esmrcp85
NorESM1-ME	heatflux	esmHistorical
NorESM1-ME	heatflux	esmrcp85
NorESM1-ME	tas	esmHistorical
NorESM1-ME	tas	esmrcp85

kdorheim commented 4 years ago

Okay after much confusion I think I have arrived at our final data.

kdorheim commented 4 years ago

The comparison data I have now does not change the concentration max and min values but will change the concentration driven mean/median. The comparison data committed in eae1d2a did not have na.rm = TRUE when calculating the multi-model mean and median and so and which caused problems with the mean temperature data

kdorheim commented 4 years ago

See branch krd_issue27

kdorheim commented 4 years ago

Now we have emissions driven data from the following models

 "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM"

So instead of only 3 models we have a total of 7 models!

If we compare what the old vs new emissions values are we see that the max values do not change as much as the min values especially for co2, heatflux, and tas.

rplzzz commented 4 years ago

See branch krd_issue27

Currently that branch is showing as even with master. Do you still have more to do before you push your changes?

kdorheim commented 4 years ago

:big_sigh: I don't exactly understand what is going on. Looks like git command line vs git r studio gui are telling me different things.

kdorheim commented 4 years ago

@rplzzz Okay git issue resolved sort of, long story, but the branch krd_issue27 actually has the commits pushed to the remote branch.

rplzzz commented 4 years ago

Honestly, I find all the GUIs to be more confusing than helpful. With the command line tools I know what it is that I asked the system to do, and I can verify that it did it.

I'll take a look at the new data today or tomorrow, and if it looks like we have everything we need, then we can close this issue out.

kdorheim commented 4 years ago

Awesome!

rplzzz commented 4 years ago

Now we have emissions driven data from the following models
 "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM" 
So instead of only 3 models we have a total of 7 models!

However, the package data seems to have only 5, and they're not a subset of the 7 you have above:

> filter(cmip_individual, experiment=='esmrcp85') -> esmrcp85
> unique(esmhist$model)
[1] "bcc-csm1-1-m" "CanESM2"      "CESM1-BGC"    "GFDL-ESM2G"   "inmcm4"

This is with the latest of the krd_issue27 branch.

Possibly the rda files in data were stale when you committed them. Can you double check and commit the up-to-date versions if necessary?

kdorheim commented 4 years ago

I re-ran the script that generates the comparison data and pushed the results.

> cmip_individual %>% filter(grepl('esm', experiment)) %>%  pull(model) %>%  unique()
[1] "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM"

kdorheim commented 4 years ago

As we discussed on slack there were some issues with the comparison data among them being multiple entries for for a single model / year / ensemble member.

They were being caused for multiple reasons

The code was pulling from old intermediate netcdf files that had not been properly cleaned up
Some of the cdo code was processing daily data and monthly data
Some of the co2 files had duplicate entries in them for the first year of a netcdf file and because some modeling groups save output data in multiple netcdf files we'd see these duplicated in the " middle" of the experiment, when that happened I removed the duplicate entires and and replaced with interpolated values. I wouldn't do this with temperature of heat flux data because there is so much inter-annual variability but I feel justified we can make this assumption with CO2, if you disagree please let me know.

Here are the models we have the emission driven data for

  model     
  <chr>     
1 CanESM2   
2 CESM1-BGC 
3 GFDL-ESM2G
4 GFDL-ESM2M
5 MIROC-ESM 
6 MRI-ESM1  
7 NorESM1-ME

Now when I check to see if there are any duplicates there are none! yay! (also I've added some assertthat checks to make sure that this doesn't happen again)

> cmip_individual %>% filter(grepl('esm', experiment)) %>% group_by(experiment, variable, model, year, ensemble) %>% summarise(year_count = n()) %>% ungroup %>% filter(year_count != 1) 
# A tibble: 0 x 6
# … with 6 variables: experiment <chr>, variable <chr>, model <chr>, year <dbl>, ensemble <chr>,
#   year_count <int>

This plot compares the min/max ESM range with the individual cmip model output. The thick dark black line is the multi-model mean from esm_comparison.

The min CO2 range is from MRI-ESM1 which yes I know is low but the results have been published else where including Hartin et al. 2015.

kdorheim commented 4 years ago

When you get a chance to look at this please let me know what you think.

rplzzz commented 4 years ago

This plot compares the min/max ESM range with the individual cmip model output. The thick dark black line is the multi-model mean from esm_comparison.

This definitely looks better. One thing I noticed is there seems to be some kind of anomaly in the CO₂ data at approximately 2030. It looks like CESM1-BGC has an unusually low value that year. Since it's the highest, that affects the max envelope, and the drop is large enough to produce a noticeable effect on the multi-model mean.

I'll take a closer look, and if it's only one bad value, I think we should insert a special case to just interpolate across the gap for that model. Then I should be able to update the other datasets listed above, and we can finally rerun the calcs for the paper.

The min CO2 range is from MRI-ESM1 which yes I know is low but the results have been published else where including Hartin et al. 2015.

Sounds reasonable to me. If the rest of the community is happy with it, then who are we to dispute them?

claudiatebaldi / hectorcal

Update package data #27