claudiatebaldi / hectorcal

3 stars 3 forks source link

Update package data #27

Closed rplzzz closed 4 years ago

rplzzz commented 5 years ago

We need to make sure that all of the package data has been updated with the latest ESM data that we acquired. It looks like cmip_individual and esm_comparison were both updated early in September, but I'm not sure whether that was before or after the last new data we got, so it wouldn't hurt to check that everything made its way in there. The other data sets appear not to have been updated recently.

The data sets we need to check on are:

rplzzz commented 5 years ago

@kdorheim What is the story with the units column in the newest version of cmip_individual? It looks bogus. Any reason not to drop it before saving into the package data?

kdorheim commented 5 years ago

@rplzzz Hi I am in the Denver air port at the moment, I'll take a look at this when I get home. Can we meet tomorrow to talk about how to divide up the tasks you need for the mcmc?

rplzzz commented 4 years ago

Ok, the package data for cmip_individual and esm_comparison doesn't match what is produced by data-raw/ESM_processing_code/C.esm-comparison-data.R. The most puzzling thing about it is that one of the discrepancies is that cmip_individual has a "units" column:

> names(hectorcal::cmip_individual)
[1] "year"         "units"        "model"        "variable"     "experiment"   "ensemble"    
[7] "value"        "esmbaseline"  "concbaseline"

However, there has never been a "units" column in the input data (the units string is read in in a column called "unit"), and in any case we've been dropping the unit column since at least January: https://github.com/kdorheim/hectorcal/blame/master/data-raw/ESM_processing_code/C.esm-comparison-data.R#L173-L177

At this point I think I'm going to rerun the data scripts and update the package data to use whatever comes out. @kdorheim, if you have time, could you compare the output of the data scripts for these two datasets to what you have on your local system and see if they are identical? If so, then we can write this off as a git mixup.

kdorheim commented 4 years ago

@rplzzz Can do what is your time frame for this so I can triage this task accordingly?

rplzzz commented 4 years ago

At your convenience. I'm going to proceed under the assumption that the scripts in data-raw are producing the correct data. If you find that's not the case, let me know.

kdorheim commented 4 years ago

Can do and will do! I should have time to look into this the second half of this week.

rplzzz commented 4 years ago

Ok, I guess I'm confused here. What I thought I remembered was that when we went out and got the additional CMIP data, we found that some of the new models we added were missing CO2 data toward the end of the historical period. However, when I take the intersection of all of the available years by scenario and variable, I'm not seeing any missing data.

Update: it was one model. Found the discussion here: https://github.com/kdorheim/hectorcal/pull/26#issuecomment-530354455 The MRI-ESM1 model doesn't appear to have made it into the new package data:

> filter(cmip_individual, experiment == 'esmHistorical', model=='MRI-ESM1')
# A tibble: 0 x 8
# … with 8 variables: year <dbl>, model <chr>, ensemble <chr>, variable <chr>, experiment <chr>,
#   value <dbl>, esmbaseline <dbl>, concbaseline <dbl>

@kdorheim If you have a chance, could you make sure that the package data contains all of the new models you downloaded? You can add it to the draft PR that I have open.

kdorheim commented 4 years ago

@rplzzz can do! Can you share the version of the the files that you have on your local? And the date they are from?

rplzzz commented 4 years ago

I'm working from commit 18916e1.

The last change to the ESM data appears to have been on 18 Oct:

commit 29d4cca
Author: Kalyn <kalyn.dorheim@pnnl.gov>
Date:   Fri Oct 18 20:32:58 2019 -0400

    clean up

M       data-raw/CMIP5_annual_global_average.csv

If there are some particular files you need a status for, I can give you the lowdown on those as well. I'm working from the versions checked into the repository, so you should also be able to find some information in the logs.

kdorheim commented 4 years ago

UPDATE: after looking into this I have uncovered some errors in my ESM_processing_code 🙄

kdorheim commented 4 years ago

This is turning out to be quite the nightmare, I think whenever I do this sort of this I am going to implement drake

rplzzz commented 4 years ago

Sorry to hear that; I've been there more than a few times myself. Let me know if there is anything I can do to help.

kdorheim commented 4 years ago

Okay we have some how both lost and gained data 🤔

Here is a kable of the data we "lost" in some cases we only lost data in a few years.

model variable experiment years_missing
CCSM4 heatflux historical 7
CESM1-BGC tas rcp45 95
GFDL-CM3 tas rcp60 90
GFDL-ESM2G co2 rcp60 90
GFDL-ESM2G tas rcp60 90
GISS-E2-H tas rcp60 50
GISS-E2-R tas rcp60 75
HadCM3 tas rcp45 30
HadGEM2-AO tas historical 146
HadGEM2-AO tas rcp26 95
HadGEM2-AO tas rcp45 95
HadGEM2-AO tas rcp60 95
HadGEM2-AO tas rcp85 95
MIROC-ESM-CHEM heatflux rcp45 1
MRI-CGCM3 heatflux historical 5
NorESM1-ME tas rcp60 50
kdorheim commented 4 years ago

While here is a kable of the data we gained.

model variable experiment
CanESM2 heatflux esmHistorical
CanESM2 heatflux esmrcp85
CESM1-BGC heatflux esmHistorical
CESM1-BGC heatflux esmrcp85
CNRM-CM5 tas historical
CNRM-CM5 tas rcp85
CSIRO-Mk3-6-0 tas historical
CSIRO-Mk3-6-0 tas rcp26
CSIRO-Mk3-6-0 tas rcp45
CSIRO-Mk3-6-0 tas rcp60
CSIRO-Mk3-6-0 tas rcp85
CSIRO-Mk3L-1-2 heatflux historical
EC-EARTH tas rcp26
EC-EARTH tas rcp45
EC-EARTH tas rcp85
MIROC-ESM co2 esmHistorical
MIROC-ESM co2 esmrcp85
MIROC-ESM heatflux esmHistorical
MIROC-ESM heatflux esmrcp85
MIROC-ESM tas esmHistorical
MIROC-ESM tas esmrcp85
MPI-ESM-LR co2 esmHistorical
MPI-ESM-LR co2 esmrcp85
MPI-ESM-LR tas esmHistorical
MPI-ESM-LR tas esmrcp85
MRI-ESM1 co2 esmHistorical
MRI-ESM1 co2 esmrcp85
MRI-ESM1 heatflux esmHistorical
MRI-ESM1 heatflux esmrcp85
MRI-ESM1 tas esmHistorical
MRI-ESM1 tas esmrcp85
NorESM1-ME co2 esmHistorical
NorESM1-ME co2 esmrcp85
NorESM1-ME heatflux esmHistorical
NorESM1-ME heatflux esmrcp85
NorESM1-ME tas esmHistorical
NorESM1-ME tas esmrcp85
kdorheim commented 4 years ago

Okay after much confusion I think I have arrived at our final data.

kdorheim commented 4 years ago

The comparison data I have now does not change the concentration max and min values but will change the concentration driven mean/median. The comparison data committed in eae1d2a did not have na.rm = TRUE when calculating the multi-model mean and median and so and which caused problems with the mean temperature data

image

kdorheim commented 4 years ago

See branch krd_issue27

kdorheim commented 4 years ago

Now we have emissions driven data from the following models

 "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM" 

So instead of only 3 models we have a total of 7 models!

If we compare what the old vs new emissions values are image we see that the max values do not change as much as the min values especially for co2, heatflux, and tas.

rplzzz commented 4 years ago

See branch krd_issue27

Currently that branch is showing as even with master. Do you still have more to do before you push your changes?

kdorheim commented 4 years ago

:big_sigh: I don't exactly understand what is going on. Looks like git command line vs git r studio gui are telling me different things.

kdorheim commented 4 years ago

@rplzzz Okay git issue resolved sort of, long story, but the branch krd_issue27 actually has the commits pushed to the remote branch.

rplzzz commented 4 years ago

Honestly, I find all the GUIs to be more confusing than helpful. With the command line tools I know what it is that I asked the system to do, and I can verify that it did it.

I'll take a look at the new data today or tomorrow, and if it looks like we have everything we need, then we can close this issue out.

kdorheim commented 4 years ago

Awesome!

rplzzz commented 4 years ago

Now we have emissions driven data from the following models

 "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM" 

So instead of only 3 models we have a total of 7 models!

However, the package data seems to have only 5, and they're not a subset of the 7 you have above:

> filter(cmip_individual, experiment=='esmrcp85') -> esmrcp85
> unique(esmhist$model)
[1] "bcc-csm1-1-m" "CanESM2"      "CESM1-BGC"    "GFDL-ESM2G"   "inmcm4" 

This is with the latest of the krd_issue27 branch.

Possibly the rda files in data were stale when you committed them. Can you double check and commit the up-to-date versions if necessary?

kdorheim commented 4 years ago

I re-ran the script that generates the comparison data and pushed the results.

> cmip_individual %>% filter(grepl('esm', experiment)) %>%  pull(model) %>%  unique()
[1] "CanESM2"    "CESM1-BGC"  "GFDL-ESM2M" "MRI-ESM1"   "NorESM1-ME" "GFDL-ESM2G" "MIROC-ESM"  
kdorheim commented 4 years ago

As we discussed on slack there were some issues with the comparison data among them being multiple entries for for a single model / year / ensemble member.

They were being caused for multiple reasons

  1. The code was pulling from old intermediate netcdf files that had not been properly cleaned up
  2. Some of the cdo code was processing daily data and monthly data
  3. Some of the co2 files had duplicate entries in them for the first year of a netcdf file and because some modeling groups save output data in multiple netcdf files we'd see these duplicated in the " middle" of the experiment, when that happened I removed the duplicate entires and and replaced with interpolated values. I wouldn't do this with temperature of heat flux data because there is so much inter-annual variability but I feel justified we can make this assumption with CO2, if you disagree please let me know.

Here are the models we have the emission driven data for

  model     
  <chr>     
1 CanESM2   
2 CESM1-BGC 
3 GFDL-ESM2G
4 GFDL-ESM2M
5 MIROC-ESM 
6 MRI-ESM1  
7 NorESM1-ME

Now when I check to see if there are any duplicates there are none! yay! (also I've added some assertthat checks to make sure that this doesn't happen again)

> cmip_individual %>% filter(grepl('esm', experiment)) %>% group_by(experiment, variable, model, year, ensemble) %>% summarise(year_count = n()) %>% ungroup %>% filter(year_count != 1) 
# A tibble: 0 x 6
# … with 6 variables: experiment <chr>, variable <chr>, model <chr>, year <dbl>, ensemble <chr>,
#   year_count <int>

This plot compares the min/max ESM range with the individual cmip model output. The thick dark black line is the multi-model mean from esm_comparison.

image

The min CO2 range is from MRI-ESM1 which yes I know is low but the results have been published else where including Hartin et al. 2015.

kdorheim commented 4 years ago

When you get a chance to look at this please let me know what you think.

rplzzz commented 4 years ago

This plot compares the min/max ESM range with the individual cmip model output. The thick dark black line is the multi-model mean from esm_comparison.

This definitely looks better. One thing I noticed is there seems to be some kind of anomaly in the CO2 data at approximately 2030. It looks like CESM1-BGC has an unusually low value that year. Since it's the highest, that affects the max envelope, and the drop is large enough to produce a noticeable effect on the multi-model mean.

I'll take a closer look, and if it's only one bad value, I think we should insert a special case to just interpolate across the gap for that model. Then I should be able to update the other datasets listed above, and we can finally rerun the calcs for the paper.

The min CO2 range is from MRI-ESM1 which yes I know is low but the results have been published else where including Hartin et al. 2015.

Sounds reasonable to me. If the rest of the community is happy with it, then who are we to dispute them?