review climate data - Githubissues

teixeirak commented 3 years ago

Our latest plot of the climate data has some obviously wrong values, including a strange line of data and a handful that are out of range. I assume these came in with GROA or maybe SRDB (I think it's been a long time since I've looked at this figure).

bpbond commented 3 years ago

Ummmm I've never seen anything like that in the SRDB data 🤔 but 🤷‍♂️

teixeirak commented 3 years ago

I just checked and it was not there after we added SRDB. So, it presumably comes from GROA.

bpbond commented 3 years ago

Phew :)

teixeirak commented 3 years ago

There appears to be some bad data from Taylor_2017_tari: very low precip for at least 4 sites in India (e.g., Kodayar I, IV...). Data came from here: https://data.nceas.ucsb.edu/view/knb.1274.1. @beckybanbury, do you have an intermediary data file with this climate info?

teixeirak commented 3 years ago

I believe that strange line of data comes mostly or entirely from Xu_2015_proa, imported via GROA. I don't see an explanation in the paper as to climate data source, but study system includes 164 plots spanning elevation gradient, so presumably climate is extrapolated as a function of elevation. I'm not sure if this is the most accurate possible for that location, but at least we have an explanation. I haven't verified against original data yet, but it seems unlikely that this is an error in GROA or ForC.

Here's MAP vs MAT for Xu_2015_proa sites: [figure not preserved]

beckybanbury commented 3 years ago

@teixeirak yes, that's an error - the intermediate data sheet is in this folder - it's the litterfall data file, and by the looks of things I just accidentally copied across the values from the adjacent column.

I've corrected in ForC_sites

Are there any others that look out?

teixeirak commented 3 years ago

Many thanks, @beckybanbury , and sorry to ping you on a weekend. No need to respond right away. (I'm pushing to solve some problems for a deadline this week, but we can just avoid questionable records at this point.)

It's hard to say if everything looks right now. Taylor 2017 has a number of records with very high precip, and so I started trying to check some. I verified that one was correct (Swer) but found one error (Wooroonooran National Park Bellenden Ker). Then, I got caught up trying to understand what's going on with the La Fortuna Forest Reserve, which has 5 sites with identical coordinates but different climate entered, but only seems to have one site when you go to the original pub. That will need to be solved, but I have to drop it for now.

teixeirak commented 3 years ago

I have reviewed the most egregious outliers. However, there are almost certainly some errors. It would be good to check the ForC climate data against a global database to identify values that are way off (e.g., units error during data entry).

beckybanbury commented 3 years ago

@teixeirak happy to help with this if you'd like, particularly the data from Taylor 2017 that I entered - I remember reviewing some of the C flux values that looked off at the time, but didn't check climate data so closely. Happy to spend some time reviewing if you'd like - just let me know how you want to approach this!

teixeirak commented 3 years ago

Thanks, @beckybanbury. I sent an email about this. More narrowly, figuring out what's going on with La Fortuna (see here) would be helpful.

teixeirak commented 3 years ago

@beckybanbury , thanks for working on this!

Based on the plots, let's flag sites "climate.data.suspect" if any of the following are true:

temperature difference > 5C
warmest or coldest month difference > 7.5C
log(precip) difference >1

You could just flag with a "1", or better yet list the variable(s) that is/are off.
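The flagging rule above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the variable keys (`mat`, `map`, `min.temp`, `max.temp`), the `;` separator, and the use of the natural log for precipitation are all assumptions, and the example values at the bottom are made up.

```python
import math

# Thresholds from the comment above. Variable keys are hypothetical.
MAT_MAX_DIFF = 5.0         # mean annual temperature, degrees C
MONTH_MAX_DIFF = 7.5       # warmest or coldest month temperature, degrees C
LOG_PRECIP_MAX_DIFF = 1.0  # |ln(reported) - ln(reference)|; natural log is assumed


def suspect_variables(reported, reference):
    """Return the names of variables whose reported value deviates from the
    reference (global database) value by more than the agreed threshold.

    Both arguments are dicts keyed by variable name. A missing or None value
    means the variable cannot be compared and is skipped.
    """
    flagged = []

    def pair(name):
        a, b = reported.get(name), reference.get(name)
        return (a, b) if a is not None and b is not None else None

    p = pair("mat")
    if p and abs(p[0] - p[1]) > MAT_MAX_DIFF:
        flagged.append("mat")

    for name in ("min.temp", "max.temp"):
        p = pair(name)
        if p and abs(p[0] - p[1]) > MONTH_MAX_DIFF:
            flagged.append(name)

    p = pair("map")
    # The log is only defined for positive precipitation values.
    if p and p[0] > 0 and p[1] > 0:
        if abs(math.log(p[0]) - math.log(p[1])) > LOG_PRECIP_MAX_DIFF:
            flagged.append("map")

    return flagged


def suspect_flag(reported, reference):
    """Join the flagged names into one string for climate.data.suspect.
    An empty string means nothing was flagged."""
    return ";".join(suspect_variables(reported, reference))


if __name__ == "__main__":
    # Made-up values: precipitation is off by a factor of ten.
    reported = {"mat": 25.1, "min.temp": 20.0, "max.temp": 29.0, "map": 210.0}
    reference = {"mat": 24.3, "min.temp": 19.2, "max.temp": 28.5, "map": 2100.0}
    print(suspect_flag(reported, reference))
```

Listing the variable names rather than writing a bare "1" tells the reviewer which value to look up first in the original publication.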

teixeirak commented 3 years ago

@beckybanbury , if you're able to complete the step above this week while @Troger4 is still with us, she could check the climate values that are way off.

beckybanbury commented 3 years ago

@teixeirak sorry - somehow I missed your previous comment! I've flagged with the name of the variable that is suspect. It doesn't look like there's too many.

teixeirak commented 3 years ago

@Troger4, please use the climate.data.suspect field in this file to identify the sites with suspicious climate data. The field is coded to indicate which value is bad. When one value is flagged as bad, please double-check the others too. If the original pub does not report climate data, please replace the bad value with "NI".

teixeirak commented 3 years ago

Also note: this file and the master ForC_sites DO NOT MATCH because sites missing coordinates are not included in the former.

Also, please create a new column in this file to note when you've reviewed the climate data.

Troger4 commented 3 years ago

Okay, I see there are 284 climate.data.suspect entries with MAP, MAT, min temp, and max temp. What do MAP and MAT represent in columns R and O? Thank you

ValentineHerr commented 3 years ago

Not sure what file you are working with exactly but it must be Mean Annual Precipitation and Mean Annual Temperature.

Metadata for the SITES table is here: https://github.com/forc-db/ForC/blob/master/metadata/sites_metadata.csv

Troger4 commented 3 years ago

Hi Valentine, I'm looking in ForC_sites_climate_data within extracted_sites_data, mean annual precip and mean annual temp makes sense. Thanks very much!

teixeirak commented 3 years ago

Those correspond to columns in ForC_sites, and indicate which have large deviations from the value pulled from the global database (WorldClim). Be sure to put fixes in ForC_sites (the master), not in extracted_sites_data.

Troger4 commented 3 years ago

> Also note: this file and the master ForC_sites DO NOT MATCH because sites missing coordinates are not included in the former.
>
> Also, please create a new column in this file to note when you've reviewed the climate data.

Which file is the "this file" you referred to? And which should I be looking in to find climate.data.suspect records? Thank you!

teixeirak commented 3 years ago

Sorry, I guess the links were confusing. Here it is with the file names:

Also note: [this file](https://github.com/forc-db/ForC/blob/master/data/extracted_site_data/ForC_sites_climate_data.csv) and the master ForC_sites DO NOT MATCH because sites missing coordinates are not included in the former.

Also, please create a new column in [this file](https://github.com/forc-db/ForC/blob/master/data/extracted_site_data/ForC_sites_climate_data.csv) to note when you've reviewed the climate data.

teixeirak commented 3 years ago

@mawilliams99 , this is an issue that you can get started on as an intro to the ForC data work.

There's a lot of discussion above, but summarizing here--

We (specifically Becky Banbury Morgan, @beckybanbury) identified some climate records in ForC (sites.csv) that are very different from values pulled from a global database. We'll want to go back to the original publications to check those.
Climate data are in the file sites.csv (https://github.com/forc-db/ForC/blob/master/data/ForC_sites.csv). The metadata file explaining each field in the SITES table is here: https://github.com/forc-db/ForC/blob/master/metadata/sites_metadata.csv.
Sites with suspect data are flagged in the climate.data.suspect field in this file (https://github.com/forc-db/ForC/blob/master/data/extracted_site_data/ForC_sites_climate_data.csv). The field is coded to indicate which value is bad. When one value is flagged as bad, please double-check the others too. If the original pub does not report climate data, please replace the bad value with "NI" (missing value code for "no information").
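As a rough illustration of the review step summarized above, the sketch below updates one site record after the original publication has been checked. It is not part of the ForC code base: the `climate.reviewed` column name, the `;` separator inside climate.data.suspect, and the variable keys are assumptions made for the example.

```python
NO_INFO = "NI"  # missing value code for "no information", as stated above


def apply_review(site, published):
    """Return an updated copy of one site record (a dict) after review.

    `published` maps variable names to the values found in the original
    publication. A variable the publication does not report is absent or
    None. All field names except climate.data.suspect are hypothetical.
    """
    flagged = [v for v in site.get("climate.data.suspect", "").split(";") if v]
    updated = dict(site)

    # Values reported in the publication replace what is in the record,
    # whether or not that variable was flagged (the "double-check" step).
    for var, value in published.items():
        if value is not None:
            updated[var] = value

    # Flagged variables that the publication does not report become "NI".
    for var in flagged:
        if published.get(var) is None:
            updated[var] = NO_INFO

    # Hypothetical column recording that the climate data were reviewed.
    updated["climate.reviewed"] = "1"
    return updated
```

Returning a copy rather than editing the record in place keeps the original row available for comparison before the fix is written to the master file.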

I'll message you separately to make sure this makes sense.

teixeirak commented 3 years ago

@mawilliams99 or @ValentineHerr, this should be a quick task: could one of you please merge the climate.data.suspect field in this file (https://github.com/forc-db/ForC/blob/master/data/extracted_site_data/ForC_sites_climate_data.csv) into the corresponding field in the master sites file? (The difference between the two is that the master includes a few sites with no coordinates.) The motivation is that reviewing the climate data doesn't have to happen with high priority, but we want to be sure to review any suspect data on sites we may send to EFDB.
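One possible shape for this merge, as a hedged sketch only: the function below copies the flag into the master rows by a site key and reports which master sites had no match. The default key `sites.sitename` is an assumption (check sites_metadata.csv for the actual identifier column), and rows are plain dicts such as those produced by `csv.DictReader`.

```python
def merge_suspect_flags(master_rows, climate_rows, key="sites.sitename"):
    """Copy climate.data.suspect from the extracted climate rows into the
    master rows, matching on `key` (an assumed site identifier column).

    Returns (merged_rows, unmatched_keys), where unmatched_keys lists the
    master sites that do not appear in the extracted file.
    """
    flags = {row[key]: row.get("climate.data.suspect", "") for row in climate_rows}
    merged, unmatched = [], []
    for row in master_rows:
        new_row = dict(row)
        if row[key] in flags:
            new_row["climate.data.suspect"] = flags[row[key]]
        else:
            # Site absent from the extracted file (for example, no
            # coordinates): keep what the master already has and report it.
            new_row.setdefault("climate.data.suspect", "")
            unmatched.append(row[key])
        merged.append(new_row)
    return merged, unmatched
```

Reporting the unmatched keys makes it easy to confirm that only the expected sites, such as those without coordinates, are missing from the extracted file.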

ValentineHerr commented 3 years ago

I'll work on this now

ValentineHerr commented 3 years ago

Looks like Yenisei 2lu and Yenisei 26lh/lw are missing from the file (besides the sites without latitudes). I'll go ahead and merge anyway, as I believe that you (@teixeirak) have been working with this site recently.

teixeirak commented 3 years ago

Thanks! This is very helpful.

Those two sites would be very similar to the other Yenisei sites, which don't have suspect data, so this is fine.

@mawilliams99, please be sure to check the climate.data.suspect field in sites for all the studies that you review. We should avoid sending any of those values to EFDB. It's better to replace them with NA, unless there's some really good reason to believe that the current data are correct (e.g., steep topographic gradients that would make the site quite different from most of the surrounding areas).
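A minimal sketch of how suspect values might be blanked before export, assuming the flag stores variable names separated by `;` and that records are plain dicts. The function name, the `keep` parameter, and the variable keys are hypothetical and not part of any existing EFDB export script.

```python
NA = "NA"


def mask_suspect_for_export(site, keep=()):
    """Return a copy of a site record with suspect climate values set to NA.

    `keep` lists variables that a reviewer has decided to trust despite the
    flag (for example, a site on a steep topographic gradient).
    """
    flagged = [v for v in site.get("climate.data.suspect", "").split(";") if v]
    out = dict(site)
    for var in flagged:
        if var not in keep and var in out:
            out[var] = NA
    return out
```

Masking at export time leaves the master record untouched, so a later review can still see and correct the original value.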

forc-db / ForC

review climate data #212