COMPASS-DOE / rs-synthesis


QA-QC for existing dataset #42

Closed kendalynnm closed 2 years ago

kendalynnm commented 2 years ago

Two things that I noticed while working with the meta_df.

1) Duplicate rows: a left join of meta_manip and meta_control results in a df with 454 rows, but passing this df through distinct() drops 64(!) rows.

2) Mismatch in Study_midyear and Percent_control: the only currently known instance is study number 6066 (author: Thomey). Percent_control is incorrect, but Manipulation_level appears to be correct. The source of the error appears to be human data entry from "Variance and N - Water manipulations.csv".
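The duplicate-row check in item 1 can be sketched roughly as follows. This is only an illustration on toy data: the tables below are small stand-ins for meta_manip and meta_control, the columns Manip_value and Control_value and all values are made up, and the real join uses more key columns.

```r
library(dplyr)

# Toy stand-ins for meta_manip and meta_control (hypothetical values)
meta_manip <- tibble(
  Study_number  = c(1, 1, 2),
  Study_midyear = c(2000.5, 2000.5, 2001.5),
  Manip_value   = c(10, 10, 20)   # the first two rows are exact duplicates
)
meta_control <- tibble(
  Study_number  = c(1, 2),
  Study_midyear = c(2000.5, 2001.5),
  Control_value = c(8, 15)
)

meta_df <- left_join(meta_manip, meta_control,
                     by = c("Study_number", "Study_midyear"))

# Number of rows that distinct() would drop
n_dropped <- nrow(meta_df) - nrow(distinct(meta_df))

# Each row that occurs more than once, with its number of copies
dup_rows <- meta_df %>%
  group_by(across(everything())) %>%
  summarise(copies = n(), .groups = "drop") %>%
  filter(copies > 1)
```

Looking at dup_rows rather than only the row count shows which studies the dropped rows belong to, which may help trace them back to the source sheet.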

bpbond commented 2 years ago

Study 1421: why are there only two rows in the Google sheet but 6 rows in SRDB (and in older versions of the sheet as downloaded to csv)?

bpbond commented 2 years ago

List of suspect studies with more than one control for a given manipulation:

1421 3710 3978 4549 6066 6066 6112 6112 6166 6168 6168 6168 6168 6168 6168 6168 6168 6168 6540 7046 7048 7115 7593 7593 8163 8509 9649 10816 11031 11064 11064 11064 11064 11078 11859

Generated by

library(dplyr)

# Join each manipulation row to its control(s) on the shared study descriptors,
# then list the studies where a key combination yields more than one joined row.
meta_manip %>%
  select(-Quality_flag) %>%
  left_join(meta_control,
            by = c("Study_number", "Study_midyear", "Ecosystem_type",
                   "Latitude", "Meas_method", "Soil_type", "Soil_drainage",
                   "Elevation", "depvar")) %>%
  group_by(Study_number, Study_midyear, Ecosystem_type,
           Latitude, Meas_method, Soil_type, Soil_drainage,
           Elevation, depvar) %>%
  summarise(n = n()) %>%
  filter(n > 1) %>%
  pull(Study_number)
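
A possible complement to the query above, as a rough sketch only: counting rows per key in each table separately shows which side contributes the extra rows, since a key with several manipulation rows and a single control also ends up with n > 1 after the join. The key set is shortened for illustration, and the toy tables, the Control_value column and the depvar value are assumptions, not real data.

```r
library(dplyr)

# Shortened key set for illustration; the real join uses nine columns
keys <- c("Study_number", "Study_midyear", "depvar")

# Toy tables with made-up values
meta_manip <- tibble(
  Study_number       = c(100, 100, 200),
  Study_midyear      = c(2001.5, 2001.5, 2003.5),
  depvar             = "Rs_annual",
  Manipulation_level = c(70, 130, 50)
)
meta_control <- tibble(
  Study_number  = c(100, 200, 200),
  Study_midyear = c(2001.5, 2003.5, 2003.5),
  depvar        = "Rs_annual",
  Control_value = c(500, 610, 640)
)

# Rows per key on each side of the join
manip_n   <- count(meta_manip,   across(all_of(keys)), name = "n_manip")
control_n <- count(meta_control, across(all_of(keys)), name = "n_control")

key_counts <- full_join(manip_n, control_n, by = keys)

# Keys that really have more than one control
multi_control <- filter(key_counts, n_control > 1)
```

In this toy data, study 100 would be flagged by the n > 1 filter after the join even though it has only one control, while multi_control contains only study 200.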
kendalynnm commented 2 years ago

Suspect studies grouped by proposed fix:

Remove a record
• 3710
  o A regional control value as well as a value from another control plot (paired with a warming manipulation) is included. Desired records are 2371 & 2369; there should be one row of data in the end. Possible solution is to delete record 2368.

Adding species
• 3978
  o Should be 4 entries, paired by soil texture and microhabitat. Currently there are 6 entries with some crossover. Hard to compare because of the unit conversion: I'm not sure how it was converted, and my calcs give a different answer. Probably the annual value was not computed as 365 days of the g/m2/d rate.
  o Adding species data should fix it
• 7046
  o Should be 2 rows, one for alfalfa, one for needle grass; currently 4 rows with species crossed
  o Adding species would fix
• 7048
  o Currently have 18 rows, should have 6
  o Adding species would fix
• 8509
  o Have 18 rows, should have 6; species will fix it
• 11031
  o Have 4 rows, should have 2; species will fix
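
For the species cases, here is a minimal sketch of why a species key would remove the crossed rows. It assumes a column named Species exists in both tables, which is hypothetical; the study number, Manip_value, Control_value and all values are made up, and only the two-species structure mirrors 7046.

```r
library(dplyr)

# Toy two-species study; Species is a hypothetical column name
meta_manip <- tibble(
  Study_number = 1,
  Species      = c("alfalfa", "needle grass"),
  Manip_value  = c(1.2, 0.9)
)
meta_control <- tibble(
  Study_number  = 1,
  Species       = c("alfalfa", "needle grass"),
  Control_value = c(1.0, 0.8)
)

# Without species in the key, every manipulation row matches every control row
# (dplyr 1.1.0 and later warn about the many-to-many match here)
crossed <- left_join(meta_manip, meta_control, by = "Study_number")

# With species in the key, each manipulation row matches only its own control
paired <- left_join(meta_manip, meta_control,
                    by = c("Study_number", "Species"))

nrow(crossed)  # 4 rows, species crossed
nrow(paired)   # 2 rows, one per species
```

The same pattern would apply to any extra matching column, such as restored vs unrestored or manipulation level, once that information exists in both tables.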

Adding something else which currently doesn’t exist in SRDB
• 4549
  o Should be 2 rows, manipulation and control for unrestored and restored; currently 4 rows
  o Need to add this restored vs unrestored information for proper matching; it currently does not exist in SRDB
• 8163
  o Have 16 rows, should have 8; species will mostly fix it. Second Stipa record should be ‘mixed’ or ‘fallow’
• 10816
  o Have 4 rows, should have 2: “hummocks” vs “hollows”. This data is not currently in the SRDB
• 11859
  o Have 4 rows, should have 2: “high” vs “low” biological crust. This data is not currently in the SRDB

Human error
• 6066
  o Correct number of rows, but mismatch in Study_midyear and Percent_control. Percent_control is incorrect, but Manipulation_level appears to be correct. Source of error appears to be human data entry from "Variance and N - Water manipulations.csv"?

Other
• 6168
  o 1998 – 2002: n = either 3 or 12? 2003 – 2007: n = either 3 or 6? Plot data was combined, so n = 3 is likely the most accurate.
  o 5 years of data with two MA-appropriate treatments in each year, so there should be 10 rows of data; we have 26. They should be evenly distributed between 70% precip and altered-timing 70% precip. Not sure why this one isn’t working.
• 7593
  o Should have 4 rows, have 8; duplicates with the summer control where the winter control should be and vice versa
  o Adding manipulation level should fix it (same for 6168, I think).

Correct, yay!
• 6112
  o Same study structure as 6066, but appears correct
• 6166
  o Correct
• 6540
  o Correct
• 7115
  o Correct
• 9649
  o Correct
• 11064
  o Correct!
• 11078
  o Correct!

bpbond commented 2 years ago

Thank you @kendalynnm !

bpbond commented 2 years ago
kendalynnm commented 2 years ago

Follow-up checklist for Google sheet