DOI-USGS / pgmtl-data-release

A repository for data release scripts and workflows for releasing process-guided meta-transfer learning predictions
Creative Commons Zero v1.0 Universal

Remove errant data (e.g., "LAB" values) from evaluation dataset #12

Closed jordansread closed 4 years ago

jordansread commented 4 years ago

We accepted a certain number of known flaws in the dataset, as we knew we weren't going to have a comprehensive QAQC effort and there will always be some degree of observation error.

But the "LAB" values and other flaws from this issue are in the current observations dataset and should be removed if possible. If these data issues are constrained to the extended test lakes, that is fine, since those observations are only used in evaluation for the exported value's RMSE. If they impact the 305 test lakes, that is bad because the RMSE tables used to evaluate the MTL performance would then be impacted. Also, these data should not be removed for any reason based on model performance. We should only remove them if they are clearly errant without the aid of any predictions to tell us this.

jordansread commented 4 years ago

These three manual files are flawed based on the earlier analysis: c('7a_temp_coop_munge/tmp/South_Center_DO_2018_09_11_All.rds', '7a_temp_coop_munge/tmp/Carlos_DO_2018_11_05_All.rds', '7a_temp_coop_munge/tmp/Greenwood_DO_2018_09_14_All.rds'). They impact three of the 305 test lakes ("nhdhr_120020307" "nhdhr_120020567" "nhdhr_58125241"), but these aren't the only data used for those lakes, so here is the per-source breakdown for each lake:

pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_120020307") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 4 x 3
  source                                                            n  rmse
  <chr>                                                         <int> <dbl>
1 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds    65  2.08
2 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds                  1413  1.50
3 7a_temp_coop_munge/tmp/South_Center_DO_2018_09_11_All.rds       853 10.8 
4 wqp                                                             114  1.14

pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_120020567") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 6 x 3
  source                                                                                         n  rmse
  <chr>                                                                                      <int> <dbl>
1 7a_temp_coop_munge/tmp/aitkin_anoka_becker_cook_mnlakedata_historicalfiles_manualentry.rds    63  3.97
2 7a_temp_coop_munge/tmp/Greenwood_DO_2018_09_14_All.rds                                      1043 12.0 
3 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds                                681  3.75
4 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds                                                452  3.58
5 7a_temp_coop_munge/tmp/Water_Temp.rds                                                      28151  3.08
6 wqp                                                                                          101  5.11

pgmtl_matched_to_observations %>% filter(site_id == "nhdhr_58125241") %>% group_by(source) %>% summarize(n = length(source), rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE)))
# A tibble: 5 x 3
  source                                                            n  rmse
  <chr>                                                         <int> <dbl>
1 7a_temp_coop_munge/tmp/Carlos_DO_2018_11_05_All.rds             996 11.1 
2 7a_temp_coop_munge/tmp/MN_fisheries_all_temp_data_Jan2018.rds   116  3.31
3 7a_temp_coop_munge/tmp/MPCA_temp_data_all.rds                  1852  2.85
4 7a_temp_coop_munge/tmp/Water_Temp.rds                         46818  2.96
5 wqp                                                             592  2.77
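
The same per-source summary is repeated for each of the three lakes above, so it could be wrapped in a small helper. This is only a sketch: `rmse_by_source` and the toy table are hypothetical names and made-up values, standing in for `pgmtl_matched_to_observations` with its `site_id`, `source`, `pred`, and `obs` columns.

```r
library(dplyr)

# Hypothetical helper: per-source observation count and RMSE for one lake
rmse_by_source <- function(matched, site) {
  matched %>%
    filter(site_id == site) %>%
    group_by(source) %>%
    summarize(n = n(), rmse = sqrt(mean((pred - obs)^2, na.rm = TRUE)))
}

# Toy data (made-up values), standing in for pgmtl_matched_to_observations
toy <- tibble(
  site_id = "nhdhr_toy",
  source  = c("a", "a", "b"),
  pred    = c(10, 12, 20),
  obs     = c(11, 11, 10)
)

res <- rmse_by_source(toy, "nhdhr_toy")
print(res)
```

A summary like this is useful for seeing how much each source contributes to a lake's RMSE, but per the note at the top of this issue it should not be the basis for deciding which data to remove.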

Sources matching "wqp_IL_EPA" only appear for sites in target_expansion_ids. There are 34 such sites:

feather::read_feather('../lake-temperature-model-prep/7b_temp_merge/out/merged_temp_data_daily.feather') %>% filter(str_detect(source, "wqp_IL_EPA"), site_id %in% target_expansion_ids) %>% pull(site_id) %>% unique()

 [1] "nhdhr_109982172" "nhdhr_109984628" "nhdhr_109986464" "nhdhr_109986912" "nhdhr_109987472" "nhdhr_109989384" "nhdhr_109989482"
 [8] "nhdhr_109990726" "nhdhr_121207127" "nhdhr_121207134" "nhdhr_121207285" "nhdhr_121624992" "nhdhr_121625003" "nhdhr_121625323"
[15] "nhdhr_121627799" "nhdhr_121628055" "nhdhr_121628955" "nhdhr_121650552" "nhdhr_121650572" "nhdhr_121650602" "nhdhr_121650613"
[22] "nhdhr_121650633" "nhdhr_121650643" "nhdhr_145607036" "nhdhr_145608202" "nhdhr_145757037" "nhdhr_156039648" "nhdhr_83837813" 
[29] "nhdhr_85083102"  "nhdhr_90588560"  "nhdhr_109992116" "nhdhr_121650562" "nhdhr_121650592" "nhdhr_109986638"
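
The `bad_EPA` vector used in the next query is not defined in this thread. Presumably it holds the 34 site ids listed above, but that is an assumption. A sketch of how it could be built, and how the "only appear in target_expansion_ids" claim could be checked, using a toy table with placeholder site ids and made-up source names:

```r
library(dplyr)
library(stringr)

# Toy stand-ins (placeholder ids, made-up source names) for the merged
# daily observations and for target_expansion_ids
merged_obs <- tibble(
  site_id = c("nhdhr_A", "nhdhr_A", "nhdhr_B", "nhdhr_C"),
  source  = c("wqp_IL_EPA_x", "wqp", "wqp", "wqp_IL_EPA_y")
)
target_expansion_ids <- c("nhdhr_A", "nhdhr_B", "nhdhr_C")

# Assumed definition: expansion sites with at least one wqp_IL_EPA record
bad_EPA <- merged_obs %>%
  filter(str_detect(source, "wqp_IL_EPA"), site_id %in% target_expansion_ids) %>%
  pull(site_id) %>%
  unique()

# Check that no wqp_IL_EPA record falls outside the expansion sites
epa_sites_all <- merged_obs %>%
  filter(str_detect(source, "wqp_IL_EPA")) %>%
  pull(site_id) %>%
  unique()
only_in_expansion <- all(epa_sites_all %in% target_expansion_ids)
```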

Excluding these sites, the worst PB0 RMSE is 8.82° (n=2187); among these 34 sites, the best-performing has an RMSE of 5.33° and the worst 18.2°.

pb0_matched_to_observations %>% filter(site_id %in% bad_EPA) %>% group_by(site_id) %>% summarize(rmse = sqrt(mean((pred-obs)^2, na.rm=TRUE))) %>% arrange(desc(rmse)) %>% print(n=100)
# A tibble: 34 x 2
   site_id          rmse
   <chr>           <dbl>
 1 nhdhr_121207127 18.2 
 2 nhdhr_109986912 17.6 
 3 nhdhr_109986464 14.6 
 4 nhdhr_109984628 14.5 
 5 nhdhr_121627799 14.5 
 6 nhdhr_109990726 13.8 
 7 nhdhr_109989482 13.5 
 8 nhdhr_145608202 13.0 
 9 nhdhr_121650552 12.9 
10 nhdhr_121650602 12.8 
11 nhdhr_121628955 12.7 
12 nhdhr_85083102  12.4 
13 nhdhr_121207134 12.2 
14 nhdhr_109986638 12.1 
15 nhdhr_121650633 11.7 
16 nhdhr_109987472 11.2 
17 nhdhr_109982172 11.0 
18 nhdhr_121650592 10.9 
19 nhdhr_121650613 10.7 
20 nhdhr_121625003 10.3 
21 nhdhr_109989384 10.1 
22 nhdhr_121625323  9.72
23 nhdhr_145607036  8.98
24 nhdhr_90588560   8.53
25 nhdhr_121207285  8.41
26 nhdhr_83837813   8.38
27 nhdhr_109992116  8.23
28 nhdhr_121624992  7.70
29 nhdhr_121650643  7.64
30 nhdhr_156039648  7.46
31 nhdhr_121650562  7.07
32 nhdhr_121650572  6.99
33 nhdhr_121628055  6.24
34 nhdhr_145757037  5.33

For this, I have only used information about these specific sources that came from two findings unrelated to predictions: 1) the three .rds files were flipped, and 2) these EPA sites had "LAB" temperature data intermingled with actual field temperature readings.
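
A removal based only on those two findings could filter on `source` and never touch a `pred` column. The sketch below is one possible approach, not necessarily what was done in the repository: the toy `obs` table, its values, and the choice to drop every "wqp_IL_EPA" record (rather than only the "LAB" values) are all assumptions.

```r
library(dplyr)
library(stringr)

# The three source files flagged in this thread
flawed_sources <- c(
  "7a_temp_coop_munge/tmp/South_Center_DO_2018_09_11_All.rds",
  "7a_temp_coop_munge/tmp/Carlos_DO_2018_11_05_All.rds",
  "7a_temp_coop_munge/tmp/Greenwood_DO_2018_09_14_All.rds"
)

# Toy observations (placeholder site ids, made-up values and EPA source name)
obs <- tibble(
  site_id = c("nhdhr_toy1", "nhdhr_toy1", "nhdhr_toy2", "nhdhr_toy2"),
  source  = c("wqp", flawed_sources[1], "wqp_IL_EPA_example", "wqp"),
  obs     = c(12.1, 3.4, 25.0, 14.2)
)

# Filter on provenance only; no predictions are consulted
obs_clean <- obs %>%
  filter(!source %in% flawed_sources,
         !str_detect(source, "wqp_IL_EPA"))
```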

jordansread commented 4 years ago

completed in #10