Closed dylanbeaudette closed 6 years ago
An idea: re-shape Supplementary_Tier_3
so that there is a column for every combination of analyte / prep code.
A better idea: don't filter on prep_code == 3.
Example and painful note to self:
library(DBI)
library(soilDB)
# get some examples
q <- 'SELECT * FROM Supplementary_Tier_3'
# setup connection to SQLite DB from FGDB export
db <- dbConnect(RSQLite::SQLite(), "E:/working_copies/lab-data-delivery/code/text-file-to-sqlite/KSSL-data.sqlite")
x <- dbGetQuery(db, q)
# I know that the UCD SoilVeg pedons have some issues with prep code
s <- fetchKSSL('pentz')
# there are some good examples!
x[x$labsampnum %in% s$labsampnum, c('labsampnum', 'prep_code', 'wrd_l2')]
Just look at all of those records that are missing from the fetchKSSL
results:
labsampnum prep_code wrd_l2
58704 83P01338 <NA> 0.15
58705 83P01339 <NA> 0.12
58706 83P01340 <NA> 0.12
58707 83P01341 <NA> NA
166210 05N02186 <NA> 0.20
166211 05N02187 <NA> 0.17
166212 05N02188 <NA> 0.18
166213 05N02189 <NA> NA
255582 UCD01518 <NA> 5.80
255583 UCD01519 <NA> 15.90
255821 UCD02775 <NA> 6.70
Some examples of multiple prep codes:
## the underlying problem has to do with duplicate records via multiple prep codes
idx <- names(which(table(x$labsampnum) > 1))
idx <- which(x$labsampnum %in% idx)
y <- x[idx, ]
# some examples
y[1:20, c('labsampnum', 'prep_code', 'wrd_l2')]
labsampnum prep_code wrd_l2
407 00P00001 <NA> NA
408 00P00001 HM NA
409 00P00002 <NA> 0.24
410 00P00002 HM 0.24
411 00P00002 S 0.24
9116 40A10739 <NA> 0.12
9117 40A10739 S 0.12
9118 40A10740 <NA> 0.16
9119 40A10740 S 0.16
9120 40A10741 <NA> 0.18
9121 40A10741 S 0.18
9122 40A10742 <NA> 0.15
9123 40A10742 S 0.15
9124 40A10743 <NA> 0.14
9125 40A10743 S 0.14
9126 40A10744 <NA> 0.11
9127 40A10744 S 0.11
9128 40A10745 <NA> 0.12
9129 40A10745 S 0.12
44370 79P00521 <NA> 0.32
This is kind of a hack, but would "solve" the problem in the Tier_3 table: keep only the distinct set of labsampnum
and wrd_l2
values, ignoring the prep code:
# get some examples
q <- 'SELECT DISTINCT labsampnum, wrd_l2 FROM Supplementary_Tier_3;'
z <- dbGetQuery(db, q)
# 1:1 relationship between labsampnum and wrd_l2 when we ignore prep code?
# result is FALSE
any(table(z$labsampnum) > 1)
This means that the proposed filtering is good enough for wrd_l2
, no matter the prep code.
An example for the bulk density table, less of a problem since 90%+ of records use prep code S.
library(DBI)
library(soilDB)
# get some examples
q <- 'SELECT * FROM Bulk_Density_and_Moisture'
# setup connection to SQLite DB from FGDB export
db <- dbConnect(RSQLite::SQLite(), "E:/working_copies/lab-data-delivery/code/text-file-to-sqlite/KSSL-data.sqlite")
x <- dbGetQuery(db, q)
# I know that the UCD SoilVeg pedons have some issues with prep code
s <- fetchKSSL('pentz')
# there are some good examples!
x[x$labsampnum %in% s$labsampnum, c('labsampnum', 'prep_code', 'db_13b', 'db_od', 'COLEws', 'wrd_ws13', 'w3cld', 'w15l2', 'w15cly')]
round(prop.table(table(x$prep_code, useNA = 'always')), 2)
The latest commit seems to have included all of the wrd_12
values.
I'll push these changes to SoilWeb tomorrow.
Need to solve problem with multiple (rows) prep codes (most common: "S","GP","HM") in several tables:
Note that some samples have no
prep_code
, most have "S". and some (notably SoilVeg water retention) are "GP".Tables affected:
Bulk_Density_and_Moisture
: not really a problemSupplementary_Tier_3
: major problem!