dylanbeaudette / process-kssl-snapshot

code, documentation, models, related to processing the KSSL DB snapshot
5 stars 2 forks source link

Preparation Code Hell #4

Closed dylanbeaudette closed 6 years ago

dylanbeaudette commented 7 years ago

Need to solve problem with multiple (rows) prep codes (most common: "S","GP","HM") in several tables:

Note that some samples have no prep_code, most have "S". and some (notably SoilVeg water retention) are "GP".

Tables affected:

dylanbeaudette commented 6 years ago

An idea: re-shape Supplementary_Tier_3 so that there is a column for every combination of analyte / prep code.

A better idea: don't filter on prep_code == 3.

Example and painful note to self:

library(DBI)
library(soilDB)

# get some examples
q <- 'SELECT * FROM Supplementary_Tier_3'

# setup connection to SQLite DB from FGDB export
db <- dbConnect(RSQLite::SQLite(), "E:/working_copies/lab-data-delivery/code/text-file-to-sqlite/KSSL-data.sqlite")

x <- dbGetQuery(db, q)

# I know that the UCD SoilVeg pedons have some issues with prep code
s <- fetchKSSL('pentz')

# there are some good examples!
x[x$labsampnum %in% s$labsampnum, c('labsampnum', 'prep_code', 'wrd_l2')]

Just look at all of those records that are missing from the fetchKSSL results:

       labsampnum prep_code wrd_l2
58704    83P01338      <NA>   0.15
58705    83P01339      <NA>   0.12
58706    83P01340      <NA>   0.12
58707    83P01341      <NA>     NA
166210   05N02186      <NA>   0.20
166211   05N02187      <NA>   0.17
166212   05N02188      <NA>   0.18
166213   05N02189      <NA>     NA
255582   UCD01518      <NA>   5.80
255583   UCD01519      <NA>  15.90
255821   UCD02775      <NA>   6.70

Some examples of multiple prep codes:

## the underlying problem has to do with duplicate records via multiple prep codes
idx <- names(which(table(x$labsampnum) > 1))
idx <- which(x$labsampnum %in% idx)
y <- x[idx, ]

# some examples
y[1:20, c('labsampnum', 'prep_code', 'wrd_l2')]
     labsampnum prep_code wrd_l2
407     00P00001      <NA>     NA
408     00P00001        HM     NA
409     00P00002      <NA>   0.24
410     00P00002        HM   0.24
411     00P00002         S   0.24
9116    40A10739      <NA>   0.12
9117    40A10739         S   0.12
9118    40A10740      <NA>   0.16
9119    40A10740         S   0.16
9120    40A10741      <NA>   0.18
9121    40A10741         S   0.18
9122    40A10742      <NA>   0.15
9123    40A10742         S   0.15
9124    40A10743      <NA>   0.14
9125    40A10743         S   0.14
9126    40A10744      <NA>   0.11
9127    40A10744         S   0.11
9128    40A10745      <NA>   0.12
9129    40A10745         S   0.12
44370   79P00521      <NA>   0.32
dylanbeaudette commented 6 years ago

This is kind of a hack, but would "solve" the problem in the Tier_3 table: keep only the distinct set of labsampnum and wrd_l2 values, ignoring the prep code:

# get some examples
q <- 'SELECT DISTINCT labsampnum, wrd_l2 FROM Supplementary_Tier_3;'

z <- dbGetQuery(db, q)

# 1:1 relationship between labsampnum and wrd_l2 when we ignore prep code?
# result is FALSE
any(table(z$labsampnum) > 1)

This means that the proposed filtering is good enough for wrd_l2, no matter the prep code.

dylanbeaudette commented 6 years ago

An example for the bulk density table, less of a problem since 90%+ of records use prep code S.

library(DBI)
library(soilDB)

# get some examples
q <- 'SELECT * FROM Bulk_Density_and_Moisture'

# setup connection to SQLite DB from FGDB export
db <- dbConnect(RSQLite::SQLite(), "E:/working_copies/lab-data-delivery/code/text-file-to-sqlite/KSSL-data.sqlite")

x <- dbGetQuery(db, q)

# I know that the UCD SoilVeg pedons have some issues with prep code
s <- fetchKSSL('pentz')

# there are some good examples!
x[x$labsampnum %in% s$labsampnum, c('labsampnum', 'prep_code', 'db_13b', 'db_od', 'COLEws', 'wrd_ws13', 'w3cld', 'w15l2', 'w15cly')]

round(prop.table(table(x$prep_code, useNA = 'always')), 2)
dylanbeaudette commented 6 years ago

The latest commit seems to have included all of the wrd_12 values.

https://github.com/dylanbeaudette/process-kssl-snapshot/commit/1ef8db0bc39e73642ce5e23e99be05ae1064b9e4#diff-73bd0cf7615fccf6e232df97fb72678e

I'll push these changes to SoilWeb tomorrow.