data formatting, internal and external

topepo commented 2 years ago

Just some thoughts about data structures...

This will be much more informed when we have examples of more complex experiments and complex instrument results.

"External format"

This is the shape of the data as the user has it.

There are a few ways that the data could be formatted by the user. I'll use the tidyr terminology of "longer" and "wider".

Wider would be where the wavelength values are common across samples and the intensity data are in columns, one column per wavelength. The number of rows probably represents the number of samples in the data. The meats data in the modeldata package is formatted like this:

> meats %>% relocate(water, fat, protein)
# A tibble: 215 × 103
   water   fat protein x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010
   <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  60.5  22.5    16.7  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63
 2  46    40.1    13.5  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88
 3  71     8.4    20.5  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60
 4  72.8   5.9    20.7  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84
 5  58.3  25.5    15.5  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81
 6  44    42.7    13.7  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06
 7  44    42.7    13.7  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04
 8  69.3  10.6    19.3  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54
 9  61.4  19.9    17.7  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33
10  61.4  19.9    17.7  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47
# … with 205 more rows, and 90 more variables: x_011 <dbl>, x_012 <dbl>,
#   x_013 <dbl>, x_014 <dbl>, x_015 <dbl>, x_016 <dbl>, x_017 <dbl>, x_018 <dbl>,
#   x_019 <dbl>, x_020 <dbl>, x_021 <dbl>, x_022 <dbl>, x_023 <dbl>, x_024 <dbl>,
#   x_025 <dbl>, x_026 <dbl>, x_027 <dbl>, x_028 <dbl>, x_029 <dbl>, x_030 <dbl>,
#   x_031 <dbl>, x_032 <dbl>, x_033 <dbl>, x_034 <dbl>, x_035 <dbl>, x_036 <dbl>,
#   x_037 <dbl>, x_038 <dbl>, x_039 <dbl>, x_040 <dbl>, x_041 <dbl>, x_042 <dbl>,
#   x_043 <dbl>, x_044 <dbl>, x_045 <dbl>, x_046 <dbl>, x_047 <dbl>, x_048 <dbl>, …

A longer version would be where there is one column for the wavelength (or some frequency-type index) and another column for the outcome (e.g., intensity or absorption), so each row is a single measurement.

For the meats data, that would look like

> meat_longer
# A tibble: 21,500 × 6
   water   fat protein sample intensity wavelength
   <dbl> <dbl>   <dbl>  <int>     <dbl>      <dbl>
 1  60.5  22.5    16.7      1      2.62          1
 2  60.5  22.5    16.7      1      2.62          2
 3  60.5  22.5    16.7      1      2.62          3
 4  60.5  22.5    16.7      1      2.62          4
 5  60.5  22.5    16.7      1      2.62          5
 6  60.5  22.5    16.7      1      2.62          6
 7  60.5  22.5    16.7      1      2.62          7
 8  60.5  22.5    16.7      1      2.62          8
 9  60.5  22.5    16.7      1      2.63          9
10  60.5  22.5    16.7      1      2.63         10
# … with 21,490 more rows

We should be able to work with data in either format.
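As a rough illustration of why either format should be workable, here is a minimal sketch of a round trip between the two shapes with tidyr. The toy tibble and its column names are made up for the example (they only mimic the meats layout), and the zero-padded `x_` naming is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy "wider" data: one row per sample, one column per wavelength index.
wide <- tibble(
  sample = 1:2,
  water  = c(60.5, 46.0),
  x_001  = c(2.62, 2.83),
  x_002  = c(2.62, 2.84),
  x_003  = c(2.63, 2.84)
)

# Wider -> longer: one row per sample/wavelength pair.
long <-
  wide %>%
  pivot_longer(
    starts_with("x_"),
    names_to = "wavelength",
    names_prefix = "x_",
    names_transform = list(wavelength = as.integer),
    values_to = "intensity"
  )

# Longer -> wider: rebuild the zero-padded column names, then spread them out.
wide_again <-
  long %>%
  mutate(wavelength = sprintf("x_%03d", wavelength)) %>%
  pivot_wider(names_from = wavelength, values_from = intensity)
```

Nothing is lost in either direction as long as every sample shares the same wavelength grid, which is the assumption the wider format already makes.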

"Internal format"

Internal to the recipe, the longer format is better, but we probably want to store the data in a more compact way.

For each unique combination of the non-measurement columns, we should nest the spectroscopy data into a single list-column entry.

For the meat data (in longer format), that would be

> meat_grouped
# A tibble: 215 × 5
   water   fat protein sample      .measurements
   <dbl> <dbl>   <dbl>  <int> <list<tibble[,2]>>
 1  60.5  22.5    16.7      1          [100 × 2]
 2  46    40.1    13.5      2          [100 × 2]
 3  71     8.4    20.5      3          [100 × 2]
 4  72.8   5.9    20.7      4          [100 × 2]
 5  58.3  25.5    15.5      5          [100 × 2]
 6  44    42.7    13.7      6          [100 × 2]
 7  44    42.7    13.7      7          [100 × 2]
 8  69.3  10.6    19.3      8          [100 × 2]
 9  61.4  19.9    17.7      9          [100 × 2]
10  61.4  19.9    17.7     10          [100 × 2]
# … with 205 more rows

The rows again reflect the total number of samples and .measurements is a list column with the assay results:

> meat_grouped$.measurements[[1]]
# A tibble: 100 × 2
   intensity wavelength
       <dbl>      <dbl>
 1      2.62          1
 2      2.62          2
 3      2.62          3
 4      2.62          4
 5      2.62          5
 6      2.62          6
 7      2.62          7
 8      2.62          8
 9      2.63          9
10      2.63         10
# … with 90 more rows

We could have an initial function that makes this conversion, something like step_spectra_collect(outcome, index), which would nest the outcome and index columns into .measurements (I think that we could have step names that start with step_spectra_* or something).
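To make the idea concrete, here is a minimal sketch of what such a conversion could do outside of a recipe. `collect_measurements()` is a hypothetical plain function, not an existing step or package API, and the toy data are made up; it only assumes that `tidyr::nest()` is an acceptable way to build the list column.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of any package): nest the outcome and index
# columns into a `.measurements` list column, leaving one row per unique
# combination of the remaining columns.
collect_measurements <- function(data, outcome, index) {
  data %>%
    nest(.measurements = c({{ outcome }}, {{ index }}))
}

# Toy longer-format data: two samples, three wavelengths each.
long <- tibble(
  sample     = rep(1:2, each = 3),
  water      = rep(c(60.5, 46.0), each = 3),
  wavelength = rep(1:3, times = 2),
  intensity  = c(2.62, 2.62, 2.63, 2.83, 2.84, 2.84)
)

grouped <- collect_measurements(long, outcome = intensity, index = wavelength)

# The nesting is reversible, so later steps could unnest when they need to.
long_again <- unnest(grouped, .measurements)
```

Nesting everything except the two measurement columns avoids having to list the ID columns by hand, at the cost of treating every other column as an identifier.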

Here's some example code to go between formats for two examples:

library(janitor)
library(tidymodels)

# ------------------------------------------------------------------------------

tidymodels_prefer()
theme_set(theme_bw())

# ------------------------------------------------------------------------------

data(meats)

meat_longer <-
  meats %>%
  mutate(sample = row_number()) %>%
  pivot_longer(c(starts_with("x_")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x_", "", name))) %>%
  select(-name)

meat_grouped <-
  meat_longer %>%
  group_by(water, fat, protein, sample) %>% 
  group_nest(.key = ".measurements") %>% 
  arrange(sample)

# ------------------------------------------------------------------------------

load(url("https://github.com/topepo/FES/blob/master/Data_Sets/Pharmaceutical_Manufacturing_Monitoring/small_scale.RData?raw=true"))

pharma_longer <-
  small_scale %>%
  clean_names() %>% 
  select(-batch_sample) %>% 
  pivot_longer(c(starts_with("x")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x", "", name))) %>%
  select(-name)

pharma_grouped <-
  pharma_longer %>%
  # batch_sample was dropped when building pharma_longer, so group only by
  # the identifier columns that remain
  group_by(batch_id, sample, glucose) %>%
  group_nest(.key = ".measurements") %>%
  arrange(sample)

JamesHWade commented 1 year ago

A first pass at this is addressed by #7. I'm sure we can make it a lot better but it "works." Feedback is more than welcome since I'm still very much in "learning" mode for recipes.

topepo commented 11 months ago

A friend and I were working on a data set like this the other day, prompting me to get off my 🍑 a bit on this.

Would it make sense to:

Have two different recipe steps that collate the data: one for wide inputs and another for long inputs?
Use a common step prefix in the package. So maybe step_spectra_input_wide() and step_spectra_input_long() (then later things like step_spectra_{baseline subtract technique} and so on)? Tab-complete has been very helpful for recipe step names.
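A minimal sketch of the two-entry-point idea, written as plain functions rather than recipe steps. `input_long()` and `input_wide()` are hypothetical names used only for illustration, and the `prefix` argument and toy data are assumptions; the point is that the wide path can pivot first and then reuse the long path, so both end up in the same internal format.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper for long inputs: nest the outcome and index columns.
input_long <- function(data, outcome, index) {
  data %>%
    nest(.measurements = c({{ outcome }}, {{ index }}))
}

# Hypothetical helper for wide inputs: pivot the measurement columns to long
# form, then reuse the long-input helper.
input_wide <- function(data, cols, prefix = "x_") {
  data %>%
    pivot_longer(
      {{ cols }},
      names_to = "wavelength",
      names_prefix = prefix,
      names_transform = list(wavelength = as.numeric),
      values_to = "intensity"
    ) %>%
    input_long(outcome = intensity, index = wavelength)
}

# The same two samples, once in each external format.
wide <- tibble(sample = 1:2, x_001 = c(2.62, 2.83), x_002 = c(2.62, 2.84))
long <- tibble(
  sample     = rep(1:2, each = 2),
  intensity  = c(2.62, 2.62, 2.83, 2.84),
  wavelength = c(1, 2, 1, 2)
)

from_wide <- input_wide(wide, starts_with("x_"))
from_long <- input_long(long, outcome = intensity, index = wavelength)
```

Both calls give one row per sample with a two-column `.measurements` entry, so anything downstream would only need to know about one shape.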

topepo commented 11 months ago

Hmm. Is "spectra" too specific?

JamesHWade commented 11 months ago

I like wide vs long for function names. "Spectra" is a bit too specific. It works for a lot of the acronym soup of measurement science (e.g., NMR, MS, IR, UV/VIS) but misses on others (e.g., chromatogram, thermogram). Are step_measure_input_wide() and step_measure_input_long() too generic?

topepo commented 11 months ago

Are step_measure_input_wide() and step_measure_input_long() too generic?

Nope!

I'll work on a PR and then another to re-do the data into long and wide formats.

JamesHWade / measure

data formatting, internal and external #5