Different input steps for wide and long data

topepo commented 11 months ago

As previously discussed.

Feel free to modify or suggest modifications. I did change the names of the embedded tibble to location and value.

I used tidyr::nest() to embed the data. One potential issue is that, if the non-measurement data have duplicate information, the data may not nest well. I'm thinking about how to figure that out when it happens.
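A minimal sketch of the nesting concern, assuming long-format data with no unique sample identifier. The column names (batch, operator) and the duplicate check are hypothetical illustrations, not code from this PR:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: two distinct samples that happen to share the
# same non-measurement values (batch and operator) and have no sample id.
long_data <- tibble(
  batch    = "A",
  operator = "x",
  location = c(1, 2, 3, 1, 2, 3),
  value    = c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3)
)

# Nesting the measurement columns groups by everything else, so both samples
# collapse into a single row holding all six measurements.
nested <- nest(long_data, measurements = c(location, value))
nrow(nested)

# One possible check: within each combination of non-measurement values,
# a location should appear only once.
dups <- long_data %>%
  count(batch, operator, location, name = "n_rows") %>%
  filter(n_rows > 1)
dups
```

Any rows in dups would suggest that the non-measurement columns do not uniquely identify a sample, so an explicit sample identifier may be needed before nesting.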

topepo commented 11 months ago

Do we think there will be data gaps between samples or samples with different "locations"?

I was working on similar export functions (to send data back to long or wide formats). That's easy if the measurements have a consistent structure.

I think the training set could be used to define the "acceptable locations" we can hold on to. From there, we can reformat into the appropriate structure using that (and induce missing values in the process).

Here is some code to try to illustrate the problem.

library(tidymodels)


tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)


set.seed(2)
ex_data <- 
  crossing(sample = 1:4, location = seq(20, 25, length.out = 6)) %>% 
  mutate(
    value = rnorm(n()),
    # A shift in one sample: 
    location = ifelse(sample == 3, location + .1, location)
    ) %>% 
  # missing one sample
  slice(-3)

# If this was the training set, perhaps save a key: 
loc_key <- 
  distinct(ex_data, location) %>% 
  arrange(location) %>% 
  mutate(key = recipes::names0(n(), "loc_"))

Created on 2023-09-22 with reprex v2.0.2
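As a rough illustration of how such a key might be used for export, the sketch below joins the key back to the data and pivots to wide format so that every training location gets a column and gaps become missing values. It rebuilds the example data so it runs on its own, and it uses sprintf() as a stand-in for recipes::names0() to avoid the extra dependency. The names_expand argument assumes tidyr 1.2.0 or later.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from above so this block is self-contained.
set.seed(2)
ex_data <-
  crossing(sample = 1:4, location = seq(20, 25, length.out = 6)) %>%
  mutate(
    value = rnorm(n()),
    location = ifelse(sample == 3, location + .1, location)
  ) %>%
  slice(-3)

# Stand-in for recipes::names0(): zero-padded column names, one per location.
loc_key <-
  distinct(ex_data, location) %>%
  arrange(location) %>%
  mutate(key = sprintf("loc_%02d", row_number()))

# Attach the key, then spread to one column per training location.
# Making `key` a factor with all levels lets names_expand = TRUE create
# columns even for locations that are absent from the data being converted.
wide <-
  ex_data %>%
  inner_join(loc_key, by = "location") %>%
  mutate(key = factor(key, levels = loc_key$key)) %>%
  select(sample, key, value) %>%
  pivot_wider(names_from = key, values_from = value, names_expand = TRUE)

dim(wide)
sum(is.na(wide))
```

Note that the join matches locations exactly, so new data whose locations are shifted even slightly from the training locations would be dropped by inner_join() rather than matched.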

JamesHWade commented 11 months ago

New Steps

I like the approach of separating inputs into long and wide since it ties into tidyr::pivot_*().

I don't have a better suggestion, but I want to think more on location as the name for our independent variables. It's logical, but it will also be unfamiliar to a typical measurement scientist. Examples that come to mind: time, volume, wavelength, frequency, and position. Again, I don't have a better suggestion -- independent variable is worse IMO. I say we keep location for now.

Location

It's a safe assumption that location will dance around a bit. A few examples I can think of:

varying sampling rates (e.g., 10 Hz for detector A, 30 Hz for detector B)
"frame shift" of detector starting point

A set of steps for "alignment" would be a good addition. In some cases, you may want to align to a known standard (e.g., a reference peak). In others, you would want to apply a calibration to the data. So something like step_measure_align and step_measure_apply_calibration.

Naively, these seem like hard functions to write, but maybe identifying the right abstractions could make it easier.
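One possible abstraction for the simplest kind of alignment is to interpolate each sample onto a common reference grid. The sketch below uses linear interpolation via stats::approx(); the helper align_to_grid() and the reference grid are hypothetical and are not the proposed step_measure_align. It assumes dplyr 1.1.0 or later for reframe().

```r
library(dplyr)

# Hypothetical reference grid, e.g. the locations seen in the training set.
ref_grid <- seq(20, 25, by = 1)

# Toy data: sample 2 is shifted by 0.1. Values follow a known line so the
# interpolated results are easy to check.
shifted <- tibble(
  sample   = rep(1:2, each = 6),
  location = c(ref_grid, ref_grid + 0.1),
  value    = 2 * location
)

# Linearly interpolate one sample onto the grid. Grid points outside the
# observed range become NA (the approx() default, rule = 1).
align_to_grid <- function(location, value, grid) {
  tibble(location = grid, value = approx(location, value, xout = grid)$y)
}

aligned <- shifted %>%
  group_by(sample) %>%
  reframe(align_to_grid(location, value, ref_grid))

aligned
```

Here the shifted sample loses its first grid point, since 20 falls below its observed range, which is the same "induce missing values" behavior discussed earlier. Aligning to a reference peak or applying a calibration would need more than interpolation.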

JamesHWade / measure

Different input steps for wide and long data #12
