business-science / timetk

Time series analysis in the `tidyverse`
https://business-science.github.io/timetk/

Capturing vector transformation parameters #127

Open realauggieheschmeyer opened 2 years ago

realauggieheschmeyer commented 2 years ago

Both log_interval_vec() and standardize_vec() will print the auto-detected parameters used to scale the target variable.

For example:

log_interval_vec(): 
 Using limit_lower: 0
 Using limit_upper: 12
 Using offset: 1

Standardization Parameters
mean: -3.0500341071016
standard deviation: 1.22764358571979

However, there is currently no native way to capture these parameters other than reading the printed text and saving the values by hand. That is fine for one-off analyses, but it prevents these functions from being used in an automated forecasting workflow. The target variable can be scaled automatically, but unless the parameters are stored and accessible later, predictions on the scaled variable cannot be transformed back to the original scale without human intervention.
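To make the dependency concrete, here is a minimal base R sketch of the round trip. It assumes the log-interval transform has the form log((x + offset - limit_lower) / (limit_upper - (x + offset))), which is the form used in the manual code in this issue. The ticket counts, the `params` list, and the `undo_scaling()` helper are hypothetical and not part of timetk.

```r
# Illustrative data and parameters (made up for this sketch)
tickets <- c(2, 5, 7, 10)
params <- list(limit_lower = 0, limit_upper = 12, offset = 1)

# Forward: log-interval transform, then standardize
x <- tickets + params$offset
logit <- log((x - params$limit_lower) / (params$limit_upper - x))
params$mean <- mean(logit)
params$sd <- sd(logit)
scaled <- (logit - params$mean) / params$sd

# Inverse: undo the standardization, then undo the log-interval transform.
# Every value in `p` is required; none can be recovered from `z` alone.
undo_scaling <- function(z, p) {
  y <- z * p$sd + p$mean
  (p$limit_upper * exp(y) + p$limit_lower) / (1 + exp(y)) - p$offset
}

recovered <- undo_scaling(scaled, params)
```

All five stored values (both limits, the offset, the mean, and the standard deviation) are needed to invert the scaling, which is why printing them is not enough for an unattended workflow.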

It would be nice to have some helper function that can be run prior to mutating your target variable to extract the relevant parameters and save them for later in the workflow.

Below is the code I wrote to capture these parameters manually:

library(dplyr)

# Per-group limits for the log-interval transformation
log_params <- ticket_volume_pad_tbl %>%
  group_by(department, ticket_type) %>%
  summarize(
    limit_lower = 0,
    limit_upper = (max(tickets) * 1.1) + 1,
    .groups = "drop"
  )

# Per-group mean and standard deviation of the log-interval-transformed target
standardization_params <- ticket_volume_pad_tbl %>%
  left_join(log_params, by = c("department", "ticket_type")) %>%
  mutate(
    # Log-interval transform with an offset of 1
    tickets_scaled = log(((tickets + 1) - limit_lower) / (limit_upper - (tickets + 1)))
  ) %>%
  group_by(department, ticket_type) %>%
  summarize(
    mean = mean(tickets_scaled),
    standard_deviation = sd(tickets_scaled),
    .groups = "drop"
  )

# One row per group with all four parameters
log_params %>%
  left_join(standardization_params, by = c("department", "ticket_type"))

If it's helpful, I can try my hand at converting the above into a function, but I'd love some guidance on how to style it to match the existing timetk functions.
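As a starting point for discussion, the manual code above could be wrapped roughly as follows. This is only a sketch: `get_scaling_params()` and `log_interval()` are hypothetical names, not timetk functions, and the upper-limit rule (`max * 1.1 + offset`) is copied from the manual code rather than taken from timetk's internals. The `demo_tbl` data is made up.

```r
library(dplyr)

# Hypothetical helper: log-interval transform in the form used in this issue
log_interval <- function(x, lower, upper, offset) {
  log((x + offset - lower) / (upper - (x + offset)))
}

# Hypothetical helper: one row of scaling parameters per group.
# `value` is the target column; `...` are the grouping columns.
get_scaling_params <- function(data, value, ..., offset = 1, lower = 0) {
  data %>%
    group_by(...) %>%
    summarize(
      limit_lower = lower,
      # Assumption: same upper-limit rule as the manual code above
      limit_upper = max({{ value }}) * 1.1 + offset,
      mean = mean(log_interval({{ value }}, limit_lower, limit_upper, offset)),
      standard_deviation = sd(log_interval({{ value }}, limit_lower, limit_upper, offset)),
      .groups = "drop"
    )
}

# Made-up data with two groups
demo_tbl <- tibble(
  department = c("A", "A", "A", "B", "B", "B"),
  tickets = c(2, 5, 10, 1, 3, 4)
)

params_tbl <- get_scaling_params(demo_tbl, tickets, department)
```

The result can be joined back onto the forecasts by group, so each prediction is inverted with the parameters of its own series.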

spsanderson commented 2 years ago

I did something similar but it was strictly for my own use, see here:

https://github.com/spsanderson/healthyverse_tsa/blob/master/00_scripts/data_manipulation_functions.R

realauggieheschmeyer commented 2 years ago

In addition to automated workflows, the manual nature of this process would also be problematic if you had a large number of groups in your data. Just imagine trying to forecast retail SKUs and having to manually log hundreds or thousands of parameters 😰