datacamp / funneljoin

Join tables based on events occurring in sequence in a funnel.
Other
140 stars 7 forks source link

Some suggestions #22

Open ramnathv opened 5 years ago

ramnathv commented 5 years ago

@robinsones First off, great package! It makes creating and analyzing funnels clean and easy. Based on my navigation of the package's API, I had some suggestions:

You might want to make landed and registered datasets in the package so users can get started with the examples without having to define them. This can be extended to define out-of-memory versions of the data by taking advantage of dbplyr::src_dbi.

The notion of a join type is great and factors in the multiple scenarios that one might run into. However, I had trouble visualizing the execution and how each type results in a different output compared to the pure version of the join function. So I put together a helper function that allows one to visualize the differences. Based on this, here is my understanding of the working of after_join:

  1. Join the tables using the regular version of {mode}_join
  2. Filter out all records where event_x occurs before event_y
  3. For each user_id only retain versions of events specified by type
    • any-any will retain all records in Step (2)
    • first-first, first-firstafter, and lastbefore-firstafter will retain only ONE record per user.

Is my understanding correct? It might be useful to add something like this to the documentation so it is really clear to users how these after_joins work.

after_join_all ```r after_join_all <- function(x, y, by, mode = 'inner', ...){ types <- c( 'first-first', 'first-firstafter', 'lastbefore-firstafter', 'any-firstafter', 'any-any' ) by_type <- function(type){ after_join(x, y, ..., mode = mode, type = type) %>% mutate(!!type := 'Y') } join_fun <- match.fun(paste0(mode, '_join')) all_types <- types %>% purrr::map(by_type) join_fun(x, y, by = by) %>% Reduce(left_join, all_types, init = .) } ```
user_id timestamp.x timestamp.y first-first first-firstafter lastbefore-firstafter any-firstafter any-any
1 2018-07-01 2018-07-02 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
3 2018-07-02 2018-07-02 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
4 2018-07-01 2018-06-10 NA NA NA NA NA
4 2018-07-01 2018-07-02 NA :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
4 2018-07-04 2018-06-10 NA NA NA NA NA
4 2018-07-04 2018-07-02 NA NA NA NA NA
5 2018-07-10 2018-07-11 :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark: :white_check_mark:
5 2018-07-12 2018-07-11 NA NA NA NA NA
6 2018-07-07 2018-07-10 :white_check_mark: :white_check_mark: NA :white_check_mark: :white_check_mark:
6 2018-07-07 2018-07-11 NA NA NA NA :white_check_mark:
6 2018-07-08 2018-07-10 NA NA :white_check_mark: :white_check_mark: :white_check_mark:
6 2018-07-08 2018-07-11 NA NA NA NA :white_check_mark:
robinsones commented 5 years ago

Thanks Ramnath! I'll work on these this week.

robinsones commented 5 years ago

@ramnathv from my understanding of this function, you'd want to specify by_user and by_time in it, and the by argument would be the same as the by_user. Is that right? If so, it seems simpler to replace by with by_user and by_time

How I reproduced your result:

landed <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-01",
  2, "2018-07-01",
  3, "2018-07-02",
  4, "2018-07-01",
  4, "2018-07-04",
  5, "2018-07-10",
  5, "2018-07-12",
  6, "2018-07-07",
  6, "2018-07-08"
) %>%
  mutate(timestamp = as.Date(timestamp))

registered <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-02",
  3, "2018-07-02",
  4, "2018-06-10",
  4, "2018-07-02",
  5, "2018-07-11",
  6, "2018-07-10",
  6, "2018-07-11",
  7, "2018-07-07"
) %>%
  mutate(timestamp = as.Date(timestamp))

after_join_all(x = landed, 
               y = registered, 
               by = "user_id",
               by_user = "user_id", 
               by_time = "timestamp") 
ramnathv commented 5 years ago

@robinsones That will work! Note that I am proposing this only as an internal utility function to help make clear the distinction of the output produced by the different variations, since it is a little nuanced. It will only be useful when working with toy datasets.