datacamp / funneljoin

Join tables based on events occurring in sequence in a funnel.
Other
140 stars 7 forks source link

Bug with first-firstafter w/ remote tables #9

Closed robinsones closed 5 years ago

robinsones commented 5 years ago
landed <- tbl_views_snowplow_user_page_views() %>%
  filter(page_url_path %in% c('/home', '/home/'), 
         page_view_start > "2018-07-01") %>%
  select(tstamp = page_view_start, user_id = signed_in_user_id)  

exercises <- tbl_main_user_exercises() %>%
  filter(started_at > "2018-07-01", 
         completed) %>%
  select(tstamp = started_at, user_id) 

Something is wrong with "first-firstafter" with remote tables. This works:

after_join(landed, exercises, by_time = "tstamp", 
by_user = "user_id", mode = "inner", type = "first-any") 

as does collecting the two tables, converting both user_ids to integers, and then running the join.

But this does not:

after_join(landed, exercises, by_time = "tstamp", 
by_user = "user_id", mode = "inner", type = "first-firstafter") 

It gets the error "Error in result_create(conn@ptr, statement) : Failed to prepare query: ERROR: column "tstamp" does not exist in tvxfchveli"

robinsones commented 5 years ago

Sometimes actually none of them work 😕

two_weeks_ago <- today() - days(14)

landed <- tbl_views_snowplow_user_page_views() %>%
  filter(page_url_path %in% c('/courses', '/courses/'), 
         page_view_start >= two_weeks_ago, 
         page_url_host == 'www.datacamp.com', 
         is.na(signed_in_user_id)) %>%
  select(tstamp = page_view_start, domain_userid) 

registered <- tbl_snowplow_register_1() %>%
   filter(root_tstamp >= two_weeks_ago) %>%
    inner_join(tbl_redshift('snowplow_atomic.events') %>%
               filter(collector_tstamp >= two_weeks_ago) %>%
               select(root_id = event_id, event, 
                      event_tstamp = collector_tstamp, domain_userid,
                      user_id),
             by = c("root_id", "root_tstamp" = "event_tstamp")) %>%
  select(tstamp = root_tstamp, domain_userid, user_id)

registration_funnel <- landed %>%
    after_join(registered, 
           by_time = "tstamp", 
           by_user = "domain_userid", 
           mode = "left", 
           type = "first-first") 

exercises <- tbl_main_user_exercises() %>%
  filter(started_at >= two_weeks_ago, 
         completed) %>%
  select(timestamp = started_at, user_id)

This fails whatever type you use

registration_funnel %>%
  filter(!is.na(user_id)) %>%
  rename(timestamp = tstamp.y) %>%
  select(-tstamp.x) %>% 
  after_join(exercises, 
           by_time = "timestamp", 
           by_user = "user_id", 
           mode = "left", 
           type = "first-any") 

but it works when you collect then run.

robinsones commented 5 years ago

@dgrtwo, anthony and I have pinpointed the problem at least for the first-firstafter join. It's happening here: https://github.com/datacamp/funneljoin/blob/master/R/after_join.R#L156

We figure it's some sort of tidyeval issue, but we can't figure it out. It's especially confusing as this code up here works: https://github.com/datacamp/funneljoin/blob/master/R/after_join.R#L138

The error is:

Error in result_create(conn@ptr, statement) : 
  Failed to prepare query: ERROR:  column "tstamp" does not exist in nfmnrlxsup
robinsones commented 5 years ago

fixed by this commit