Remove all references to survey results

shankari commented 3 years ago

This is a first step towards splitting the code into the common functionality, which deals with sensed and analysed results, and the study-specific functionality, which will go into another module in the future.

This removes the existing implementations of query_trip_ends, query_user_profile_survey, and tidy_trip_ends, and all references to them.

The one place where we didn't just remove the reference was in most_recent_trip_ends. The subsequent code deals with missing recent trip ends, so we set most_recent_trip_ends to an empty vector instead of removing the subsequent references.

asiripanich commented 3 years ago

https://github.com/asiripanich/emdash/issues/2#issuecomment-708052926

About this comment, the uuid field was dropped and renamed to user_id using normalise_uuid(). As is, uuid is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?

asiripanich commented 3 years ago

Please send me your test database so I can review this PR.

ahcyip commented 3 years ago

#2 (comment)

About this comment, the uuid field was dropped and renamed to user_id using normalise_uuid(). As is, uuid is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?

@shankari List-columns is an R tidyverse thing I could help with :) actually I think Python pandas and Julia added list-columns too recently...

shankari commented 3 years ago

About this comment, the uuid field was dropped and renamed to user_id using normalise_uuid(). As is, uuid is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?

@asiripanich @ahcyip the problem is that there is no uuid field in the trips dataframe. The output of the print(names(.data)) line https://github.com/asiripanich/emdash/pull/4/files#diff-c750c3f7803731b61a50428fb1429cd242398df6a8483044b54c7997e2f7ce54R75

is

 [1] "user_id"                          "metadata.key"
 [3] "metadata.platform"                "metadata.write_ts"
 [5] "metadata.time_zone"               "metadata.write_local_dt.year"
 [7] "metadata.write_local_dt.month"    "metadata.write_local_dt.day"
 [9] "metadata.write_local_dt.hour"     "metadata.write_local_dt.minute"
[11] "metadata.write_local_dt.second"   "metadata.write_local_dt.weekday"
[13] "metadata.write_local_dt.timezone" "metadata.write_fmt_time"
[15] "data.source"                      "data.end_ts"
[17] "data.end_local_dt.year"           "data.end_local_dt.month"
[19] "data.end_local_dt.day"            "data.end_local_dt.hour"
[21] "data.end_local_dt.minute"         "data.end_local_dt.second"
[23] "data.end_local_dt.weekday"        "data.end_local_dt.timezone"
[25] "data.end_fmt_time"                "data.end_loc.type"
[27] "data.end_loc.coordinates"         "data.raw_trip"
[29] "data.start_ts"                    "data.start_local_dt.year"
[31] "data.start_local_dt.month"        "data.start_local_dt.day"
[33] "data.start_local_dt.hour"         "data.start_local_dt.minute"
[35] "data.start_local_dt.second"       "data.start_local_dt.weekday"
[37] "data.start_local_dt.timezone"     "data.start_fmt_time"
[39] "data.start_loc.type"              "data.start_loc.coordinates"
[41] "data.duration"                    "data.distance"
[43] "data.start_place"                 "data.end_place"

There is no uuid column.

asiripanich commented 3 years ago

I'm reviewing the PR right now. Should not be hard to resolve this issue. :)

shankari commented 3 years ago

@asiripanich but that error is before the rename. The print statement is at line 75 The uuid column check is line 76 The rename to user_id and the delete of the uuid field is on lines 80 and 81-83

I don't see how this can be due to the rename/refactor in that case, unless I really don't understand R at all.

shankari commented 3 years ago

@ahcyip I don't think the problem is with list columns at this point because the check (line 76) is before all the complicated stuff with the columns (line 80, sapply(uuid, function(.x) paste0(unlist(.x), collapse = ""))]).

Or maybe I just don't understand R 😄

asiripanich commented 3 years ago

To say the least, I had created a bug in 3cb9df924582507d3eb9b3da554c7e4b25e3b66c, so I reverted that commit and removed all references to survey results. Please try it on your machine and let me know how it goes.

Once you confirm that this is working, please consider merging the master branch into this feature branch so that the new CI can check this PR.

shankari commented 3 years ago

Good catch on the .gitignore - I should have included that. Just for my own edification, the main change to normalise_uuid was to change from if (! uuid column) stop to if (uuid column) { do normalization } right?

shankari commented 3 years ago

I am pulling and testing this now.

shankari commented 3 years ago

@asiripanich @ahcyip I ran into the error

About to load trips
Finished query, about to clean trips
Warning: Error in -: invalid argument to unary operator
  [No stack trace available]
Warning in min(data_r$participants$update_ts) :
  no non-missing arguments to min; returning Inf

After adding some additional logging, I figured out that this was in tidy_cleaned_trips. Since that is effectively a giant function chain, and I'm not sure how to add logging to a function chain, I commented out each function that had a unary operator. It turned out that the problem was with

    {
      sf::st_sf(.[, -"geometry"],
                geometry = .[["geometry"]],
                crs = project_crs)
    } %>%

If I remove that, the execution moves a lot further

About to load trips
Finished query, about to clean trips
Finished cleaning trips
Finished loading trips
About to load participants
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token",  :
  Column 'uuid' does not exist to remove
...

shankari commented 3 years ago

It looks like the problem is that you can't use a string when dropping columns?

In https://r.789695.n4.nabble.com/quot-invalid-argument-to-unary-operator-quot-while-selecting-rows-by-name-td4724899.html

data["601",] doesn't generate an error because you can also refer to a
row by its name, as an alternative to refering to it by row number.
It's the same with vectors, just consider the following case.

(x <- c("601"=1, b=2)) x[1] x["601"] # the same

But when you want to remove it you must negate an index number so
x[-"601"] is wrong for reasons already explained.

shankari commented 3 years ago

Fix seems to be to use dplyr::select

      sf::st_sf(dplyr::select(., -geometry),
                geometry = .[["geometry"]],
                crs = project_crs)

Now the error is

About to load participants
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token",  :
  Column 'uuid' does not exist to remove
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token",  :
  Column 'device_token' does not exist to remove
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token",  :
  Column 'curr_sync_interval' does not exist to remove

I have to work on something else for a bit but will return to this after that.

shankari commented 3 years ago

Next set of errors

Warning: Error in sf::st_drop_geometry: st_drop_geometry only works with objects of class sf
  [No stack trace available]
Warning in min(data_r$participants$update_ts) :
  no non-missing arguments to min; returning Inf
Warning in min(data_r$participants$update_ts) :
  no non-missing arguments to min; returning Inf
Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "NULL"
  [No stack trace available]
Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "NULL"
  [No stack trace available]
Warning in grSoftVersion() :
  unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
  libXt.so.6: cannot open shared object file: No such file or directory

shankari commented 3 years ago

@asiripanich can you check https://github.com/asiripanich/emdash/pull/4/commits/1c28ef78138d0db5e6c209d5d20584ad965c6f79

I am not sure if it is correct to do that. Without it, I get the error Warning: Error in sf::st_drop_geometry: st_drop_geometry only works with objects of class sf. But we do convert to an st object in R/utils_tidy_data.R and add a geometry.

I am not sure why we need to add a geometry and then remove it....

ahcyip commented 3 years ago

It looks like I'm too late but you can use %T>% to add print statements and keep the pipe going: https://stackoverflow.com/questions/61196304/magrittr-tee-pipe-t-equivalent

shankari commented 3 years ago

Ah so this error https://github.com/asiripanich/emdash/pull/4#issuecomment-708641799 appears to be because data_r$trips is null, even after tidy_cleaned_trips is done

    message("Finished loading trips")
    print(data_r$trips)

outputs

Finished loading trips
NULL

That is also probably the root cause for many of the subsequent errors as well.

shankari commented 3 years ago

Ah adding a return(cleaned_trips_sf) to the end of tidy_cleaned_trips has got us almost all the way there. Also reverted https://github.com/asiripanich/emdash/commit/1c28ef78138d0db5e6c209d5d20584ad965c6f79 since the problem was the NULL value.

Remaining errors

Finished loading participants
Warning in grSoftVersion() :
  unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
  libXt.so.6: cannot open shared object file: No such file or directory
Warning: Error in FUN: object 'branch' not found
  [No stack trace available]
Warning: Error in FUN: object 'curr_platform' not found
  [No stack trace available]

But the dashboard is actually visible finally. Woo! Woo! The end of the tunnel is in sight.

shankari commented 3 years ago

ok so the final two entries are because:

we don't have multiple clients
the curr_platform is only set when we have push notifications working

I will remove the client graph (@asiripanich any thoughts on other data to put there?) I will make sure that push notifications work in the android emulator to get the second graph to work.

shankari commented 3 years ago

@asiripanich at this point there are no errors on the R console other than

Warning in grSoftVersion() :
  unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
  libXt.so.6: cannot open shared object file: No such file or directory

The dashboard screen and the maps screen load successfully. I just can't get the tables plot to work. How are they supposed to work?

shankari commented 3 years ago

Or maybe they don't work if we have only one user. I notice that the "table below" has only one entry. I should test this on the real data 👍

asiripanich commented 3 years ago

@asiripanich at this point there are no errors on the R console other than
Warning in grSoftVersion() :
  unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
  libXt.so.6: cannot open shared object file: No such file or directory
The dashboard screen and the maps screen load successfully. I just can't get the tables plot to work. How are they supposed to work?

the tables plot work fine for me. Maybe you are missing some dependencies?

shankari commented 3 years ago

Maybe you are missing some dependencies?

I'm running this in docker. So presumably all the dependencies are in the dockerfile?

asiripanich commented 3 years ago

I haven't updated the dockerfile since my deployment. Let me check that as well.

shankari commented 3 years ago

Are you sure this is not just related to there being only one entry in the table? Do the plots work for you even when run against the test data?

asiripanich commented 3 years ago

but if i choose the 'participants' table then nothing seems to appear on the plot panel.

shankari commented 3 years ago

Ah that explains it! The participant table is selected by default. I can confirm that if I switch to trips, the plot works for me too.

shankari commented 3 years ago

Should we just remove participant from the tables?

asiripanich commented 3 years ago

All the changes look good to me. Admittedly, since I was just trying to get this to work for me asap, there are many functions should be refactored to make the codebase more maintainable and less confusing (e.g., less mixing or dplyr and data.table).

asiripanich commented 3 years ago

Should we just remove participant from the tables?

umm I think I can fix that. Let me try that first.

shankari commented 3 years ago

Yeah I got confirmation today that I can work with an intern in November. I'm sure there will be requests to expand the dashboard and we can see how to refactor at that time. I'm hoping one of the interns knows R better than me.

asiripanich commented 3 years ago

so the code that generates the plot is

 x %>%
    filter(update_ts >= "2020-10-13 17:40:29" & update_ts <= "2020-10-13 17:40:29") %>%
    ggplot() +
    aes(x = user_id, y = n_trips) +
    geom_boxplot(fill = "#0c4c8a") +
    theme_minimal()

And because of the line with filter(...), there is no data for plotting. I'm pretty sure that this worked in my deployment.

...

shankari commented 3 years ago

@asiripanich this does seem to be because there was only one participant before. I created test data for a second participant and I can now see some data in the participant table. I am still not sure if this is working properly though. I created the second participant as "ios", but when I select curr_platform I only see android on the axis and the data shows 1/2

Two entries, but only android displayed	only row 1/2?

shankari commented 3 years ago

I can confirm that in the "Trips" tab, when I click on "Data", it shows me rows Number of rows: 72 / 72 at the top. But when I click on "Data" from participants, it shows me "Number of rows: 1 / 2 ". And at the bottom, it says "Showing 1 to 2 of 2 entries" although both entries are visible, and I have selected both of them

asiripanich commented 3 years ago

I can confirm that in the "Trips" tab, when I click on "Data", it shows me rows Number of rows: 72 / 72 at the top. But when I click on "Data" from participants, it shows me "Number of rows: 1 / 2 ". And at the bottom, it says "Showing 1 to 2 of 2 entries" although both entries are visible, and I have selected both of them

For the 'participant' table, can you check what is the code that it uses to generate the plot? Click on 'Export & Code' next to the 'Data' button.

Please note that, highlighting entries on the table below the plot panel doesn't mean you are filtering the data that the plot can use. The plot panel only links to the table you are currently selecting.

shankari commented 3 years ago

Please note that, highlighting entries on the table below the plot panel doesn't mean you are filtering the data that the plot can use. The plot panel only links to the table you are currently selecting.

I see the exact same behavior even when I haven't selected anything, after closing the tab and reopening it fresh.

shankari commented 3 years ago

For the 'participant' table, can you check what is the code that it uses to generate the plot? Click on 'Export & Code' next to the 'Data' button.

data_esquisse %>%
 filter(update_ts >= "2020-10-14 16:05:42" & update_ts <= "2020-10-14 17:09:59") %>%

So clearly the problem is with the update_ts check?

shankari commented 3 years ago

The update_ts field of the two entries is

'update_ts': datetime.datetime(2020, 10, 14, 16, 5, 42, 641000),
'update_ts': datetime.datetime(2020, 10, 14, 17, 9, 59, 906000),

So it looks like it is doing query for the range of the table but is missing some milliseconds.

asiripanich commented 3 years ago

"2020-10-14 17:09:59")

From my test filter() doesn't automatically convert "2020-10-14 17:09:59" to a datetime object so that maybe another reason we don't see any data.

lubridate::ymd_hms() converts chr to POSIXct.

suppressMessages(library(dplyr))
suppressMessages(library(data.table))
suppressMessages(library(lubridate))
library(emdash)

cons = connect_stage_collections()

tidied_cleaned_trips = 
  query_cleaned_trips(cons) %>%
  tidy_cleaned_trips()
#> Finished query, about to clean trips
#> Finished cleaning trips

tidied_participants = 
  tidy_participants(query_stage_profiles(cons), query_stage_uuids(cons))
#> [1] "user_id"    "mpg_array"  "source"     "update_ts"  "client"    
#> [6] "user_email"
#> [1] "mpg_array" "source"

participants <- summarise_trips(tidied_participants, tidied_cleaned_trips)

participants = rbind(copy(participants)[1, `:=`(user_id = "test", n_trips = 100, 
                                                update_ts = lubridate::ymd_hms("2020-10-13 17:40:30"))], participants)

str(participants)  
#> Classes 'data.table' and 'data.frame':   2 obs. of  9 variables:
#>  $ user_id            : chr  "test" "192c82d3c1da4a679edaceb81693fce1"
#>  $ update_ts          : POSIXct, format: "2020-10-14 04:40:30" "2020-10-14 04:40:29"
#>  $ client             : logi  NA NA
#>  $ n_trips            : int  100 50
#>  $ n_trips_today      : int  0 0
#>  $ n_active_days      : int  10 10
#>  $ first_trip_datetime: POSIXct, format: "2016-07-22 16:26:08" "2016-07-22 16:26:08"
#>  $ last_trip_datetime : POSIXct, format: "2016-08-11 21:05:09" "2016-08-11 21:05:09"
#>  $ n_days             : num  20.2 20.2
#>  - attr(*, ".internal.selfref")=<externalptr>

# cannot filter datetime
participants %>%
  filter(update_ts >= "2020-10-13 17:40:28" & 
           update_ts <= "2020-10-13 17:40:30")
#> Empty data.table (0 rows and 9 cols): user_id,update_ts,client,n_trips,n_trips_today,n_active_days...

# can filter datetime
participants %>%
  filter(update_ts >= lubridate::ymd_hms("2020-10-13 17:40:28") & 
           update_ts <= lubridate::ymd_hms("2020-10-13 17:40:30"))
#>                             user_id           update_ts client n_trips
#> 1:                             test 2020-10-14 04:40:30     NA     100
#> 2: 192c82d3c1da4a679edaceb81693fce1 2020-10-14 04:40:29     NA      50
#>    n_trips_today n_active_days first_trip_datetime  last_trip_datetime   n_days
#> 1:             0            10 2016-07-22 16:26:08 2016-08-11 21:05:09 20.19376
#> 2:             0            10 2016-07-22 16:26:08 2016-08-11 21:05:09 20.19376

^{Created on 2020-10-15 by the reprex package (v0.3.0)}

shankari commented 3 years ago

the interesting thing is that in the trips tab, "Export & code" doesn't show anything. If we can get the participants table to also stop filtering, maybe that would be good.

shankari commented 3 years ago

I don't see where the participants table is treated differently from the trips table.

shankari commented 3 years ago

Maybe this is not too terrible - the participants table will always skip the last entry, but maybe that is good enough for now. Don't want to make you waste your time...

asiripanich commented 3 years ago

let's come back to this again if there is a need for it. :)

shankari commented 3 years ago

I have updated the test data to have two participants and generated another PR. Do you want me to cherry-pick here as well?

shankari commented 3 years ago

@asiripanich I figured out the participant table issue (https://github.com/asiripanich/emdash/pull/4/commits/2257d235935c43fa70e2757a36477b0666a6e90e). I think this is ready to merge now.

asiripanich commented 3 years ago

@asiripanich I figured out the participant table issue (2257d23). I think this is ready to merge now.

I have tested it and it doesn't fix the issue for me.

asiripanich commented 3 years ago

@asiripanich I figured out the participant table issue (2257d23). I think this is ready to merge now.

I have tested it and it doesn't fix the issue for me.

I take this back. It is working and I'm merging this now.

asiripanich / emdash

Remove all references to survey results #4