Closed shankari closed 3 years ago
https://github.com/asiripanich/emdash/issues/2#issuecomment-708052926
About this comment, the uuid
field was dropped and renamed to user_id
using normalise_uuid()
. As is, uuid
is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?
Please send me your test database so I can review this PR.
About this comment, the
uuid
field was dropped and renamed touser_id
usingnormalise_uuid()
. As is,uuid
is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?
@shankari List-columns is an R tidyverse thing I could help with :) actually I think Python pandas and Julia added list-columns too recently...
About this comment, the uuid field was dropped and renamed to user_id using normalise_uuid(). As is, uuid is a list column, so it has to be flattened into a char. Do you prefer consistent naming to the original data?
@asiripanich @ahcyip the problem is that there is no uuid
field in the trips dataframe. The output of the print(names(.data))
line
https://github.com/asiripanich/emdash/pull/4/files#diff-c750c3f7803731b61a50428fb1429cd242398df6a8483044b54c7997e2f7ce54R75
is
[1] "user_id" "metadata.key"
[3] "metadata.platform" "metadata.write_ts"
[5] "metadata.time_zone" "metadata.write_local_dt.year"
[7] "metadata.write_local_dt.month" "metadata.write_local_dt.day"
[9] "metadata.write_local_dt.hour" "metadata.write_local_dt.minute"
[11] "metadata.write_local_dt.second" "metadata.write_local_dt.weekday"
[13] "metadata.write_local_dt.timezone" "metadata.write_fmt_time"
[15] "data.source" "data.end_ts"
[17] "data.end_local_dt.year" "data.end_local_dt.month"
[19] "data.end_local_dt.day" "data.end_local_dt.hour"
[21] "data.end_local_dt.minute" "data.end_local_dt.second"
[23] "data.end_local_dt.weekday" "data.end_local_dt.timezone"
[25] "data.end_fmt_time" "data.end_loc.type"
[27] "data.end_loc.coordinates" "data.raw_trip"
[29] "data.start_ts" "data.start_local_dt.year"
[31] "data.start_local_dt.month" "data.start_local_dt.day"
[33] "data.start_local_dt.hour" "data.start_local_dt.minute"
[35] "data.start_local_dt.second" "data.start_local_dt.weekday"
[37] "data.start_local_dt.timezone" "data.start_fmt_time"
[39] "data.start_loc.type" "data.start_loc.coordinates"
[41] "data.duration" "data.distance"
[43] "data.start_place" "data.end_place"
There is no uuid column.
I'm reviewing the PR right now. Should not be hard to resolve this issue. :)
@asiripanich but that error is before the rename.
The print statement is at line 75
The uuid column check is line 76
The rename to user_id
and the delete of the uuid
field is on lines 80 and 81-83
I don't see how this can be due to the rename/refactor in that case, unless I really don't understand R at all.
@ahcyip I don't think the problem is with list columns at this point because the check (line 76) is before all the complicated stuff with the columns (line 80, sapply(uuid, function(.x) paste0(unlist(.x), collapse = ""))]
).
Or maybe I just don't understand R 😄
To say the least, I had created a bug in 3cb9df924582507d3eb9b3da554c7e4b25e3b66c, so I reverted that commit and removed all references to survey results. Please try it on your machine and let me know how it goes.
Once you confirm that this is working, please consider merging the master branch into this feature branch so that the new CI can check this PR.
Good catch on the .gitignore
- I should have included that.
Just for my own edification, the main change to normalise_uuid
was to change from if (! uuid column) stop
to if (uuid column) { do normalization }
right?
I am pulling and testing this now.
@asiripanich @ahcyip I ran into the error
About to load trips
Finished query, about to clean trips
Warning: Error in -: invalid argument to unary operator
[No stack trace available]
Warning in min(data_r$participants$update_ts) :
no non-missing arguments to min; returning Inf
After adding some additional logging, I figured out that this was in tidy_cleaned_trips
. Since that is effectively a giant function chain, and I'm not sure how to add logging to a function chain, I commented out each function that had a unary operator. It turned out that the problem was with
{
sf::st_sf(.[, -"geometry"],
geometry = .[["geometry"]],
crs = project_crs)
} %>%
If I remove that, the execution moves a lot further
About to load trips
Finished query, about to clean trips
Finished cleaning trips
Finished loading trips
About to load participants
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token", :
Column 'uuid' does not exist to remove
...
It looks like the problem is that you can't use a string when dropping columns?
data["601",] doesn't generate an error because you can also refer to a
row by its name, as an alternative to refering to it by row number.
It's the same with vectors, just consider the following case.(x <- c("601"=1, b=2)) x[1] x["601"] # the same
But when you want to remove it you must negate an index number so
x[-"601"] is wrong for reasons already explained.
Fix seems to be to use dplyr::select
sf::st_sf(dplyr::select(., -geometry),
geometry = .[["geometry"]],
crs = project_crs)
Now the error is
About to load participants
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token", :
Column 'uuid' does not exist to remove
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token", :
Column 'device_token' does not exist to remove
Warning in `[.data.table`(., , `:=`(c("uuid", "source", "device_token", :
Column 'curr_sync_interval' does not exist to remove
I have to work on something else for a bit but will return to this after that.
Next set of errors
Warning: Error in sf::st_drop_geometry: st_drop_geometry only works with objects of class sf
[No stack trace available]
Warning in min(data_r$participants$update_ts) :
no non-missing arguments to min; returning Inf
Warning in min(data_r$participants$update_ts) :
no non-missing arguments to min; returning Inf
Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "NULL"
[No stack trace available]
Warning: Error in UseMethod: no applicable method for 'mutate_' applied to an object of class "NULL"
[No stack trace available]
Warning in grSoftVersion() :
unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
libXt.so.6: cannot open shared object file: No such file or directory
@asiripanich can you check https://github.com/asiripanich/emdash/pull/4/commits/1c28ef78138d0db5e6c209d5d20584ad965c6f79
I am not sure if it is correct to do that. Without it, I get the error Warning: Error in sf::st_drop_geometry: st_drop_geometry only works with objects of class sf
. But we do convert to an st object in R/utils_tidy_data.R
and add a geometry.
I am not sure why we need to add a geometry and then remove it....
It looks like I'm too late but you can use %T>% to add print statements and keep the pipe going: https://stackoverflow.com/questions/61196304/magrittr-tee-pipe-t-equivalent
Ah so this error
https://github.com/asiripanich/emdash/pull/4#issuecomment-708641799
appears to be because data_r$trips
is null, even after tidy_cleaned_trips
is done
message("Finished loading trips")
print(data_r$trips)
outputs
Finished loading trips
NULL
That is also probably the root cause for many of the subsequent errors as well.
Ah adding a return(cleaned_trips_sf)
to the end of tidy_cleaned_trips
has got us almost all the way there.
Also reverted https://github.com/asiripanich/emdash/commit/1c28ef78138d0db5e6c209d5d20584ad965c6f79 since the problem was the NULL
value.
Remaining errors
Finished loading participants
Warning in grSoftVersion() :
unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
libXt.so.6: cannot open shared object file: No such file or directory
Warning: Error in FUN: object 'branch' not found
[No stack trace available]
Warning: Error in FUN: object 'curr_platform' not found
[No stack trace available]
But the dashboard is actually visible finally. Woo! Woo! The end of the tunnel is in sight.
ok so the final two entries are because:
curr_platform
is only set when we have push notifications workingI will remove the client graph (@asiripanich any thoughts on other data to put there?) I will make sure that push notifications work in the android emulator to get the second graph to work.
@asiripanich at this point there are no errors on the R console other than
Warning in grSoftVersion() :
unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
libXt.so.6: cannot open shared object file: No such file or directory
The dashboard screen and the maps screen load successfully. I just can't get the tables plot to work. How are they supposed to work?
Or maybe they don't work if we have only one user. I notice that the "table below" has only one entry. I should test this on the real data 👍
@asiripanich at this point there are no errors on the R console other than
Warning in grSoftVersion() : unable to load shared object '/usr/local/lib/R/modules//R_X11.so': libXt.so.6: cannot open shared object file: No such file or directory
The dashboard screen and the maps screen load successfully. I just can't get the tables plot to work. How are they supposed to work?
the tables plot work fine for me. Maybe you are missing some dependencies?
Maybe you are missing some dependencies?
I'm running this in docker. So presumably all the dependencies are in the dockerfile?
I haven't updated the dockerfile since my deployment. Let me check that as well.
Are you sure this is not just related to there being only one entry in the table? Do the plots work for you even when run against the test data?
but if i choose the 'participants' table then nothing seems to appear on the plot panel.
Ah that explains it! The participant table is selected by default. I can confirm that if I switch to trips, the plot works for me too.
Should we just remove participant
from the tables?
All the changes look good to me. Admittedly, since I was just trying to get this to work for me asap, there are many functions should be refactored to make the codebase more maintainable and less confusing (e.g., less mixing or dplyr and data.table).
Should we just remove
participant
from the tables?
umm I think I can fix that. Let me try that first.
Yeah I got confirmation today that I can work with an intern in November. I'm sure there will be requests to expand the dashboard and we can see how to refactor at that time. I'm hoping one of the interns knows R better than me.
so the code that generates the plot is
x %>%
filter(update_ts >= "2020-10-13 17:40:29" & update_ts <= "2020-10-13 17:40:29") %>%
ggplot() +
aes(x = user_id, y = n_trips) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
And because of the line with filter(...)
, there is no data for plotting. I'm pretty sure that this worked in my deployment.
...
@asiripanich this does seem to be because there was only one participant before. I created test data for a second participant and I can now see some data in the participant table. I am still not sure if this is working properly though. I created the second participant as "ios", but when I select curr_platform
I only see android
on the axis and the data shows 1/2
Two entries, but only android displayed | only row 1/2? |
---|---|
I can confirm that in the "Trips" tab, when I click on "Data", it shows me rows Number of rows: 72 / 72
at the top. But when I click on "Data" from participants, it shows me "Number of rows: 1 / 2 ". And at the bottom, it says "Showing 1 to 2 of 2 entries" although both entries are visible, and I have selected both of them
I can confirm that in the "Trips" tab, when I click on "Data", it shows me rows
Number of rows: 72 / 72
at the top. But when I click on "Data" from participants, it shows me "Number of rows: 1 / 2 ". And at the bottom, it says "Showing 1 to 2 of 2 entries" although both entries are visible, and I have selected both of them
For the 'participant' table, can you check what is the code that it uses to generate the plot? Click on 'Export & Code' next to the 'Data' button.
Please note that, highlighting entries on the table below the plot panel doesn't mean you are filtering the data that the plot can use. The plot panel only links to the table you are currently selecting.
Please note that, highlighting entries on the table below the plot panel doesn't mean you are filtering the data that the plot can use. The plot panel only links to the table you are currently selecting.
I see the exact same behavior even when I haven't selected anything, after closing the tab and reopening it fresh.
For the 'participant' table, can you check what is the code that it uses to generate the plot? Click on 'Export & Code' next to the 'Data' button.
data_esquisse %>%
filter(update_ts >= "2020-10-14 16:05:42" & update_ts <= "2020-10-14 17:09:59") %>%
So clearly the problem is with the update_ts check?
The update_ts field of the two entries is
'update_ts': datetime.datetime(2020, 10, 14, 16, 5, 42, 641000),
'update_ts': datetime.datetime(2020, 10, 14, 17, 9, 59, 906000),
So it looks like it is doing query for the range of the table but is missing some milliseconds.
"2020-10-14 17:09:59")
From my test filter()
doesn't automatically convert "2020-10-14 17:09:59" to a datetime object so that maybe another reason we don't see any data.
lubridate::ymd_hms()
converts chr
to POSIXct
.suppressMessages(library(dplyr))
suppressMessages(library(data.table))
suppressMessages(library(lubridate))
library(emdash)
cons = connect_stage_collections()
tidied_cleaned_trips =
query_cleaned_trips(cons) %>%
tidy_cleaned_trips()
#> Finished query, about to clean trips
#> Finished cleaning trips
tidied_participants =
tidy_participants(query_stage_profiles(cons), query_stage_uuids(cons))
#> [1] "user_id" "mpg_array" "source" "update_ts" "client"
#> [6] "user_email"
#> [1] "mpg_array" "source"
participants <- summarise_trips(tidied_participants, tidied_cleaned_trips)
participants = rbind(copy(participants)[1, `:=`(user_id = "test", n_trips = 100,
update_ts = lubridate::ymd_hms("2020-10-13 17:40:30"))], participants)
str(participants)
#> Classes 'data.table' and 'data.frame': 2 obs. of 9 variables:
#> $ user_id : chr "test" "192c82d3c1da4a679edaceb81693fce1"
#> $ update_ts : POSIXct, format: "2020-10-14 04:40:30" "2020-10-14 04:40:29"
#> $ client : logi NA NA
#> $ n_trips : int 100 50
#> $ n_trips_today : int 0 0
#> $ n_active_days : int 10 10
#> $ first_trip_datetime: POSIXct, format: "2016-07-22 16:26:08" "2016-07-22 16:26:08"
#> $ last_trip_datetime : POSIXct, format: "2016-08-11 21:05:09" "2016-08-11 21:05:09"
#> $ n_days : num 20.2 20.2
#> - attr(*, ".internal.selfref")=<externalptr>
# cannot filter datetime
participants %>%
filter(update_ts >= "2020-10-13 17:40:28" &
update_ts <= "2020-10-13 17:40:30")
#> Empty data.table (0 rows and 9 cols): user_id,update_ts,client,n_trips,n_trips_today,n_active_days...
# can filter datetime
participants %>%
filter(update_ts >= lubridate::ymd_hms("2020-10-13 17:40:28") &
update_ts <= lubridate::ymd_hms("2020-10-13 17:40:30"))
#> user_id update_ts client n_trips
#> 1: test 2020-10-14 04:40:30 NA 100
#> 2: 192c82d3c1da4a679edaceb81693fce1 2020-10-14 04:40:29 NA 50
#> n_trips_today n_active_days first_trip_datetime last_trip_datetime n_days
#> 1: 0 10 2016-07-22 16:26:08 2016-08-11 21:05:09 20.19376
#> 2: 0 10 2016-07-22 16:26:08 2016-08-11 21:05:09 20.19376
Created on 2020-10-15 by the reprex package (v0.3.0)
the interesting thing is that in the trips tab, "Export & code" doesn't show anything. If we can get the participants table to also stop filtering, maybe that would be good.
I don't see where the participants table is treated differently from the trips table.
Maybe this is not too terrible - the participants table will always skip the last entry, but maybe that is good enough for now. Don't want to make you waste your time...
let's come back to this again if there is a need for it. :)
I have updated the test data to have two participants and generated another PR. Do you want me to cherry-pick here as well?
@asiripanich I figured out the participant table issue (https://github.com/asiripanich/emdash/pull/4/commits/2257d235935c43fa70e2757a36477b0666a6e90e). I think this is ready to merge now.
@asiripanich I figured out the participant table issue (2257d23). I think this is ready to merge now.
I have tested it and it doesn't fix the issue for me.
@asiripanich I figured out the participant table issue (2257d23). I think this is ready to merge now.
I have tested it and it doesn't fix the issue for me.
I take this back. It is working and I'm merging this now.
This is a first step towards splitting the code into the common functionality, which deals with sensed and analysed results, and the study-specific functionality, which will go into another module in the future.
This removes the existing implementations of
query_trip_ends
,query_user_profile_survey
, andtidy_trip_ends
, and all references to them.The one place where we didn't just remove the reference was in
most_recent_trip_ends
. The subsequent code deals with missing recent trip ends, so we setmost_recent_trip_ends
to an empty vector instead of removing the subsequent references.