Closed manaswisaha closed 7 years ago
This is going to be @jihyukbae's initial task to contribute to Project Sidewalk.
I'll be taking this on so that we have some more informative graphs and metrics to look at to assess how the relaunch is going. Here is my current plan (looking for any feedback/suggestions):
Changes to current graphs (see screenshots in the original issue comment):
New graphs (some suggestions came from #351):
I skimmed this list and it seems reasonable on a first take. I also appreciated that you went through all open Issues looking for those that were admin dashboard related. Thanks for doing this.
One comment about "Coverage Rate per Neighbourhood % graph." I think you should actually do as suggested in https://github.com/ProjectSidewalk/SidewalkWebpage/issues/412.
Also, @misaugstad, can you go back through your list and add an expected value proposition for each graph that you plan to create?
For the DC coverage percentage chart, do we prefer an area or line graph?
I will also change the interval between ticks on the y-axis from 10 percentage points to 20.
Any other feedback on this?
I like the area graph. Any reason why this is square? I think it could be more landscape (longer horizontally than vertically).
But, of course, you may have an overall layout plan for the admin page that I'm not aware of, which may justify the square design. :)
I agree, more horizontal!
And thoughts on these histograms of severity rating by label type?
Few comments:
Will do for the x-axis ticks, didn't catch that before, thanks!
And for the y-axis scale, ideally I would keep the scale across all of them, but there are so many more curb ramps than the other labels, that the other histograms would be hard to read if we did counts. But then if we went with a proportion scale (from 0 to 1), we would be losing the counts information.
So that is my current rationale, but can still be swayed!
I do not think the y-axes should be the same for this dataset. Doing so would completely obscure the less common label types and trends therein. If keeping the y-axis the same is super important, you could switch it to percentage of labels; however, I prefer raw counts and scaling per type as you have it.
Other than that, I think the graphs are too tall proportionate to their width. Shrink them by 20-30% vertically or so?
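To make the trade-off in the y-axis discussion concrete, here is a minimal sketch comparing raw counts on a shared axis with per-type proportions. The label type "Obstacle" and every number below are made up for illustration; only the fact that curb ramps far outnumber other labels comes from the thread.

```python
from collections import Counter

# Hypothetical severity ratings (1-5) per label type; all numbers are invented.
severities = {
    "CurbRamp": [1] * 800 + [2] * 150 + [3] * 50,
    "Obstacle": [2] * 10 + [3] * 25 + [4] * 15,
}

def severity_counts(ratings):
    """Raw number of labels at each severity 1..5."""
    tally = Counter(ratings)
    return [tally.get(s, 0) for s in range(1, 6)]

def severity_proportions(ratings):
    """Share of labels at each severity; the total count is no longer visible."""
    counts = severity_counts(ratings)
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

# Tallest bar across all label types, i.e. the top of a shared count axis.
shared_max = max(max(severity_counts(r)) for r in severities.values())

for label_type, ratings in severities.items():
    counts = severity_counts(ratings)
    # Fraction of a shared y-axis that this type's tallest bar would reach.
    print(label_type, counts, max(counts) / shared_max)
```

With these invented numbers, the tallest "Obstacle" bar reaches only about 3% of a shared count axis, which is the readability problem described above; `severity_proportions` fixes the scale but hides how many labels each type has.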
They have been shrunk! Next up for review... I extended the daily label counts graph to go back to the end of 2015, now again do you prefer line or area..?
Again, I think area looks nicer.
I agree. And I think those selector zooming things (don't remember the actual names :P) would be helpful.
Area is better.
Okay how do they look now?
But first, some of my notes:
Thanks for making these updates so quickly. We can work on the color scheme. :)
One thing I forgot to mention, I'd really like it if you could label the mean and median lines on the graph.
After you finish addressing these comments, I'd really like you to start focusing on the more user-centric analyses in your checklist: that is, trying to better understand user behavior: how long do they spend on PS? how many missions do users complete? what does typical labeling behavior look like? what are the differences between registered vs. anonymous users? All of these questions, I think, can be best answered via histograms with descriptive stats.
The visualizations you've created thus far give us a better sense of the data but not of users particularly.
@jonfroehlich Shall I start reviewing the PR on the current changes to the admin page and integrate it with the dev server? Probably would be good for testing out these changes as well.
Yes, if you can. I'm still not sold on this whole checklist thing--seems like an abuse of the github Issue system and it's hard for me to refer to explicit checkboxes in this comment thread. I'd imagine you can't refer to explicit checkboxes in commits and pull requests either...
Yes, that's true. You can't refer to them explicitly. I think Mikey used it for his personal use to check off things he needs to complete rather than us referring to it elsewhere.
> Shall I start reviewing the PR on the current changes to the admin page and integrate it with the dev server? Probably would be good for testing out these changes as well.
I definitely think this should be done ASAP. I made changes to queries that affect other parts of the tool, including adding a new table to the database, so I really want these changes to be included in all the testing that happens on the dev server. Adding a few more histograms and making cosmetic changes can be done later, even after relaunch, and doesn't need as much testing.
> One thing I forgot to mention, I'd really like it if you could label the mean and median lines on the graph.
Do you mean the same thing here, where "labeling" the mean and median means putting the actual values as text overlaid on the graph?
> After you finish addressing these comments, I'd really like you to start focusing on the more user-centric analyses in your checklist: that is, trying to better understand user behavior: how long do they spend on PS? how many missions do users complete? what does typical labeling behavior look like? what are the differences between registered vs. anonymous users? All of these questions, I think, can be best answered via histograms with descriptive stats.
:+1:
> I'm still not sold on this whole checklist thing--seems like an abuse of the github Issue system and it's hard for me to refer to explicit checkboxes in this comment thread. I'd imagine you can't refer to explicit checkboxes in commits and pull requests either...
I will switch them to a numbered list for easier referral.
A side note: In a lot of cases, I imagine that we would want to see the differences between pre-relaunch and post-relaunch. This is baked into the time-series graphs, but for histograms they are all combined. How do we think it would be best to visualize the differences? Have three histograms for each stat: pre-relaunch, post-relaunch, and combined? And then actually make it 6 histograms, since we would split between registered and anon users..?
Or would we want to have a button you can click, like there now is to choose the sorting order for neighborhood completion %, so that there aren't just 6 graphs on the page per stat?
I think we most need this when looking at onboarding completion times. We have lengthened the tutorial for relaunch, so we don't just want that summed up with all past tutorial times.
> Do you mean the same thing here, where "labeling" the mean and median means putting the actual values as text overlaid on the graph?
Yes. Probably should be located towards the top of graph to avoid overlap with other objects
Re: pre-relaunch vs. post-relaunch. I have been thinking of this as well. Not sure how to deal with it exactly but your suggestions make sense. I would prefer that you work on this post-relaunch, however, and focus on major analytics that we already brainstormed for relaunch. Oh, also, this line of thinking brought up another point we discussed: are we tracking versions in a server database to make these sorts of queries easier? @manaswis, I remember we talked about this--did we end up adding another table that maps dates to version numbers so that we can set up semantic queries rather than date-based queries...
@jonfroehlich I created an issue for it - #653 but it's not addressed yet. Relaunch system fixes took precedence. I can add that after the relaunch. We can note the date when it is launched. So it shouldn't be hard to populate it after the relaunch.
@manaswis. Right, makes sense. We can go back and populate table. Thanks.
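As a rough illustration of the date-to-version table idea, the sketch below looks up which version was live on a given day, so a query can ask for "pre-relaunch" data rather than hard-coding a date range. The release dates and version labels are placeholders, not Project Sidewalk's actual release history, and `version_on` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical release table: (first day the version was live, version label).
# Dates and labels are placeholders chosen only for this example.
RELEASES = [
    (date(2015, 1, 1), "pre-relaunch"),
    (date(2017, 7, 1), "relaunch"),
]
RELEASE_DATES = [d for d, _ in RELEASES]

def version_on(day):
    """Return the version label live on `day`, or None if before the first release."""
    i = bisect_right(RELEASE_DATES, day)
    return RELEASES[i - 1][1] if i else None

print(version_on(date(2016, 6, 1)))
print(version_on(date(2017, 7, 1)))
```

Because the table only needs one row per launch, it can be filled in after the fact as long as the launch dates were noted.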
I've made a histogram of missions completed per user (not for a single session, counting all of their logins). First question I have: should I throw everything over 100 missions into its own bin like I did with the onboarding completion time histogram? Also the standard deviation seems wrong, but that may be a copy/paste error, will look into it.
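For reference, an overflow bin like the one asked about can be sketched as follows. The function name, the bin width of 10, and the choice to put exactly 100 into the overflow bin are all assumptions made for the example; the thread only says "everything over 100".

```python
from collections import Counter

def histogram_with_overflow(values, bin_width, cap):
    """Count values in fixed-width bins below `cap`; everything at or above
    `cap` shares one overflow bin labelled '<cap>+'."""
    bins = Counter()
    for v in values:
        if v >= cap:
            bins[f"{cap}+"] += 1
        else:
            lo = (v // bin_width) * bin_width
            bins[f"{lo}-{lo + bin_width - 1}"] += 1
    return dict(bins)

# Made-up mission counts per user.
missions_per_user = [0, 5, 12, 99, 100, 250]
print(histogram_with_overflow(missions_per_user, bin_width=10, cap=100))
```

The overflow bin keeps a handful of very active users from stretching the x-axis, at the cost of hiding how far past the cap they are.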
Great. Glad we are getting started on the user-centric analytics.
After a hard-fought battle with Scala, Slick, and SQL, I have won. And here is a histogram of missions completed by "anonymous users"
Making the x-axis labels integers and adding the mean/median labels to the legend or graph is still coming. But on to the more pertinent question... How do we want to define an anonymous user?
Using the criteria that IPs must have completed auditing at least one street, in the dump from December there are 169 IPs/users. If you just take any IP address that shows up in the AuditTaskInteractions table, that is about 1000 IPs.
When doing onboarding, activity is logged in the AuditTaskInteractions table, so I don't think it makes sense to just use any IP present in there as an anonymous user, since they could have just clicked on the tutorial and immediately left. This info will be useful for making retention graphs, but not for when we are looking at "anonymous users".
Another benchmark that I have considered for determining an "anonymous user" is to take anyone who finishes the tutorial, taken from the same table. However, I don't think that is always logged in the interactions table, since they could skip the tutorial, or having finished the tutorial could be saved in the browser, etc. Note that having finished the onboarding is different from completing an audit task, since finishing onboarding does not log a "TaskEnd" in the table; it logs an "Onboarding_End".
We could also just take only those IPs that have finished an entire mission.
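The candidate definitions above differ only in which logged action an IP must have. A minimal sketch, assuming the interactions can be read as (IP, action) pairs: the action names "TaskEnd", "Onboarding_End", and "ModeSwitch_Walk" appear in this thread, while the IP addresses, the row layout, and the helper `ips_with_action` are made up.

```python
# Hypothetical rows from the interactions log: (ip_address, action).
interactions = [
    ("192.0.2.1", "Onboarding_End"),
    ("192.0.2.1", "TaskEnd"),
    ("192.0.2.2", "Onboarding_End"),
    ("192.0.2.3", "ModeSwitch_Walk"),
]

def ips_with_action(rows, action):
    """Set of IPs that logged `action` at least once."""
    return {ip for ip, a in rows if a == action}

# Loosest definition: any IP that appears in the table at all.
any_activity = {ip for ip, _ in interactions}
# IPs that logged the end of the tutorial.
finished_tutorial = ips_with_action(interactions, "Onboarding_End")
# IPs that finished auditing at least one street.
audited_a_street = ips_with_action(interactions, "TaskEnd")

print(len(any_activity), len(finished_tutorial), len(audited_a_street))
```

Each stricter definition yields a subset of the loosest one, which matches the gap reported above between roughly 1000 IPs with any interaction and 169 with a completed street audit.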
Here is another graph! Jon is messing with the data :smirk:
Yikes, we may want y-axis to be logarithmic here otherwise first bin overpowers other data...
I am realizing that it can be quite difficult to make a histogram of integers on a log scaled x-axis look good!
I think that we need a table of researchers (that you had mentioned earlier) so that we can be removed from graphs like this. 5 out of the 6 highest values in this graph are from researchers, and there is really no reason to include us in this graph. For most of the user-centric graphs, we really don't need to be included. At the very least, we need to have the option of looking at the user-centric data with us taken out.
@misaugstad: I was thinking you would do log scale on y-axis, which seems simple enough imo.
Re: table. Yes.
@ myself: in answer to my question about defining an anonymous user, it seems that there was discussion about this before I arrived, which I noticed when reading through #323.
With a lot of the analytics that we want to look at, there are 5 groups of users that we may want to see a graph for: all users, all users minus researchers, registered users, registered users minus researchers, and anonymous users. To try and get all that information without taking up a huge amount of space, I figure that we could have a button (or something like that) that would toggle whether we include the researchers in our histograms. I coded this up for one set of histograms, pictured below:
Defaults to including researchers...
Then upon clicking "exclude researchers", the viz updates...
Thoughts?
I like it. Use a checkbox 'Include researchers' and default to off?
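A sketch of how one toggle can cover the five user groups listed above, with researchers excluded by default as suggested. The record fields (`registered`, `researcher`, `missions`) and the function `select_users` are hypothetical and are not taken from the Project Sidewalk codebase.

```python
# Hypothetical per-user records; field names and values are invented.
users = [
    {"id": "u1", "registered": True, "researcher": True, "missions": 120},
    {"id": "u2", "registered": True, "researcher": False, "missions": 4},
    {"id": "ip-1", "registered": False, "researcher": False, "missions": 1},
]

def select_users(users, group="all", include_researchers=False):
    """Filter users by group ('all', 'registered', or 'anonymous').
    Researchers are dropped unless include_researchers is True."""
    selected = []
    for u in users:
        if group == "registered" and not u["registered"]:
            continue
        if group == "anonymous" and u["registered"]:
            continue
        if u["researcher"] and not include_researchers:
            continue
        selected.append(u)
    return selected

# Data behind a "registered users" histogram with the checkbox off, then on.
print([u["missions"] for u in select_users(users, "registered")])
print([u["missions"] for u in select_users(users, "registered", include_researchers=True)])
```

Three groups times one checkbox gives the combinations needed without rendering a separate chart for each.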
So here are a few new graphs for the admin interface. Any comments before I submit a PR for them so they can be included in the stress testing today?
Right now, the anonymous user labels are incorrect, in the same way as issue #791. I have an idea for how to fix it, but if that doesn't work, the fix may not happen today. The way to fix it is to use a select distinct, but Slick doesn't have select distinct directly built in (well, maybe they do in the newest version, but that isn't well documented yet anyway). So my idea is for a workaround.
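One generic way to get the effect of a select distinct without the keyword is to group by the same columns. The sketch below shows the equivalence in plain SQL through Python's `sqlite3`; it is not necessarily the workaround intended above, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the duplicate row stands in for double-counted labels.
conn.execute("CREATE TABLE interaction (ip TEXT, label_id INTEGER)")
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?)",
    [("192.0.2.1", 1), ("192.0.2.1", 1), ("192.0.2.1", 2), ("192.0.2.2", 3)],
)

# The direct form, using DISTINCT.
distinct_rows = conn.execute(
    "SELECT DISTINCT ip, label_id FROM interaction ORDER BY ip, label_id"
).fetchall()

# The workaround: grouping by every selected column collapses duplicates too.
grouped_rows = conn.execute(
    "SELECT ip, label_id FROM interaction GROUP BY ip, label_id ORDER BY ip, label_id"
).fetchall()

print(distinct_rows == grouped_rows, len(grouped_rows))
```

Whether the query builder in use generates efficient SQL for the grouped form would need to be checked against the generated statement.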
Also, the following graphs are now bar graphs instead of area graphs. The area versions can be seen at this comment. (Note that the histograms next to these graphs are not pictured, and that is where the legends are)
@r-holland is going to get started on making a graph of time spent using Project Sidewalk, where we count 5+ minutes of inactivity as not using the tool. This should be a good intro to our backend, while providing something very useful for the dashboard!
Sounds good. Thanks Mikey.
Looks like I have most everything working:
Unfortunately, @misaugstad explained to me that a query similar to the one I am conducting caused the server to crash a while ago. I will work on implementing more intermediate calculations on the back end to hopefully reduce the query size.
Thanks @r-holland.
@jonfroehlich Does this updated graph address your concerns?
Can you show me this with 5 min and 10 min bins in addition to 20? Also, did you switch out the compute function for this?
I am still in the process of switching the computation function to the back end, been busy with other issues. Plan is to get this done today.
Thanks. I like 5 or 10 min bins the best.
Thanks to the discovery of lag(), my new favorite SQL function, I have dropped the total query time to about 2 seconds. All of the calculations are back end now. @misaugstad also suggested limiting the query to only ModeSwitch events...I will let him explain his own reasoning:
"ModeSwitch_Walk is logged during both panning and "walking", and the other mode switches are for the different label types. So if no one pans, changes pano, or switches to a labeling mode for 5 min, they probably aren't doing anything really."
Here is the updated histogram:
@r-holland how is it possible that after removing some of the interactions, the average amount of time spent actually went up? Shouldn't it be that removing interactions should result in, at most, the same amount of time spent as with all interactions?
You should run your current implementation, but for all interactions, and make sure that it looks the same as your original graphs. And if it doesn't, we need to find out why.
Also, this is only looking at registered users, which should be mentioned. And we should do this for anonymous users as well. That should not be hard at all: you just group by IP address instead of user id.
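The time-spent calculation discussed in this part of the thread can be sketched in plain Python as below: each timestamp is paired with the previous one for the same key, which is the role lag() plays in SQL, and gaps of 5 minutes or more are not counted. The event layout, the function name `active_seconds`, and the sample timestamps are assumptions; the real query runs in the database. The key can be a user id or an IP address.

```python
from collections import defaultdict

INACTIVITY_CUTOFF_S = 5 * 60  # gaps of 5+ minutes count as not using the tool

def active_seconds(events, cutoff=INACTIVITY_CUTOFF_S):
    """events: iterable of (key, unix_timestamp_seconds).
    Returns {key: total seconds across gaps shorter than `cutoff`}."""
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)
    totals = {}
    for key, stamps in by_key.items():
        stamps.sort()
        # Pair each timestamp with the one before it, like lag() in SQL.
        gaps = (cur - prev for prev, cur in zip(stamps, stamps[1:]))
        totals[key] = sum(g for g in gaps if g < cutoff)
    return totals

# Made-up events for one user.
events = [("u1", 0), ("u1", 200), ("u1", 400), ("u1", 1000), ("u1", 1100)]
all_time = active_seconds(events)["u1"]
# Drop the event at t=200, as a stand-in for filtering to fewer event types.
filtered_time = active_seconds([e for e in events if e[1] != 200])["u1"]
print(all_time, filtered_time)
```

Under this definition, removing events can only keep a single user's total the same or lower it, because two short gaps may merge into one gap that crosses the cutoff. A mean taken across users can still rise if the filter removes some low-total users from the result entirely; this sketch does not establish whether that is what happened in the histogram above.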
Is this Issue still active or should we close it out?
Yep, there were just a couple remaining charts that had not been created, so I made separate issues for each of them. Closing this one now. Nice work, team!
Currently, these aren't very clear to the user and aren't useful. Things that can be improved:
Screenshots from my local system:
More improvements can be added when we get around to working on this.