ProjectSidewalk / sidewalk-quality-analysis

An analysis of Project Sidewalk user quality based on interaction logs

Interesting analysis: how much data do we need per user before getting accurate predictions #27

Open jonfroehlich opened 5 years ago

jonfroehlich commented 5 years ago

A rather large but interesting and important analysis is: how much interaction log data do we need per user to accurately infer whether they are a "good" or "bad" user?

Roughly, the way to do this is to graph prediction accuracy as a function of the amount of data: how well does our model predict user quality after the tutorial, after one mission, after two missions, after three, and so on? A sketch of this evaluation loop is below.
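A minimal sketch of what that learning curve could look like, assuming a hypothetical `extract_features(user_logs, labels, max_missions=n)` that builds per-user features from only the first n missions of interaction logs (the real feature extraction lives elsewhere in this repo):

```python
# Sketch: prediction accuracy as a function of the amount of log data per user.
# `extract_features`, `user_logs`, and `labels` are hypothetical stand-ins for
# this repo's actual feature pipeline and ground-truth good/bad labels.
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

accuracies = {}
for n_missions in [0, 1, 2, 3, 4, 5]:  # 0 = tutorial only
    # Build features using only each user's first n missions of log data.
    X, y = extract_features(user_logs, labels, max_missions=n_missions)
    clf = SVC(kernel="rbf", class_weight="balanced")
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    accuracies[n_missions] = scores.mean()

# Plot accuracy vs. amount of data to see where predictions stabilize.
plt.plot(list(accuracies.keys()), list(accuracies.values()), marker="o")
plt.xlabel("Missions of interaction data per user")
plt.ylabel("Cross-validated accuracy")
plt.show()
```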

jonfroehlich commented 5 years ago

This is probably one of the highest priority analyses I'd like to see completed.

nch0w commented 5 years ago

[Screenshot: JupyterLab plot of precision/recall vs. number of panos]

So I did an initial analysis of this. This runs an SVM on features extracted from the first n panos visited by each user, with recursive feature elimination. Users who had visited fewer than n panos were filtered out of the analysis.

It looks like recall decreases as the number of panos increases, but the precision stays about the same.

*The first point is n=5 panos. A sketch of the setup follows.
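For reference, a hedged sketch of that setup, assuming a hypothetical `features_first_n_panos(interaction_logs, labels, n_panos=n)` that returns per-user features computed from the first n panos and binary 0/1 labels (the repo's real extraction code is not shown here):

```python
# Sketch: SVM + recursive feature elimination on features from the first n panos.
# `features_first_n_panos`, `interaction_logs`, and `labels` are hypothetical
# placeholders; users with fewer than n panos are assumed already filtered out.
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

for n in [5, 10, 25, 50, 100]:
    X, y = features_first_n_panos(interaction_logs, labels, n_panos=n)
    # RFE needs an estimator exposing coef_, so use a linear-kernel SVM here.
    selector = RFE(SVC(kernel="linear"), n_features_to_select=10)
    X_sel = selector.fit_transform(X, y)
    # Cross-validated predictions give per-n precision/recall estimates.
    y_pred = cross_val_predict(SVC(kernel="linear"), X_sel, y, cv=5)
    print(f"n={n}: precision={precision_score(y, y_pred):.2f}, "
          f"recall={recall_score(y, y_pred):.2f}")
```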

daotyl000 commented 5 years ago

Here is the distribution of how many users have seen at least a given number of panos:

| Panos seen | Users |
| --- | --- |
| 10+ | 342 |
| 25+ | 264 |
| 50+ | 188 |
| 100+ | 91 |
| 200+ | 47 |
| 300+ | 35 |
| 400+ | 33 |
| 500+ | 32 |
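A quick sketch of how such a cumulative distribution could be computed, assuming a hypothetical `pano_counts` pandas Series of panos-seen per user:

```python
# Sketch: count users who have seen at least each threshold of panos.
# `pano_counts` is a hypothetical pandas Series indexed by user ID.
import pandas as pd

thresholds = [10, 25, 50, 100, 200, 300, 400, 500]
for t in thresholds:
    print(f"{t}+ panos: {(pano_counts >= t).sum()} users")
```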