ProjectSidewalk / sidewalk-quality-analysis

An analysis of Project Sidewalk user quality based on interaction logs

Feature Selection Experiments #46

Open jonfroehlich opened 5 years ago

jonfroehlich commented 5 years ago

In https://github.com/ProjectSidewalk/sidewalk-quality-analysis/issues/18#issuecomment-519696956, @nchowder reported some initial experimental findings with recursive feature selection (neat!).

[Graph: classifier performance vs. number of input features]

In general, as the graph shows, our performance improves as we add input features (nice result!); however, I was quite surprised to see that we maxed out at only 9 input features. Haven't we brainstormed and discussed closer to 50? Once we run the feature selection algorithm with a larger input feature set, I'd also like you to report back on which features were most influential/helpful to the model.
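Running feature selection over a larger candidate set and reporting feature influence can be done with scikit-learn's `RFECV`, which is presumably what the experiments above used. The sketch below is hypothetical (synthetic data and placeholder feature names, not the repo's actual pipeline), but it shows how to recover both the optimal feature count and a per-feature ranking.

```python
# Hypothetical sketch of recursive feature elimination with cross-validation.
# The data and feature names are synthetic stand-ins, not the project's logs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the interaction-log feature matrix.
X, y = make_classification(n_samples=200, n_features=13,
                           n_informative=6, random_state=0)
feature_names = [f"feat_{i}" for i in range(13)]

selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)

# support_ is a boolean mask over the candidate features;
# ranking_ is 1 for kept features, larger for features dropped earlier.
selected = [n for n, keep in zip(feature_names, selector.support_) if keep]
print("Selected:", selected)
print("Rankings:", dict(zip(feature_names, selector.ranking_)))
```

The `ranking_` attribute is one way to answer the "which features were most influential" question: features eliminated last (lower rank) survived more rounds of pruning.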

nch0w commented 5 years ago

Here is an update on this issue. There are actually two recursive feature selection runs: one for the label classifier and one for the user accuracy classifier.

These are the results of feature selection for the label classifier:

```
Optimal number of features : 12
Mask : [ True  True  True  True  True  True False  True  True  True  True  True  True]
```

The candidate feature set was ['label_type', 'sv_image_y', 'canvas_x', 'canvas_y', 'heading', 'pitch', 'zoom', 'lat', 'lng', 'proximity_distance', 'proximity_middleness', 'CLASS_DESC', 'ZONEID'].

Matching the mask against this list, the one feature that was eliminated is 'zoom'.
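Since the mask lines up position-by-position with the candidate list, the eliminated feature can be read off directly. A minimal sketch:

```python
# Map the RFE support mask back onto the candidate feature names
# to see which features were kept and which were dropped.
features = ['label_type', 'sv_image_y', 'canvas_x', 'canvas_y', 'heading',
            'pitch', 'zoom', 'lat', 'lng', 'proximity_distance',
            'proximity_middleness', 'CLASS_DESC', 'ZONEID']
mask = [True, True, True, True, True, True, False,
        True, True, True, True, True, True]

kept = [f for f, keep in zip(features, mask) if keep]
dropped = [f for f, keep in zip(features, mask) if not keep]
print(dropped)  # ['zoom']
```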

I will update this issue with the results of feature selection on the user accuracy classifier soon. We have a lot more accuracy features than label features.