Open jonfroehlich opened 5 years ago
So here is an update on this issue. There are actually two recursive feature selection runs, one for the label classifier and one for the user accuracy classifier.
These are the results of feature selection for the label classifier:

Optimal number of features: 12
Mask: [ True True True True True True False True True True True True True]
The input features, in the same order as the mask, are ['label_type', 'sv_image_y', 'canvas_x', 'canvas_y', 'heading', 'pitch', 'zoom', 'lat', 'lng', 'proximity_distance', 'proximity_middleness', 'CLASS_DESC', 'ZONEID']
The only False entry in the mask is at index 6 (counting from 0), which means that the one feature that was eliminated was 'zoom'.
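For anyone who wants to double-check the mapping from mask to feature names, here is a minimal sketch. It only uses the feature list and mask reported above; the variable names (`feature_names`, `mask`, `kept`, `dropped`) are made up for illustration and are not taken from the analysis code.

```python
# Feature names and mask copied from the label-classifier results above.
feature_names = [
    'label_type', 'sv_image_y', 'canvas_x', 'canvas_y', 'heading', 'pitch',
    'zoom', 'lat', 'lng', 'proximity_distance', 'proximity_middleness',
    'CLASS_DESC', 'ZONEID',
]
mask = [True, True, True, True, True, True, False,
        True, True, True, True, True, True]

# Pair each name with its mask entry and split into kept vs. eliminated.
kept = [name for name, keep in zip(feature_names, mask) if keep]
dropped = [name for name, keep in zip(feature_names, mask) if not keep]

print(len(kept))  # 12
print(dropped)    # ['zoom']
```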
I will update this issue with the results of feature selection on the user accuracy classifier soon. We have a lot more accuracy features than label features.
In https://github.com/ProjectSidewalk/sidewalk-quality-analysis/issues/18#issuecomment-519696956, @nchowder reported some initial experimental findings with recursive feature selection (neat!).
In general, as the graph shows, our performance improves as we add input features (nice result!); however, I was quite surprised to see that we maxed out at only 9 input features. Haven't we brainstormed and discussed closer to 50? Once we run the feature selection algorithm with a larger input feature set, I'd also like you to report back on which features were most influential/helpful to the model.
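One possible way to report which features were most helpful is sketched below. It assumes scikit-learn's `RFECV` wrapped around a tree-based estimator that exposes `feature_importances_`. The thread does not say which estimator or settings are actually used, so the `RandomForestClassifier`, the synthetic data from `make_classification`, and the placeholder names `feature_0` ... `feature_7` are all assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption): NOT the real label or user features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Recursive feature elimination with cross-validation, dropping one feature per step.
selector = RFECV(RandomForestClassifier(n_estimators=25, random_state=0),
                 step=1, cv=3)
selector.fit(X, y)

# support_ is the boolean mask; ranking_ is 1 for every selected feature and
# larger for features that were eliminated earlier.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]

# estimator_ is refit on the selected features only, so its importances
# line up one-to-one with `selected`.
importances = selector.estimator_.feature_importances_
ranked = sorted(zip(selected, importances), key=lambda pair: pair[1], reverse=True)

print("Optimal number of features:", selector.n_features_)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With the real data, `feature_names` would be the actual column names, and the sorted list gives a rough "most influential first" ordering. Impurity-based importances can be biased toward high-cardinality features, so treat the ordering as a starting point rather than a final answer.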