ClimbsRocks / data-formatter

Takes raw csv input and formats it to be ready for neural networks
19 stars · 7 forks

FUTURE: add in a subset of new features, prune out the non-useful ones, repeat #59

Open ClimbsRocks opened 8 years ago

ClimbsRocks commented 8 years ago

Right now, any time we add new features (polynomialFeatures.py, groupBy.py, imputingMissingValues.py, etc.), we add them all at once as one big group.

Only at some much later point, once we've aggregated all these new features together, do we perform feature selection.

It might make much more sense to perform feature selection at the end of each file that adds new features.

To optimize further, we could add only a subset of new features, perform feature selection, and then add the next subset.
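A minimal sketch of that incremental loop, assuming hypothetical feature-generator functions (`add_squares`, `add_pairwise_products`) standing in for polynomialFeatures.py and friends, and using scikit-learn's `SelectFromModel` as the selection step (the repo may use a different selector):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# toy dataset standing in for the formatted csv input
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# hypothetical feature generators, one per subset of new features
def add_squares(X):
    return np.hstack([X, X ** 2])

def add_pairwise_products(X):
    n = X.shape[1]
    prods = [X[:, i:i + 1] * X[:, j:j + 1]
             for i in range(n) for j in range(i + 1, n)]
    return np.hstack([X] + prods)

feature_generators = [add_squares, add_pairwise_products]

for generate in feature_generators:
    # add one subset of new features...
    X = generate(X)
    # ...then immediately prune the non-useful ones before the next subset,
    # so the feature matrix never grows unboundedly
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ).fit(X, y)
    X = selector.transform(X)
```

Because each subset is pruned before the next one is generated, peak memory stays close to the size of one expanded subset rather than the union of all of them.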

This would add computation time but save memory.

It would also ensure that any features that survive so many rounds of feature selection are genuinely robust.

However, it would probably cut out some marginally useful/borderline features, which may make the cut one round but not the next.
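One way to see the robust-vs-borderline split is to count how often each feature survives selection across cross-validation folds. This is only an illustration of the idea, not anything the repo currently does; the fold count and the "survives every fold" cutoff are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# count how often each feature survives selection across folds
survival = np.zeros(X.shape[1])
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in kf.split(X):
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ).fit(X[train_idx], y[train_idx])
    survival += selector.get_support()

# robust features survive every round; borderline ones make the cut
# in some rounds but not others
robust = np.where(survival == kf.get_n_splits())[0]
borderline = np.where((0 < survival) & (survival < kf.get_n_splits()))[0]
```

Repeated per-subset selection effectively keeps only the `robust` group, while a single end-of-pipeline selection pass would also admit whichever `borderline` features happened to make the cut that one time.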

Ultimately, this would let us try many more features. Since useless features would be pruned very quickly, we could add many more things without worrying about a memory explosion. It's highly unlikely that all of our feature engineering will be useful, but it is highly likely that some of it will be. I'd rather have the opportunity to try everything and let the data decide what's best for this particular dataset.

We do, of course, risk overfitting, but we're using so much cross-validation that I'm not too concerned.