agitter opened this issue 4 years ago
We may also want to consider a more complex dataset that has more features to introduce overfitting.
Logistic regression did not give a solution with L1 regularization that ignored one of the features.
@cmilica and I discussed this example today. The logistic regression regularization example uses simulated_t_cells_1.csv. This simulated data is a 2D dataset from the scikit-learn make_moons function. However, it is not very intuitive. The ideal classifier for this dataset should have low regularization so that both features are used.
We can instead start with the real T cell data size_intensity_feature.csv. It has two real features, size and intensity. We will add two more random features. Tentatively, those could be the intensity of pixel 100 and the intensity of pixel 200. The idea is that these two pixels are arbitrarily chosen and should not be informative about the class label. To generate the two random features, we could sample positive values from a Gaussian distribution.
We'll end up with 4 features. The idea is that L1 regularization will put no weight on the two pixel intensity features and some weight on the two real features. We can find values of C for the participants to try. At one extreme, there will be too little regularization, so all 4 features will have some weight. At the other extreme, there will be too much regularization, and only 0 or 1 of the real features will have weight. The sweet spot will weight only the 2 real features. Participants can use the validation evaluation metrics to see which value of C is best.
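As a rough sanity check outside the GUI, a sweep like the one below could confirm that this pattern actually appears. This is only a sketch: the augmented dataset's file name, the label column name, and the candidate C values are placeholders, not final lesson values.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder name for the augmented 4-feature dataset described above
df = pd.read_csv("size_intensity_random_feature.csv")
X = df.drop(columns="label")  # assumed label column name
y = df["label"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sweep from strong to weak regularization and report sparsity and accuracy
for C in [0.001, 0.01, 0.1, 1, 10]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_train, y_train)
    n_nonzero = int((clf.coef_[0] != 0).sum())
    acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"C={C}: {n_nonzero} non-zero weights, validation accuracy {acc:.3f}")
```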
Before using the software, we can also ask questions to build intuition about how much regularization the best logistic regression model should have.
If this sounds okay, the next step will be to add a short Python script to read in the T cell data, add the two random features, and write out the augmented dataset.
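A minimal sketch of that script, assuming the input and output file names, a "label" column, and arbitrary Gaussian parameters (all placeholders, not final choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the augmented file is reproducible

# Real T cell features: size, intensity, and an assumed label column
df = pd.read_csv("size_intensity_feature.csv")

# Two uninformative features standing in for arbitrary pixel intensities.
# Clipping at zero keeps the sampled values positive.
n = len(df)
df["pixel_100_intensity"] = np.clip(rng.normal(loc=50, scale=10, size=n), 0, None)
df["pixel_200_intensity"] = np.clip(rng.normal(loc=50, scale=10, size=n), 0, None)

df.to_csv("size_intensity_random_feature.csv", index=False)  # placeholder output name
```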
I was curious whether the T cell dataset with CellProfiler features would be a good example dataset. Unfortunately, it is not. I varied the C hyperparameter for an L1-regularized logistic regression classifier and looked at the validation set results, making the validation and test sets as large as possible. Stronger regularization (smaller C) hurts performance.
We may be better off using a real dataset to illustrate overfitting. The Penn ML benchmarks could be one possible source.
There is a Kaggle rental properties dataset that is similar to the hypothetical housing example we use to introduce machine learning: https://www.kaggle.com/arashnic/property-data
It is possible to train a logistic regression classifier that ignores one or both of the features in the example dataset with a suitable value of C. However, we need to revise the recommended C values in the lesson to include stronger regularization. We'll need to test this with multiple training-tuning splits to make sure the desired behavior appears consistently.
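That consistency check could look something like the sketch below, assuming a candidate C value and a "class" label column in simulated_t_cells_1.csv (both are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("simulated_t_cells_1.csv")
X = df.drop(columns="class")  # assumed label column name
y = df["class"]

C = 0.05  # placeholder for a revised, stronger-regularization value
for seed in range(10):
    X_train, X_tune, y_train, y_tune = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X_train, y_train)
    n_nonzero = int((clf.coef_[0] != 0).sum())
    print(f"split {seed}: {n_nonzero} feature(s) with non-zero weight")
```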
This logistic regression dataset will not demonstrate overfitting but does demonstrate regularization.
After changing the values of C in c82fe8fc75eb3804c395460edc9c5ed944fcd5e4, the L1 regularization example for logistic regression now shows 0, 1, or 2 features with non-zero weights depending on the value of C. This may be sensitive to the type of holdout and cross-validation selected. The defaults in ml4bio v0.1.4 were used to set the values of C.
There were a few instances where our sample datasets did not give the desired outcome, which made it hard to make the points we wanted about hyperparameter or model selection.
Part of the challenge may be the random data splitting. Do we need to introduce an explicit seed? Would that help introduce reproducibility concepts or complicate the workflow too much?
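For reference, a tiny illustration of what an explicit seed would buy us: with a fixed random_state, the training-tuning split is identical on every run and every machine. The toy arrays and the seed value here are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data just to show the behavior
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Same random_state -> identical split every time
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))  # True
```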