MS2 - Give it your best shot! (20%)

TimkLee commented 2 years ago

[x] Try 3-4 methods/techniques to create the best model for predicting expected goals.
[x] Generate figures and include links

TimkLee commented 2 years ago

Notes:

The goal here is to try a wide variety of things rather than trying to hyper-optimize a single approach.

TimkLee commented 2 years ago

Different model types such as neural networks, decision trees, clustering, etc.
- Neural network using PyTorch with all available features: one hidden layer with many hidden neurons vs. several hidden layers with few neurons each.
- Decision tree using sklearn with all available features (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier)
- SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
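A minimal sketch of how two of the model types above could be compared on one validation split. The data is a synthetic stand-in from `make_classification` (roughly 10% positives to mimic rare goals), not the project's shot data, and the hyperparameters are arbitrary starting points. The neural network is left out to keep the example free of a PyTorch dependency.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder for shot features and goal labels (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each candidate is paired with the method used to get a ranking score.
candidates = {
    "decision_tree": (DecisionTreeClassifier(max_depth=5, random_state=0), "predict_proba"),
    "svc": (make_pipeline(StandardScaler(), SVC()), "decision_function"),
}

auc = {}
for name, (model, method) in candidates.items():
    model.fit(X_train, y_train)
    scores = getattr(model, method)(X_val)
    if scores.ndim == 2:
        # predict_proba returns one column per class; keep the positive class.
        scores = scores[:, 1]
    auc[name] = roc_auc_score(y_val, scores)
    print(f"{name}: validation ROC AUC = {auc[name]:.3f}")
```

ROC AUC only needs a ranking score, so the SVC's `decision_function` output is enough here. Calibrated probabilities would be needed before treating the output as an expected-goals value.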
More advanced feature selection strategies
- Applying PCA (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html?highlight=pca#sklearn.decomposition.PCA.fit)
- SelectKBest (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
- SelectPercentile (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html)
- Fisher's Score (https://github.com/jundongl/scikit-feature/blob/master/skfeature/function/similarity_based/fisher_score.py)
- Other sklearn.feature_selection methods, e.g. recursive feature elimination (https://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination)
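A rough sketch of how two of the options above (`SelectKBest` and `PCA`) could be compared inside a pipeline. The data is synthetic, and the choice of 5 components/features, the `f_classif` score function, and the logistic regression on top are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 20 features, only 5 of them informative.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

reducers = {
    "select_k_best": SelectKBest(score_func=f_classif, k=5),
    "pca": PCA(n_components=5),
}

results = {}
for name, reducer in reducers.items():
    # Fitting the reducer inside the pipeline keeps it out of the validation folds.
    pipe = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=1000))
    results[name] = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {results[name]:.3f}")

# Shape check: both reducers map 20 input columns down to 5.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
```

`SelectKBest` keeps 5 of the original columns, while `PCA` builds 5 new linear combinations, so only the former stays directly interpretable in terms of the original features.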
More thoughtful splitting of the training data
- Teams often start the season completely healthy but lose players to injury as it progresses, so one option is to omit the first 20 games per team, or to select the training/validation sets by alternating games.
- If regular-season and postseason games are inherently different or skewed, it would be interesting to train a model on regular-season games and test it on postseason games, and vice versa.
- Alternatively, separate the regular-season and postseason games first, then use the bootstrap method to generate as many postseason games as regular-season games before randomly selecting the training and validation sets.
- Explore a bootstrap strategy to increase the number of goal events, creating a more balanced dataset.
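A small sketch of two of the splitting ideas: alternating games between training and validation, then bootstrapping (sampling with replacement) the goal events in the training set until the classes are balanced. The record layout (`game_id`, `is_goal`), the 40 games, and the 10% goal rate are made-up placeholders, not the project's data.

```python
import random

rng = random.Random(0)

# Hypothetical shot records: 40 games with 50 shots each, about 10% goals.
shots = [
    {"game_id": game, "is_goal": int(rng.random() < 0.1)}
    for game in range(1, 41)
    for _ in range(50)
]

# Alternate games in chronological order: 1st, 3rd, 5th, ... go to training.
game_ids = sorted({s["game_id"] for s in shots})
train_games = set(game_ids[0::2])
train = [s for s in shots if s["game_id"] in train_games]
val = [s for s in shots if s["game_id"] not in train_games]

# Bootstrap the minority class in the training set only, so that
# resampled copies of a shot never leak into the validation set.
goals = [s for s in train if s["is_goal"] == 1]
non_goals = [s for s in train if s["is_goal"] == 0]
resampled_goals = rng.choices(goals, k=len(non_goals))  # sampling with replacement

balanced_train = non_goals + resampled_goals
rng.shuffle(balanced_train)

print(len(train), len(val), len(balanced_train))
```

Splitting by game before resampling matters: if the bootstrap ran first, duplicates of the same goal could end up on both sides of the split and inflate validation scores.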
Hyperparameter tuning, cross validation strategies
- Explore k-fold or leave-one-out cross-validation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
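One possible way to combine k-fold cross-validation with hyperparameter tuning is `GridSearchCV` with a `KFold` splitter. The sketch below uses synthetic data, and the decision tree and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data with roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# 5-fold CV; shuffling avoids any ordering in the rows affecting the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 10, 50]},
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best mean ROC AUC: {search.best_score_:.3f}")
```

Leave-one-out is the same idea with one sample per fold (`sklearn.model_selection.LeaveOneOut`). It becomes expensive quickly on a dataset with many shots, and per-fold ROC AUC is undefined with a single sample, so a different metric would be needed.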
Regularization
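- Explore L1 vs. L2 penalties for the logistic regression model.
- L2 Ridge (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- L1 Lasso (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)
- Note: Ridge and Lasso are linear regression estimators; for a classifier, sklearn.linear_model.LogisticRegression supports the same L1 and L2 penalties.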
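A minimal sketch of the L1 vs. L2 comparison for logistic regression, assuming a scikit-learn version where `LogisticRegression` accepts the `penalty` argument. The data is synthetic, and `C=0.05` (a fairly strong penalty) and the `liblinear` solver are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 20 features, only 5 of them informative.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X = StandardScaler().fit_transform(X)  # penalties are sensitive to feature scale

zero_counts = {}
for penalty in ("l1", "l2"):
    # liblinear supports both penalties; smaller C means stronger regularization.
    clf = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear", random_state=0)
    clf.fit(X, y)
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {clf.coef_.size} coefficients are exactly zero")
```

L1 tends to drive some coefficients exactly to zero, so it doubles as a feature selector, while L2 only shrinks them toward zero.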
