This project predicts football results and explains the predictions, using `XGBoost` and `SHAP` respectively. There is also a `streamlit` app to visualize and track performance.
Initial data is taken from https://www.football-data.co.uk/. The data is then enriched with features, mainly match statistics (goals, shots, fouls, etc.). It is tricky to reconstruct what the league table looked like before a given game, so I wrote a `LeagueTable` class for that. See `src/preprocess/features.py` for more detail.
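To illustrate the idea (not the project's actual implementation — the real class lives in `src/preprocess/features.py` and the method names here are hypothetical), a minimal `LeagueTable` sketch might replay finished games up to a cutoff so standings reflect only what was known before kickoff:

```python
from collections import defaultdict

class LeagueTable:
    """Minimal sketch: rebuild league standings as they stood
    before a given matchday, from a list of finished games."""

    def __init__(self):
        self.points = defaultdict(int)
        self.played = defaultdict(int)

    def add_result(self, home, away, home_goals, away_goals):
        # Feed in only games played BEFORE the fixture being predicted,
        # so the table contains no information from the future.
        self.played[home] += 1
        self.played[away] += 1
        if home_goals > away_goals:
            self.points[home] += 3
        elif home_goals < away_goals:
            self.points[away] += 3
        else:
            self.points[home] += 1
            self.points[away] += 1

    def standings(self):
        # Teams sorted by points, descending.
        return sorted(self.points.items(), key=lambda kv: -kv[1])
```

Replaying results chronologically like this avoids leakage: the table a model sees for a fixture is the one that actually existed at the time.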
For storage I use Azure Blob Storage. I wrote common read/write methods to simplify data management; see `src/storage/tables.py` for more detail.
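The shape of such read/write helpers can be sketched as below. The in-memory client is a stand-in for illustration only — the real project would use `azure.storage.blob`, and the function names here are assumptions, not the actual API of `src/storage/tables.py`:

```python
import csv
import io

class InMemoryBlobClient:
    """Stand-in for an Azure Blob container client, for illustration only."""
    def __init__(self):
        self._blobs = {}

    def upload_blob(self, name, data, overwrite=True):
        self._blobs[name] = data

    def download_blob(self, name):
        return self._blobs[name]

def write_table(client, name, rows):
    """Serialize a list of dicts to CSV and upload it as a single blob."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    client.upload_blob(name, buf.getvalue(), overwrite=True)

def read_table(client, name):
    """Download a blob and parse it back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(client.download_blob(name))))
```

Centralizing serialization like this means the rest of the codebase never touches blob paths or formats directly.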
I used `XGBoost` as it performs strongly on tabular data. It also handles missing data well, which matters here: some features are naturally missing and should not be imputed (at the start of the season, all statistics are empty for every team).
I used `hyperopt` to optimize the model. It accepts any objective function of the form `f(X) -> y` where `y` is a float; I used a standard RMSE output to optimize the model. Please check out `train.ipynb` and `src/modelling/experiment.py` for more detail.
NOTE: Training is fully offline & local. Eventually I want to move to Azure ML / Synapse but for now that's too much work for a pet project :^)
`XGBoost` also pairs well with the `SHAP` package, which helps to explain each prediction by assigning a marginal impact on the output to each input variable. To be perfectly honest, this part of the project has been underwhelming, as most of the variation in the predictions is usually explained by the odds.
Source: https://github.com/slundberg/shap
The frontend was done with `streamlit`; I would highly recommend it to anyone who hates doing frontend as much as I do. It is really easy to use. It's a pretty limited framework, but that's the point. A previous version of this project used the normal JS/CSS/HTML stack, and I hated it so much.
Deployment is done with Azure App Service + GitHub Actions for CI/CD. It's straightforward; just don't forget to specify the startup command in the portal.
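For reference, a typical startup command for a Streamlit app on App Service might look like the following. `app.py` is an assumed entry-point name, not necessarily this project's:

```shell
# Set under Configuration → General settings → Startup Command in the
# Azure portal. Binds Streamlit to all interfaces on port 8000, which
# App Service forwards to by default for custom startup commands.
python -m streamlit run app.py --server.port 8000 --server.address 0.0.0.0
```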
Pipelines are scheduled with cron. Open the crontab with `crontab -e` (press 2 to pick an editor if prompted), then add:

`*/10 * * * * date > /tmp/test.txt`

(TODO: determine if this is even needed, but for now I'm using it to keep the app alive)

`0 0 * * 3,6 {full_python_path} run_pipelines.py`

(3 and 6 denote Wednesday and Saturday, approximately when the fixtures get refreshed). Find `{full_python_path}` by running `which python`.