This project collated data from three sources: a database of 25,000 European soccer matches, betting odds for each match, and player and team attributes from FIFA. It used this data to analyze each player and predict the outcomes of soccer matches.
Comments on the project:
Although I really like LaTeX files, I have to say that making this an .md file makes it very readable, and I enjoyed reading it, aesthetically speaking.
In your data cleaning step, you deleted the matches that were missing player ID and position info. I would like to have seen an analysis of the dropped data, since the missingness could arguably be correlated with the kind of match (e.g. statistics aggregators may be more careful about the accuracy of their information on popular players and matches than on low-ranked teams). Especially given that this is a significant amount of data (~20% of your full dataset), this analysis would have been great. In all honesty, though, that's more me being curious, because your assumption is not an unreasonable one to make.
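As a rough sketch of the check I have in mind, one could compare the outcome distribution of the kept matches against the dropped ones. The field names (`player_ids`, `positions`, `outcome`) and the toy data below are my own assumptions for illustration, not the project's actual schema or numbers.

```python
from collections import Counter

# Hypothetical field names; the project's actual schema may differ.
REQUIRED_FIELDS = ("player_ids", "positions")


def split_by_completeness(matches):
    """Separate matches with all required fields from those missing any."""
    kept, dropped = [], []
    for match in matches:
        complete = all(match.get(field) is not None for field in REQUIRED_FIELDS)
        (kept if complete else dropped).append(match)
    return kept, dropped


def outcome_shares(matches):
    """Fraction of home wins, draws, and away wins among the given matches."""
    counts = Counter(match["outcome"] for match in matches)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("home", "draw", "away")}


# Toy data for illustration only, not taken from the project.
toy_matches = [
    {"outcome": "home", "player_ids": [1, 2], "positions": ["GK", "ST"]},
    {"outcome": "home", "player_ids": [3, 4], "positions": ["GK", "CB"]},
    {"outcome": "away", "player_ids": [5, 6], "positions": ["GK", "CM"]},
    {"outcome": "draw", "player_ids": None, "positions": ["GK", "ST"]},
    {"outcome": "home", "player_ids": [7, 8], "positions": None},
]

kept, dropped = split_by_completeness(toy_matches)
kept_shares = outcome_shares(kept)
dropped_shares = outcome_shares(dropped)
print("kept:", kept_shares)
print("dropped:", dropped_shares)
```

If the two distributions differ noticeably on the real data (or if the dropped matches cluster in particular leagues or seasons), that would suggest the deletion is not neutral.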
Not an issue, but I thought your categorization of team positions into 35 different formations was great, and I would have loved to see a full gallery of all 35.
I really liked the use of the correlation matrix as a sanity check for your feature reduction; without it, it wouldn't be particularly obvious that you can combine all of the columns into one for each team. I would have liked to see a variance plot showing the percentage of variance explained by each eigenvector for your dimensionality reduction.
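For reference, the quantity I would plot is each eigenvalue divided by the sum of all eigenvalues. A minimal sketch follows, assuming the eigenvalues of the feature covariance matrix are already available; the values used here are made up for illustration and the function names are my own.

```python
from itertools import accumulate


def explained_variance_ratio(eigenvalues):
    """Share of total variance carried by each eigenvector, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]


def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for count, cumulative_share in enumerate(accumulate(ratios), start=1):
        if cumulative_share >= threshold:
            return count
    return len(ratios)


# Hypothetical eigenvalues of a feature covariance matrix, for illustration only.
toy_eigenvalues = [1.0, 4.0, 2.0, 3.0]
ratios = explained_variance_ratio(toy_eigenvalues)
cumulative = list(accumulate(ratios))
print("per-component share:", ratios)
print("cumulative share:", cumulative)
```

Plotting `ratios` as bars with `cumulative` as a line would make it easy to see how many components are worth keeping.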
The structure of your approaches (linear regression, then multiclass classification, then perceptron, then hinge loss, and so on) was great and showed how you thought through each step.
I would have liked to see a comparison of your results against the betting lines and the FIFA stats. Obviously, the betting lines and FIFA stats are themselves aggregates of a lot of other data, so I am curious to see how much of an improvement your method is over using that data on its own.
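One simple baseline would be to always predict the bookmaker's favorite and report its accuracy next to yours. The sketch below assumes the odds are in decimal format, given as (home, draw, away); that format, the function names, and the toy numbers are assumptions on my part.

```python
OUTCOMES = ("home", "draw", "away")


def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities, normalizing out the bookmaker margin."""
    inverse = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(inverse)
    return [value / total for value in inverse]


def favorite(home_odds, draw_odds, away_odds):
    """Outcome the odds consider most likely."""
    probabilities = implied_probabilities(home_odds, draw_odds, away_odds)
    return OUTCOMES[probabilities.index(max(probabilities))]


def accuracy(predictions, actuals):
    """Fraction of matches where the prediction equals the actual outcome."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


# Toy decimal odds (home, draw, away) and results, for illustration only.
toy_odds = [(1.8, 3.6, 4.5), (3.2, 3.3, 2.2), (2.5, 3.1, 2.9)]
toy_actuals = ["home", "away", "draw"]

baseline_predictions = [favorite(*odds) for odds in toy_odds]
baseline_accuracy = accuracy(baseline_predictions, toy_actuals)
print(baseline_predictions, baseline_accuracy)
```

Reporting your model's accuracy on the same matches would show directly whether it adds anything beyond the odds.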
Along the lines of the previous point, I'd really like to see how much money you would win (or perhaps lose) if you used the results of your algorithm to make bets on every match in your dataset. This would be an interesting error metric, and if your algorithm is good enough you could definitely make money from it.
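A minimal version of this metric could be a flat-stake simulation: bet one unit on the predicted outcome of every match and add up the returns. This sketch assumes decimal odds given as (home, draw, away) and uses made-up predictions and results; none of the numbers come from the project.

```python
ODDS_INDEX = {"home": 0, "draw": 1, "away": 2}


def flat_stake_profit(predictions, actuals, odds, stake=1.0):
    """Total profit from betting a fixed stake on each predicted outcome."""
    profit = 0.0
    for predicted, actual, match_odds in zip(predictions, actuals, odds):
        if predicted == actual:
            # A winning bet returns stake * odds, so the net gain is stake * (odds - 1).
            profit += stake * (match_odds[ODDS_INDEX[predicted]] - 1.0)
        else:
            # A losing bet forfeits the stake.
            profit -= stake
    return profit


# Toy predictions, results, and decimal odds (home, draw, away), for illustration only.
toy_predictions = ["home", "away", "draw"]
toy_actuals = ["home", "draw", "draw"]
toy_odds = [(2.0, 3.5, 4.0), (2.5, 3.0, 3.0), (2.8, 3.5, 2.6)]

total_profit = flat_stake_profit(toy_predictions, toy_actuals, toy_odds)
return_on_stake = total_profit / len(toy_predictions)
print(total_profit, return_on_stake)
```

To keep the number honest, the bets should be placed only on matches the model was not trained on.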
Great use of gifs at the beginning and end of the project; they really emphasize how dynamic your findings are and the importance of this project to the sport of soccer. I would like a source on the soccer gif at the end of the paper though.
Did you perform cross-validation on your dataset? You had enough points to do so given the number of feature vectors you had and this could have let you choose a better regularizer or weights for your loss functions. This would be a great improvement to make in the future.
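To illustrate what I mean, here is a small k-fold sketch that picks a regularization weight by held-out error. It uses a one-feature ridge regression as a stand-in model, which is my simplification and not the models from the project; the helper names and toy data are likewise hypothetical, and the contiguous folds assume the rows have already been shuffled.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, lam):
    """Closed-form weight for one-feature ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cross_validated_error(xs, ys, lam, k=5):
    """Mean squared error on held-out folds, averaged over the k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        weight = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_error = sum((ys[i] - weight * xs[i]) ** 2 for i in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Toy noise-free data (y = 2x), for illustration only.
toy_xs = [float(i) for i in range(1, 11)]
toy_ys = [2.0 * x for x in toy_xs]
candidate_lams = [0.0, 1.0, 10.0]

best_lam = min(candidate_lams, key=lambda lam: cross_validated_error(toy_xs, toy_ys, lam))
print("best regularization weight:", best_lam)
```

The same loop would work with any of your classifiers by swapping the fit and error functions.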
Overall, I enjoyed this project a lot. The results are solid and backed up mathematically, and it has real-world applications. There are several extensions to this paper that would make good follow-up lines of investigation and there are also several things that I would like to see in this paper but on the whole it is a great final project and was fun to read.