ben1605 / Soccer-match-outcome-prediction


Final Peer Review (ar2293) #8

Open renaqbj opened 6 years ago

renaqbj commented 6 years ago

This project tries to choose an accurate model to predict soccer match outcomes based on 25,000 FIFA matches in European leagues. The team considered many relevant features that might affect the outcome, including team formation, player ratings, head-to-head records, betting odds, and the output space. The team also used at least three methods from class to model the data. Overall, the report is well structured and well written, with good visualizations and explanations.

In the data cleaning part, the team gives a brief description of their process. As a reader, I would like to know more details about how you cleaned the data and why you chose to delete rows with missing values instead of imputing them. This part also does not mention other aspects of data cleaning, for example, how outliers are handled and whether the data needs standardization.
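To illustrate the trade-off between deleting rows and imputing, here is a minimal sketch with pandas; the column names and values are hypothetical, not from the actual dataset:

```python
import pandas as pd

# Hypothetical match data with some missing player ratings.
df = pd.DataFrame({
    "home_rating": [75.0, None, 82.0, 78.0],
    "away_rating": [70.0, 71.0, None, 74.0],
    "outcome": [1, 0, 2, 1],
})

# Option A: drop rows with any missing feature (what the report describes).
dropped = df.dropna()

# Option B: fill missing ratings with the column mean, keeping every match.
imputed = df.fillna(df[["home_rating", "away_rating"]].mean())

print(len(dropped), len(imputed))
```

With 25,000 matches, deletion may be harmless if few rows are affected, but the report should say how many rows were lost; if the count is large, imputation would preserve more training signal.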

In the data processing part, the team presents many colorful visualizations of the data distribution, which help the audience understand the data. However, I do not understand the usefulness of the team formation visualization. If you would like to know how formation affects the match outcome, a frequency table of the formation data might be more helpful than a layout diagram of each formation. The figures for player ratings look nice, but more explanation of the figures themselves would be beneficial, for example, what each axis means. The team also applies a transformation to the overall rating; I am not sure why that function makes a good transformation, so more explanation may be needed.
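The frequency table I have in mind could be as simple as the following sketch; the formation strings here are made up for illustration:

```python
from collections import Counter

# Hypothetical formation labels per match (the real data would be
# derived from the players' pitch coordinates).
formations = ["4-4-2", "4-3-3", "4-4-2", "3-5-2", "4-3-3", "4-4-2"]

# Count how often each formation occurs, most common first.
freq = Counter(formations)
for formation, count in freq.most_common():
    print(formation, count)
```

Cross-tabulating these counts against the match outcome would then show directly whether any formation is associated with winning more often.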

In the modeling part, the team uses linear regression, multi-class classification, and a decision tree. The team is on the right track in starting from the simplest model and then trying more complex models to improve accuracy. I am not sure why linear regression would be a better fit than quadratic regression or any other kind of regression; the team may need to take a closer look at the distribution of the data before deciding which regression model to use. When fitting the regression model, the team did not consider data normalization. Since the features have different ranges, standardizing them to the same scale would likely yield more accurate results in feature engineering.
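By standardization I mean rescaling each feature to zero mean and unit variance (a z-score). A minimal sketch with numpy, using made-up values for two features on very different scales, e.g. overall rating (0-100) versus betting odds (1-10):

```python
import numpy as np

# Hypothetical feature matrix: one column of ratings, one of odds.
X = np.array([[85.0, 1.5],
              [72.0, 3.2],
              [90.0, 1.2],
              [66.0, 4.0]])

# Standardize each column: subtract its mean, divide by its std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```

After this rescaling, no single feature dominates the fit just because its raw range is larger.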

The classification model is more convincing to me, and the use of binary encoding is appropriate. I noticed that the accuracy of the models is not very high. When this happens, the team could try different methods and rethink their data. For example: do outliers have a large effect on model accuracy, are the data normalized, did you test different train/test splits or use a cross-validation approach, did you try different loss functions and regularizers, etc.?
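On the cross-validation point, a sketch of what I mean with scikit-learn; the synthetic data below merely stands in for the match features and three-way outcome (home win / draw / away win):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the match feature matrix, purely illustrative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# 5-fold cross-validation: every sample is used for testing exactly once,
# giving a more stable accuracy estimate than one fixed train/test split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```

Reporting the mean and spread across folds would also make it clear whether a low accuracy is a property of the model or an artifact of one particular split.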

Overall this is a good report with a lot of information. Although the final model did not yield very high accuracy, the team explains well the features they are interested in and the models they selected. The report is separated into well-structured sections and is very readable.