The goal of the project is to predict the outcome of a soccer match using data that includes player ratings and betting information.
Here are a few things I like about the current method:
I like that you've run linear regression so that you know what the baseline for the problem should be. You now have a good estimate of how difficult the problem is.
I like that you're already thinking about overfitting and underfitting.
Here are a few suggestions I can think of:
Instead of using discrete values {-1, 1} to represent the betting odds, consider using the actual values, because 0.51 vs. 0.49 is very different from 0.9 vs. 0.1.
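To make the point about the odds concrete, here is a minimal sketch of what I have in mind. The function names are hypothetical, and I am assuming the odds can be expressed as (or converted to) probabilities for the two sides; if your data stores decimal odds instead, the helper `implied_probabilities` shows one common conversion (1/odds, rescaled to sum to 1).

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds (assumed format) to probabilities that sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 when the bookmaker takes a margin
    return [r / total for r in raw]


def discrete_feature(p_home, p_away):
    """The {-1, 1} encoding: only records which side is favored."""
    return 1 if p_home > p_away else -1


def continuous_feature(p_home, p_away):
    """Keeps the size of the gap between the two sides."""
    return p_home - p_away


close_match = (0.51, 0.49)
lopsided_match = (0.9, 0.1)

for label, (p_home, p_away) in [("close", close_match), ("lopsided", lopsided_match)]:
    print(label,
          "discrete:", discrete_feature(p_home, p_away),
          "continuous:", round(continuous_feature(p_home, p_away), 2))
```

Both matches get the same discrete value, while the continuous feature separates them, which is the information the regression currently cannot see.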
You may consider including weather information for the day of the match, such as the temperature and whether it's windy or sunny. That is also an important factor in soccer games. Including the players' country or continent of origin might also be helpful, since people from different regions tend to have different tolerances for heat and cold.
It's interesting that the coefficients differ quite a bit across players. Maybe you should take a few subsamples, run the linear regression again on each, and see how the coefficients vary.
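The subsampling check could look roughly like the sketch below. Everything here is illustrative: the data is synthetic, the feature names and `coefficient_spread` are hypothetical, and the small least-squares solver (no intercept) only stands in for whatever regression routine you already use.

```python
import random
import statistics


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def coefficient_spread(X, y, n_rounds=200, fraction=0.8, seed=0):
    """Refit on random subsamples; return (mean, std) for each coefficient."""
    rng = random.Random(seed)
    size = int(fraction * len(X))
    fits = []
    for _ in range(n_rounds):
        idx = rng.sample(range(len(X)), size)  # sample rows without replacement
        fits.append(fit_ols([X[i] for i in idx], [y[i] for i in idx]))
    return [(statistics.mean(c), statistics.stdev(c)) for c in zip(*fits)]


# Synthetic stand-in for the real data: three made-up player-rating features.
data_rng = random.Random(1)
true_coefs = [0.5, -0.2, 0.0]
X = [[data_rng.gauss(0, 1) for _ in true_coefs] for _ in range(300)]
y = [sum(c * v for c, v in zip(true_coefs, row)) + data_rng.gauss(0, 0.5)
     for row in X]

spread = coefficient_spread(X, y)
for name, (mean, sd) in zip(["player_a", "player_b", "player_c"], spread):
    print(f"{name}: mean={mean:.3f} std={sd:.3f}")
```

If a coefficient's standard deviation across subsamples is large compared with its mean, the differences between players may be mostly noise rather than a real effect.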