Torvaney / mezzala

Models for estimating football (soccer) team-strength
https://torvaney.github.io/mezzala/
Apache License 2.0

nan Issue for Some Predictions #2

Closed hedonistrh closed 2 years ago

hedonistrh commented 2 years ago

Hey @Torvaney, thanks a lot for the great repo. I was using the Dixon-Coles model for some calculations, but realized that in some cases we can end up with 'nan' for some probabilities. I tried some debugging but could not find the reason. I am sharing reproducible code below:

import mezzala

# Map the keys used in the match dicts onto the fields the model expects
adapter = mezzala.KeyAdapter(
    home_team='home_team_name',
    away_team='away_team_name',
    home_goals='home_score',
    away_goals='away_score',
)
# The first 4 weeks of the 2021-2022 Bundesliga season
previous_matches = [{'home_team_name': "Borussia M'Gladbach", 'away_team_name': 'Bayern München', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'SC Freiburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'FC Augsburg', 'away_team_name': 'TSG Hoffenheim', 'home_score': 0, 'away_score': 4}, {'home_team_name': 'Union Berlin', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'VfB Stuttgart', 'away_team_name': 'Greuther Fürth', 'home_score': 5, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'VfL Bochum', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Borussia Dortmund', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 5, 'away_score': 2}, {'home_team_name': 'Mainz 05', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': '1. FC Köln', 'away_team_name': 'Hertha BSC', 'home_score': 3, 'away_score': 1}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'VfB Stuttgart', 'home_score': 4, 'away_score': 0}, {'home_team_name': 'VfL Bochum', 'away_team_name': 'Mainz 05', 'home_score': 2, 'away_score': 0}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'FC Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'SC Freiburg', 'away_team_name': 'Borussia Dortmund', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Greuther Fürth', 'away_team_name': 'Arminia Bielefeld', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Hertha BSC', 'away_team_name': 'Wolfsburg', 'home_score': 1, 'away_score': 2}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': "Borussia M'Gladbach", 'home_score': 4, 'away_score': 0}, {'home_team_name': 'TSG Hoffenheim', 'away_team_name': 'Union Berlin', 'home_score': 2, 'away_score': 2}, {'home_team_name': 'Bayern München', 'away_team_name': '1. FC Köln', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Borussia Dortmund', 'away_team_name': 'TSG Hoffenheim', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Augsburg', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 4}, {'home_team_name': '1. FC Köln', 'away_team_name': 'Bochum', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Mainz 05', 'away_team_name': 'Greuther Fürth', 'home_score': 3, 'away_score': 0}, {'home_team_name': 'VfB Stuttgart', 'away_team_name': 'Freiburg', 'home_score': 2, 'away_score': 3}, {'home_team_name': 'Bayern München', 'away_team_name': 'Hertha BSC', 'home_score': 5, 'away_score': 0}, {'home_team_name': 'Union Berlin', 'away_team_name': "Borussia M'Gladbach", 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': 'Borussia Dortmund', 'home_score': 3, 'away_score': 4}, {'home_team_name': 'SC Freiburg', 'away_team_name': '1. FC Köln', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Greuther Fürth', 'away_team_name': 'Wolfsburg', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'TSG Hoffenheim', 'away_team_name': 'Mainz 05', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'Union Berlin', 'away_team_name': 'FC Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'Bayern München', 'home_score': 1, 'away_score': 4}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'VfB Stuttgart', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'VfL Bochum', 'away_team_name': 'Hertha BSC', 'home_score': 1, 'away_score': 3}, {'home_team_name': "Borussia M'Gladbach", 'away_team_name': 'Arminia Bielefeld', 'home_score': 3, 'away_score': 1}]

# Fit the model before predicting (same calls as in the later snippets in this thread)
model = mezzala.DixonColes(adapter=adapter)
model.fit(previous_matches)
match_to_predict = {'home_team_name': 'Wolfsburg', 'away_team_name': 'Eintracht Frankfurt'}
scorelines = model.predict_one(match_to_predict, 6)
print(scorelines)

When we check scorelines, we can see the following:

ScorelinePrediction(home_goals=0, away_goals=1, probability=nan)
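A quick way to list every affected scoreline, rather than spotting a single nan by eye, might look like the sketch below. It assumes the prediction objects expose `home_goals`, `away_goals` and `probability` attributes, as the printed output suggests; the namedtuple is only a stand-in so the snippet runs without mezzala, and the example probabilities are made up.

```python
import math
from collections import namedtuple

# Stand-in for mezzala's prediction objects; only the three fields visible
# in the printed output above are assumed.
ScorelinePrediction = namedtuple(
    "ScorelinePrediction", ["home_goals", "away_goals", "probability"]
)


def find_nan_scorelines(predictions):
    """Return the (home_goals, away_goals) pairs whose probability is nan."""
    return [
        (p.home_goals, p.away_goals)
        for p in predictions
        if math.isnan(p.probability)
    ]


# Made-up probabilities, for illustration only.
example = [
    ScorelinePrediction(0, 0, 0.08),
    ScorelinePrediction(0, 1, float("nan")),
    ScorelinePrediction(1, 0, 0.11),
]
print(find_nan_scorelines(example))  # [(0, 1)]
```

With the real model, the same helper could presumably be called on the list returned by `model.predict_one(...)`.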

hedonistrh commented 2 years ago

One possible explanation: when I check the model's parameters after fitting on those 4 weeks, Rho is -10105887.801111476, which affects the tau calculation significantly. 🤔 But I am still not sure what the reason for that is.

Torvaney commented 2 years ago

I think the issue is that some of the teams have inconsistent naming, leaving orphan teams:

import collections

collections.Counter(
  [m['away_team_name'] for m in previous_matches] + 
  [m['home_team_name'] for m in previous_matches]
)
Counter({'Bayern München': 4,
         'SC Freiburg': 3,
         'TSG Hoffenheim': 4,
         'Bayer Leverkusen': 4,
         'Greuther Fürth': 4,
         'VfL Bochum': 3,
         'Eintracht Frankfurt': 4,
         'RB Leipzig': 4,
         'Hertha BSC': 4,
         'VfB Stuttgart': 4,
         'Mainz 05': 4,
         'FC Augsburg': 3,
         'Borussia Dortmund': 4,
         'Arminia Bielefeld': 4,
         'Wolfsburg': 4,
         "Borussia M'Gladbach": 4,
         'Union Berlin': 4,
         '1. FC Köln': 4,
         'Bochum': 1,
         'Freiburg': 1,
         'Augsburg': 1})

I think it should work okay with consistent team names over this dataset
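One way to make the names consistent before fitting is an alias map. This is a minimal sketch: the three aliases are read off the Counter output above, and the helper name `normalise_team_names` is hypothetical, not part of mezzala.

```python
# Alias map read off the Counter output above; extend it for other data sources.
ALIASES = {
    "Bochum": "VfL Bochum",
    "Freiburg": "SC Freiburg",
    "Augsburg": "FC Augsburg",
}


def normalise_team_names(matches, aliases,
                         keys=("home_team_name", "away_team_name")):
    """Return copies of the match dicts with aliased team names replaced."""
    cleaned = []
    for match in matches:
        fixed = dict(match)  # copy, so the caller's data is left untouched
        for key in keys:
            fixed[key] = aliases.get(fixed[key], fixed[key])
        cleaned.append(fixed)
    return cleaned


# Two of the matches from the dataset above that use the short names.
sample = [
    {"home_team_name": "1. FC Köln", "away_team_name": "Bochum",
     "home_score": 2, "away_score": 1},
    {"home_team_name": "VfB Stuttgart", "away_team_name": "Freiburg",
     "home_score": 2, "away_score": 3},
]
for match in normalise_team_names(sample, ALIASES):
    print(match["home_team_name"], "v", match["away_team_name"])
```

Re-running the Counter on the normalised list should then show no team with a count of 1.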


This does raise the question of how the model should handle non-identifiable datasets, since failing silently like this is not helpful - do you have any ideas?
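One possible direction, sketched here under assumptions rather than as anything mezzala does today: a pre-fit check that fails loudly when a team has too few matches. The function name and the `min_matches=2` threshold are hypothetical, and this is only a heuristic, not a formal identifiability test.

```python
from collections import Counter


def check_team_coverage(matches, min_matches=2,
                        home_key="home_team_name", away_key="away_team_name"):
    """Raise ValueError if any team appears in fewer than min_matches matches.

    Returns the per-team match counts when every team passes the check.
    """
    counts = Counter()
    for match in matches:
        counts[match[home_key]] += 1
        counts[match[away_key]] += 1
    sparse = sorted(team for team, n in counts.items() if n < min_matches)
    if sparse:
        raise ValueError(
            f"Teams with fewer than {min_matches} matches: {sparse}"
        )
    return counts
```

Run against the first dataset in this thread, a check like this would flag 'Augsburg', 'Bochum' and 'Freiburg', each of which appears only once.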

hedonistrh commented 2 years ago

Thanks @Torvaney. I was using a combination of two datasets, which explains why those names are inconsistent. Thanks a lot for spotting that issue. That is also helpful for other parts of my project. 💯

On the other hand, this did not solve the "nan" issue. I am sharing reproducible code with consistent names:

import mezzala

# Map the keys used in the match dicts onto the fields the model expects
adapter = mezzala.KeyAdapter(
    home_team='home_team_name',
    away_team='away_team_name',
    home_goals='home_score',
    away_goals='away_score',
)
# The first 4 weeks of the 2021-2022 Bundesliga season
previous_matches = [{'home_team_name': 'B. Monchengladbach', 'away_team_name': 'Bayern Munich', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'Freiburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'Augsburg', 'away_team_name': 'Hoffenheim', 'home_score': 0, 'away_score': 4}, {'home_team_name': 'Union Berlin', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Stuttgart', 'away_team_name': 'Greuther Furth', 'home_score': 5, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'Bochum', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Dortmund', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 5, 'away_score': 2}, {'home_team_name': 'Mainz', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'FC Koln', 'away_team_name': 'Hertha Berlin', 'home_score': 3, 'away_score': 1}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'Stuttgart', 'home_score': 4, 'away_score': 0}, {'home_team_name': 'Bochum', 'away_team_name': 'Mainz', 'home_score': 2, 'away_score': 0}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'Freiburg', 'away_team_name': 'Dortmund', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Greuther Furth', 'away_team_name': 'Arminia Bielefeld', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Hertha Berlin', 'away_team_name': 'Wolfsburg', 'home_score': 1, 'away_score': 2}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': 'B. Monchengladbach', 'home_score': 4, 'away_score': 0}, {'home_team_name': 'Hoffenheim', 'away_team_name': 'Union Berlin', 'home_score': 2, 'away_score': 2}, {'home_team_name': 'Bayern Munich', 'away_team_name': 'FC Koln', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Dortmund', 'away_team_name': 'Hoffenheim', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Augsburg', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 4}, {'home_team_name': 'FC Koln', 'away_team_name': 'Bochum', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Mainz', 'away_team_name': 'Greuther Furth', 'home_score': 3, 'away_score': 0}, {'home_team_name': 'Stuttgart', 'away_team_name': 'Freiburg', 'home_score': 2, 'away_score': 3}, {'home_team_name': 'Bayern Munich', 'away_team_name': 'Hertha Berlin', 'home_score': 5, 'away_score': 0}, {'home_team_name': 'Union Berlin', 'away_team_name': 'B. Monchengladbach', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': 'Dortmund', 'home_score': 3, 'away_score': 4}, {'home_team_name': 'Freiburg', 'away_team_name': 'FC Koln', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Greuther Furth', 'away_team_name': 'Wolfsburg', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'Hoffenheim', 'away_team_name': 'Mainz', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'Union Berlin', 'away_team_name': 'Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'Bayern Munich', 'home_score': 1, 'away_score': 4}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'Stuttgart', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Bochum', 'away_team_name': 'Hertha Berlin', 'home_score': 1, 'away_score': 3}, {'home_team_name': 'B. Monchengladbach', 'away_team_name': 'Arminia Bielefeld', 'home_score': 3, 'away_score': 1}]
model = mezzala.DixonColes(adapter=adapter)
model.fit(previous_matches)
match_to_predict = {'home_team_name': 'Wolfsburg', 'away_team_name': 'Eintracht Frankfurt'}
scorelines = model.predict_one(match_to_predict, 6)
print(scorelines)

We can still see the following "nan" entry:

ScorelinePrediction(home_goals=0, away_goals=1, probability=nan)

I am also sharing the counter for that data:

Counter({'Bayern Munich': 4,
         'Freiburg': 4,
         'Hoffenheim': 4,
         'Bayer Leverkusen': 4,
         'Greuther Furth': 4,
         'Bochum': 4,
         'Eintracht Frankfurt': 4,
         'RB Leipzig': 4,
         'Hertha Berlin': 4,
         'Stuttgart': 4,
         'Mainz': 4,
         'Augsburg': 4,
         'Dortmund': 4,
         'Arminia Bielefeld': 4,
         'Wolfsburg': 4,
         'B. Monchengladbach': 4,
         'Union Berlin': 4,
         'FC Koln': 4})

Torvaney commented 2 years ago

Ah yes, thanks for the clarification.

I think the core issue is that the model is actually underspecified. There is a constraint on Rho that isn't implemented (usually the optimisation proceeds fine without it), and its absence can result in invalid probability estimates. In this case, the fact that Wolfsburg have only conceded 1 goal leads to their defence parameter being extremely low (about 0.00000004). At this point the Rho-adjustment is larger than the estimated probability of observing a 0-1 scoreline. This takes the probability negative, which is impossible, thus resulting in a nan probability.
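To make the mechanism concrete, here is a small sketch of the low-score adjustment factor (tau) as it is usually written for the Dixon-Coles model, assuming mezzala follows that standard formulation. The goal rates are illustrative assumptions, not the fitted values from this thread; only the Rho value is the one reported above.

```python
import math


def poisson_pmf(k, rate):
    """Probability of k goals under a Poisson distribution with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def tau(home_goals, away_goals, home_rate, away_rate, rho):
    """Dixon-Coles adjustment; differs from 1 only for 0-0, 0-1, 1-0 and 1-1."""
    if (home_goals, away_goals) == (0, 0):
        return 1 - home_rate * away_rate * rho
    if (home_goals, away_goals) == (0, 1):
        return 1 + home_rate * rho
    if (home_goals, away_goals) == (1, 0):
        return 1 + away_rate * rho
    if (home_goals, away_goals) == (1, 1):
        return 1 - rho
    return 1.0


def scoreline_probability(home_goals, away_goals, home_rate, away_rate, rho):
    """Adjusted probability: tau times two independent Poisson terms."""
    return (
        tau(home_goals, away_goals, home_rate, away_rate, rho)
        * poisson_pmf(home_goals, home_rate)
        * poisson_pmf(away_goals, away_rate)
    )


# Illustrative rates (assumptions, not the fitted values from this thread).
HOME_RATE, AWAY_RATE = 1.5, 1.0
FITTED_RHO = -10105887.801111476  # the Rho reported earlier in the thread

# A modest rho keeps the 0-1 probability positive ...
print(scoreline_probability(0, 1, HOME_RATE, AWAY_RATE, rho=-0.1))
# ... while the runaway rho drives it far below zero.
print(scoreline_probability(0, 1, HOME_RATE, AWAY_RATE, rho=FITTED_RHO))
```

How the negative value then surfaces as nan depends on mezzala's internals, which this sketch does not try to reproduce.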

As a workaround, I think the easiest way to amend the issue for now would be to manually reset Rho after the fact, perhaps to a value fit over a larger sample.

model.params[mezzala.RHO_KEY] = some_reasonable_value

Pretty hacky, I know.
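A different stopgap from the one suggested above would be to clamp Rho into the interval where every low-score adjustment stays non-negative, which for the standard Dixon-Coles formulation is max(-1/λ, -1/μ) ≤ ρ ≤ min(1/(λμ), 1) for home rate λ and away rate μ. The sketch below assumes you already have positive expected goal rates for the fixtures of interest; the helper names are hypothetical and the rate pairs are made up.

```python
def valid_rho_interval(rate_pairs):
    """Intersect the rho bounds over (home_rate, away_rate) pairs.

    All rates are assumed to be strictly positive.
    """
    lower = max(max(-1 / lam, -1 / mu) for lam, mu in rate_pairs)
    upper = min(min(1 / (lam * mu), 1.0) for lam, mu in rate_pairs)
    return lower, upper


def clamp_rho(rho, rate_pairs):
    """Pull rho back into the interval that keeps every tau non-negative."""
    lower, upper = valid_rho_interval(rate_pairs)
    return min(max(rho, lower), upper)


# Made-up (home_rate, away_rate) pairs, for illustration only.
rates = [(1.5, 1.0), (2.0, 0.5)]
print(valid_rho_interval(rates))
print(clamp_rho(-10105887.801111476, rates))  # -0.5
print(clamp_rho(0.25, rates))                 # 0.25
```

The clamped value could then be assigned with `model.params[mezzala.RHO_KEY] = ...` as in the workaround above. Note that clamping only repairs Rho; team parameters fitted alongside a runaway Rho may still be unreliable.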

hedonistrh commented 2 years ago

Thanks a lot for your answer. I applied the hacky workaround you mentioned and it is working now. Also, as you suggested, when we use more data we do not end up with a "nan" probability either. 💯 I am sharing the latest code with the hacky solution:

import mezzala

# Map the keys used in the match dicts onto the fields the model expects
adapter = mezzala.KeyAdapter(
    home_team='home_team_name',
    away_team='away_team_name',
    home_goals='home_score',
    away_goals='away_score',
)
# The first 4 weeks of the 2021-2022 Bundesliga season
previous_matches = [{'home_team_name': 'B. Monchengladbach', 'away_team_name': 'Bayern Munich', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'Freiburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'Augsburg', 'away_team_name': 'Hoffenheim', 'home_score': 0, 'away_score': 4}, {'home_team_name': 'Union Berlin', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Stuttgart', 'away_team_name': 'Greuther Furth', 'home_score': 5, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'Bochum', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Dortmund', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 5, 'away_score': 2}, {'home_team_name': 'Mainz', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'FC Koln', 'away_team_name': 'Hertha Berlin', 'home_score': 3, 'away_score': 1}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'Stuttgart', 'home_score': 4, 'away_score': 0}, {'home_team_name': 'Bochum', 'away_team_name': 'Mainz', 'home_score': 2, 'away_score': 0}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'Freiburg', 'away_team_name': 'Dortmund', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Greuther Furth', 'away_team_name': 'Arminia Bielefeld', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Hertha Berlin', 'away_team_name': 'Wolfsburg', 'home_score': 1, 'away_score': 2}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': 'B. Monchengladbach', 'home_score': 4, 'away_score': 0}, {'home_team_name': 'Hoffenheim', 'away_team_name': 'Union Berlin', 'home_score': 2, 'away_score': 2}, {'home_team_name': 'Bayern Munich', 'away_team_name': 'FC Koln', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Dortmund', 'away_team_name': 'Hoffenheim', 'home_score': 3, 'away_score': 2}, {'home_team_name': 'Arminia Bielefeld', 'away_team_name': 'Eintracht Frankfurt', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Augsburg', 'away_team_name': 'Bayer Leverkusen', 'home_score': 1, 'away_score': 4}, {'home_team_name': 'FC Koln', 'away_team_name': 'Bochum', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Mainz', 'away_team_name': 'Greuther Furth', 'home_score': 3, 'away_score': 0}, {'home_team_name': 'Stuttgart', 'away_team_name': 'Freiburg', 'home_score': 2, 'away_score': 3}, {'home_team_name': 'Bayern Munich', 'away_team_name': 'Hertha Berlin', 'home_score': 5, 'away_score': 0}, {'home_team_name': 'Union Berlin', 'away_team_name': 'B. Monchengladbach', 'home_score': 2, 'away_score': 1}, {'home_team_name': 'Wolfsburg', 'away_team_name': 'RB Leipzig', 'home_score': 1, 'away_score': 0}, {'home_team_name': 'Bayer Leverkusen', 'away_team_name': 'Dortmund', 'home_score': 3, 'away_score': 4}, {'home_team_name': 'Freiburg', 'away_team_name': 'FC Koln', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Greuther Furth', 'away_team_name': 'Wolfsburg', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'Hoffenheim', 'away_team_name': 'Mainz', 'home_score': 0, 'away_score': 2}, {'home_team_name': 'Union Berlin', 'away_team_name': 'Augsburg', 'home_score': 0, 'away_score': 0}, {'home_team_name': 'RB Leipzig', 'away_team_name': 'Bayern Munich', 'home_score': 1, 'away_score': 4}, {'home_team_name': 'Eintracht Frankfurt', 'away_team_name': 'Stuttgart', 'home_score': 1, 'away_score': 1}, {'home_team_name': 'Bochum', 'away_team_name': 'Hertha Berlin', 'home_score': 1, 'away_score': 3}, {'home_team_name': 'B. Monchengladbach', 'away_team_name': 'Arminia Bielefeld', 'home_score': 3, 'away_score': 1}]
model = mezzala.DixonColes(adapter=adapter)
model.fit(previous_matches)
# Workaround: overwrite the fitted Rho with a manually chosen value
model.params[mezzala.RHO_KEY] = 0.25
match_to_predict = {'home_team_name': 'Wolfsburg', 'away_team_name': 'Eintracht Frankfurt'}
scorelines = model.predict_one(match_to_predict, 6)
print(scorelines)

P.S. I really liked your StatsBomb conference talk. Thanks for preparing it and putting it online. 🙏🏼

Torvaney commented 2 years ago

Thanks, @hedonistrh! I'm going to close this issue for now. I have raised a new one (#3) for the missing constraint.