Closed nbarlowATI closed 3 years ago
Hi, I have been reading up a bit on this project and would like to contribute more to it. I was thinking maybe I can help collect this data? I also checked that the football-data API has the minute information for goals scored and the substitution information too. I can try and integrate both of these if that's okay?
Sure, that would be great @chahak13 ! Let us know if there's anything we can help with.
Hi, I wanted your opinion on where exactly we want to store the goal-scoring time data. In the player details JSON or in the fixture information CSV? Since the goal data is stored for individual players and then summed, I guess it would make sense to put this info in the individual player info itself, but I just wanted to confirm it once.
Right, good question. I think we need two things (which can be treated and saved separately):
As a starting point I think it would be fine (and maybe preferable actually) for these to be two separate new files (per season, and back to the 15/16 season if possible eventually), rather than incorporating them in the existing files.
Ultimately we'll need to decide where to add these in the database (which is defined in airsenal/framework/schema.py). Maybe:
- A Goal table with fields fixture, minute, team, player.
- minute_started and minute_finished fields added to the PlayerScore table.

Also no need to do all of this by the way (though you're welcome to!), e.g. a pull request with working API queries would be a great start.
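To make the two suggestions above concrete, here is a minimal sketch of what the tables could look like. It uses the standard-library sqlite3 module as a stand-in rather than the project's real definitions in airsenal/framework/schema.py; the table names, column types and sample rows are assumptions for illustration only, and a real PlayerScore table would have other columns besides the ones shown.

```python
import sqlite3

# Assumption: stdlib sqlite3 stands in for the project's real schema definitions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE goal (
        id INTEGER PRIMARY KEY,
        fixture INTEGER NOT NULL,
        minute INTEGER NOT NULL,
        team TEXT NOT NULL,
        player TEXT NOT NULL
    );
    -- Only the proposed new columns are shown for player_score.
    CREATE TABLE player_score (
        id INTEGER PRIMARY KEY,
        fixture INTEGER NOT NULL,
        player TEXT NOT NULL,
        minute_started INTEGER,
        minute_finished INTEGER
    );
    """
)

# Hypothetical rows, purely for illustration.
conn.execute(
    "INSERT INTO goal (fixture, minute, team, player) VALUES (?, ?, ?, ?)",
    (1, 8, "Team A", "Player X"),
)
conn.execute(
    "INSERT INTO player_score (fixture, player, minute_started, minute_finished) "
    "VALUES (?, ?, ?, ?)",
    (1, "Player X", 0, 88),
)
conn.commit()

goal_rows = conn.execute(
    "SELECT fixture, minute, team, player FROM goal"
).fetchall()
score_rows = conn.execute(
    "SELECT player, minute_started, minute_finished FROM player_score"
).fetchall()
```

Keeping goals in their own table means one row per goal, while the start and finish minutes sit alongside each player's existing per-match record.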
Okay, got it. Thanks! I'll start working on getting the goal-scoring times first. There is one small issue, though, that I did not foresee. The football-data.org API doesn't have this information in the free tier, only in the free+three tier, which is 25 EUR/month. If we do have that subscription, then we can get the time scored, the player who scored, and the substitution information in the same response. Can you please suggest how I should proceed?
Ah right, that's a shame. It's not something we have a subscription for (and we want AIrsenal to be as accessible as possible). Really we'd want something that's free and we can use the data as we like. I'll try to have a search around to see if there are any other sources that look promising.
Understat might be worth a try, I'm not sure they have an API but people have got data from there if you search around: https://understat.com/ (usually for expected goal stats but they might have other info available)
That's great, thanks! I'll look at that and also see if I can find something else.
If you do look at Understat I know this repo has some data available: https://github.com/vaastav/Fantasy-Premier-League (not what we need for this but may be a helpful start, e.g. this script and an example file).
Time of subs and goals is available on the web page for each match (e.g. https://understat.com/match/16435) but probably not in an easy to read/parse format.
I did take a look at the repo you mentioned. I believe that's also the source for the data files that we're currently looking at? There's also another understat python library which provides an interface to understat but it doesn't seem to be pulling in goal information. Scraping understat was also what I was going to try next.
I did find this API which has the event data that we want but seems to be providing 100 requests/day in the free plan. I'm not sure if that will be enough for us to repopulate all the previous data but maybe for new matchdays it might be sufficient? What are your thoughts about it?
PS: Sorry but I'll not be able to work on this much for the next two days since I have a couple of assignments going on right now.
No rush in getting this done, good luck with the assignments!
API-football looks interesting, I think I've come across it before but haven't used it. 100 requests/day is probably enough to do something useful (assuming we only need one request per match), we could set up a script to slowly gather the data for past seasons (would take about 4 days per season). We should check they'd allow us to save data from there in the repo, though (I can do that once we confirm the API does have what we need).
Looking at the docs it might be possible to request more than one fixture at a time. If that's the case the 100 requests would go a long way.
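The "about 4 days per season" estimate can be checked with a quick sketch, assuming one request per match and 380 matches in a Premier League season (20 teams, 38 games each). The function names here are hypothetical helpers, not part of any existing script.

```python
import math

MATCHES_PER_SEASON = 380  # 20 teams * 38 matches each / 2 teams per match
DAILY_QUOTA = 100  # free-plan limit mentioned above


def days_needed(n_requests, daily_quota=DAILY_QUOTA):
    """Number of days required to make n_requests under a daily quota."""
    return math.ceil(n_requests / daily_quota)


def daily_batches(fixture_ids, daily_quota=DAILY_QUOTA):
    """Split fixture ids into chunks that each fit within one day's quota."""
    ids = list(fixture_ids)
    return [ids[i:i + daily_quota] for i in range(0, len(ids), daily_quota)]


season_days = days_needed(MATCHES_PER_SEASON)
batches = daily_batches(range(MATCHES_PER_SEASON))
```

A slow backfill script could then process one batch per day and stop, which keeps it under the quota without any extra bookkeeping.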
@jack89roberts While the fixtures/ endpoint allows us to get information related to multiple fixtures at the same time, the fixtures/events endpoint accepts only one fixture per request, from what I could understand. I have a mock version running for both goals and substitutions info.
import requests
from pprint import pprint

endpoint = "https://v3.football.api-sports.io/fixtures"
headers = {
    "x-rapidapi-host": "v3.football.api-sports.io",
    "x-rapidapi-key": "xxx",  # placeholder, replace with your own API key
}
# Ask for the last 4 fixtures of league 39 in the 2021 season.
params = {"league": 39, "season": 2021, "last": 4}

response = requests.get(endpoint, params=params, headers=headers)
# Stop here on an HTTP error, otherwise `result` below would be undefined.
response.raise_for_status()
result = response.json()

# Map fixture id -> teams and final score.
fixtures = {}
for fixture in result.get("response", []):
    idx = fixture["fixture"]["id"]
    home = fixture["teams"]["home"]["name"]
    away = fixture["teams"]["away"]["name"]
    fixtures[idx] = {"home": home, "away": away, "score": fixture.get("goals")}

goal_info = {}
sub_info = {}
event_endpoint = "https://v3.football.api-sports.io/fixtures/events"

# The events endpoint is queried once per fixture.
for fixture_id in fixtures:
    goal_info[fixture_id] = []
    sub_info[fixture_id] = []
    event_response = requests.get(
        event_endpoint, params={"fixture": fixture_id}, headers=headers
    )
    if not event_response.ok:
        continue  # leave the lists empty for this fixture
    for event in event_response.json().get("response", []):
        event_type = event.get("type")
        if event_type == "Goal":
            goal_info[fixture_id].append(
                {
                    "team": event["team"]["name"],
                    "goal_time": event["time"]["elapsed"],
                    "scorer": event.get("player"),
                    "assist": event.get("assist"),
                }
            )
        elif event_type == "subst":
            # In this mock, "assist" is stored as the player coming on
            # and "player" as the one going off.
            sub_info[fixture_id].append(
                {
                    "team": event["team"]["name"],
                    "in": event.get("assist"),
                    "out": event.get("player"),
                }
            )

print("Goal Info:")
pprint(goal_info)
print("\n\nSub Info:")
pprint(sub_info)
Furthermore, I have also attached a text file documenting the results/responses of the API. I did not want to upload it to git right now because of the concern you had brought up last time about them allowing it or not. If this looks good, then I can work on integrating it into the data we have and create a PR. Let me know.
Thanks @chahak13 that's really great work! It looks like it mostly fits what we need, but are the times for substitutes available?
A potentially bigger caveat is I tried registering for an account myself and noticed it asks for payment details, even for the free subscription. And if you exceed the free quota it's possible you'll be charged ("overage charges" - see here). I've removed the API key from your comment and text file above so others can't pick it up and use it (but you may want to deactivate that one and make a new one yourself to be safe).
I don't think this is compatible with us wanting to keep AIrsenal completely free and open, unfortunately (unless you found a way to register without payment details?)
I’m sorry, I missed removing the key entirely. Yes, the times for the substitutions are also available. I just did not store them right now.
Also, yeah, it is true, I had to submit my PayPal details to get the free subscription. I opted for the PayPal option just so that the overage charge that you mention doesn’t apply, but I did have to connect it to PayPal. Let me try and find out if there’s an entirely free way to do this; otherwise I’ll try going back to scraping the understat.com site.
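On the substitution times mentioned above: a minimal sketch of how the minute could be stored alongside each substitution, reusing the same `time` / `elapsed` field the mock already reads for goals. The sample events are made up, and the assumption that `player` and `assist` are objects with a `name` key should be checked against a real response.

```python
def parse_substitution(event):
    """Return a substitution record with its minute, or None for other events."""
    if event.get("type") != "subst":
        return None
    return {
        "team": event["team"]["name"],
        "minute": event["time"]["elapsed"],
        # Same mapping as the mock above: "assist" comes on, "player" goes off.
        "in": event["assist"]["name"],
        "out": event["player"]["name"],
    }


# Hypothetical events, shaped like the fields the mock above reads.
sample_events = [
    {
        "type": "Goal",
        "team": {"name": "Team A"},
        "time": {"elapsed": 12},
        "player": {"name": "Player X"},
        "assist": {"name": "Player Y"},
    },
    {
        "type": "subst",
        "team": {"name": "Team A"},
        "time": {"elapsed": 64},
        "player": {"name": "Player X"},
        "assist": {"name": "Player Z"},
    },
]

subs = [s for s in map(parse_substitution, sample_events) if s is not None]
```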
Hey @jack89roberts, I decided on scraping the understat site to get the information we need. I've used BeautifulSoup to do it for now. It's slightly slow, but I still got all the info for EPL 2020 in 4 mins, so it's not that bad either. I've attached the information as a JSON file. Let me know your thoughts about it. epl_2020.txt
Example:
{
match_id: {
"home": home_team,
"away": away_team,
"goals": [
[
goal_scorer,
goal_time
]
],
"subs": [
[
player_out,
player_in,
sub_time
]
]
}
}
{
"14086": {
"home": "Fulham",
"away": "Arsenal",
"goals": [
[
"Alexandre Lacazette",
"8"
],
[
"Gabriel",
"48"
],
[
"Pierre-Emerick Aubameyang",
"56"
]
],
"subs": [
[
"Neeskens Kebano",
"Franck Zambo",
"64"
],
[
"Josh Onomah",
"Bobby Reid",
"76"
],
[
"Willian",
"Nicolas Pepe",
"77"
],
[
"Granit Xhaka",
"Dani Ceballos",
"80"
],
[
"Alexandre Lacazette",
"Eddie Nketiah",
"88"
]
]
},
"14087": {
"home": "Crystal Palace",
"away": "Southampton",
"goals": [
[
"Wilfried Zaha",
"12"
]
],
"subs": [
[
"Jan Bednarek",
"Jannik Vestergaard",
"48"
],
[
"James McCarthy",
"Luka Milivojevic",
"76"
],
[
"William Smallbone",
"Moussa Djenepo",
"79"
],
[
"Jeffrey Schlupp",
"Eberechi Eze",
"83"
],
[
"Che Adams",
"Shane Long",
"87"
]
]
},
"14090": {
"home": "Liverpool",
"away": "Leeds",
"goals": [
[
"Jack Harrison",
"11"
],
[
"Virgil van Dijk",
"19"
],
[
"Patrick Bamford",
"29"
],
[
"Mohamed Salah",
"32"
],
[
"Mateusz Klich",
"65"
]
],
"subs": [
[
"Naby Keita",
"Fabinho",
"60"
],
[
"Patrick Bamford",
"Rodrigo",
"64"
],
[
"Jordan Henderson",
"Curtis Jones",
"68"
],
[
"Mateusz Klich",
"Jamie Shackleton",
"83"
],
[
"Trent Alexander-Arnold",
"Joel Matip",
"91"
]
]
},
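As a sketch of how records in this format could feed the minute_started and minute_finished fields discussed earlier, the function below derives an interval for every player involved in a substitution. It assumes the `[player_out, player_in, sub_time]` ordering from the example, treats full time as minute 90 (stoppage time is ignored), and the function name is hypothetical. Starters who play the whole match do not appear, because the record contains no line-up.

```python
FULL_TIME = 90  # assumption: stoppage time is not modelled


def substitution_intervals(match, full_time=FULL_TIME):
    """Map each substituted player to (minute_started, minute_finished)."""
    intervals = {}
    for player_out, player_in, sub_time in match["subs"]:
        minute = int(sub_time)  # minutes are stored as strings in the JSON
        # Keep the earlier start if this player had come on as a substitute.
        start = intervals.get(player_out, (0, None))[0]
        intervals[player_out] = (start, minute)
        # max() covers substitutions recorded after minute 90, e.g. "91".
        intervals[player_in] = (minute, max(minute, full_time))
    return intervals


# Match 14086 from the example above.
match = {
    "home": "Fulham",
    "away": "Arsenal",
    "goals": [
        ["Alexandre Lacazette", "8"],
        ["Gabriel", "48"],
        ["Pierre-Emerick Aubameyang", "56"],
    ],
    "subs": [
        ["Neeskens Kebano", "Franck Zambo", "64"],
        ["Josh Onomah", "Bobby Reid", "76"],
        ["Willian", "Nicolas Pepe", "77"],
        ["Granit Xhaka", "Dani Ceballos", "80"],
        ["Alexandre Lacazette", "Eddie Nketiah", "88"],
    ],
}

intervals = substitution_intervals(match)
```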
Nice, looks good, definitely would be happy to have a pull request for this! Couple of small comments (but they don't have to be done for the PR):
I also noticed there's an Understat Python package: https://github.com/amosbastian/understat . I don't think they have this data available so you could see if they'd be interested in having this contributed over there too (but definitely do put it in an AIrsenal pull request 😄 )
That's great. I'll polish it up a bit and raise a PR. I'll start thinking about the other things and maybe raise another PR once I get on it.
Currently from the FPL API we have no way of knowing:
This means that we don't know whether a player was on the pitch when a goal was scored, which would be useful information for evaluating their defensive strength.
Is there some other API we could get this from? (maybe https://www.football-data.org/ or similar?) Or scrape from match reports?
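Once start and finish minutes and goal times are available from some source, the on-pitch question above reduces to an interval check. The sketch below is one hedged way to write it; the function names are hypothetical, and with minute-level data a goal in the same minute as a substitution cannot be attributed reliably, so both boundaries are counted as on the pitch here.

```python
def on_pitch(goal_minute, minute_started, minute_finished):
    """True if the goal minute falls within the player's time on the pitch."""
    return minute_started <= goal_minute <= minute_finished


def goals_while_on_pitch(goal_minutes, minute_started, minute_finished):
    """Return the goal minutes that occurred while the player was playing."""
    return [
        m for m in goal_minutes if on_pitch(m, minute_started, minute_finished)
    ]


# Goals conceded by Fulham in match 14086 from the thread (minutes 8, 48, 56).
conceded = [8, 48, 56]
# Neeskens Kebano went off in minute 64, replaced by Franck Zambo.
kebano = goals_while_on_pitch(conceded, 0, 64)
zambo = goals_while_on_pitch(conceded, 64, 90)
```

Here the player who started was on the pitch for all three goals conceded, while his replacement was on for none, which is the distinction a defensive-strength estimate would need.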