estiens / world_cup_json

Rails backend for a scraper that outputs World Cup data as JSON

Duplicate goals in match data #231

Closed ipince closed 1 year ago

ipince commented 1 year ago

It seems that some goals are coming in as duplicates.

For example, here's the (truncated) response from today's Spain vs Japan game: https://worldcupjson.net/matches/43

// 20221201230017
// https://worldcupjson.net/matches/43

{
  "id": 43,
  ...
  "home_team_events": [
  ...
    {
      "id": 1666,
      "type_of_event": "goal",
      "player": "Ritsu Doan",
      "time": "48'",
      "extra_info": null
    },
    {
      "id": 1668,
      "type_of_event": "goal",
      "player": "Ao Tanaka",
      "time": "51'",
      "extra_info": null
    },
    {
      "id": 1667,                             # <-------- DUPLICATE
      "type_of_event": "goal",
      "player": "Ao Tanaka",
      "time": "53'",
      "extra_info": null
    },
  ],
  "away_team_events": [
    {
      "id": 1658,
      "type_of_event": "goal",
      "player": "Alvaro Morata",
      "time": "11'",
      "extra_info": null
    },
    {
      "id": 1657,                         # <--------------- DUPLICATE
      "type_of_event": "goal",
      "player": "Alvaro Morata",
      "time": "12'",
      "extra_info": null
    },
  ],
  ...
}

It happens fairly often. I think some other events may have duplicates too, but goals are the most important.

Where is the goal data coming from? My guess is that data is being merged from a couple of different sources that report different times for the same goal. Maybe one of the data sources is bad and should be dropped? Or maybe use more data sources and dedupe somehow?

ipince commented 1 year ago

Well, it seems the data comes from FIFA itself. I tried playing around with the FIFA API URLs to see if I could trace the duplicates back to their API, but I couldn't get the URLs to work. Someone with some background should be able to check fairly quickly, I think.

estiens commented 1 year ago

This can happen if a goal is awarded, then rescinded, and then made official again, as in that match.

There's no easy way to prevent this because there is no canonical key for such a thing. In past years we have gone back and finalized all events, but I think for now we'll just have to accept that if a goal comes and goes and comes back, it might show up twice. The score is not based on goal events, so it should be okay.
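
For anyone consuming the API who wants to hide these repeats on the client side, here is a rough sketch of a heuristic. It is not part of this project, and `parse_minute`, `dedupe_events`, and the 2-minute `window` are hypothetical names and values chosen for illustration. It treats two events with the same `type_of_event` and `player` as one if their times are within `window` minutes, keeping the earliest. It assumes stoppage time appears as something like `45'+2'`, which the sample above does not confirm, and it will wrongly merge a genuine quick brace by the same player.

```python
import re


def parse_minute(time_str):
    """Turn a time such as "51'" into 51.

    Assumption: stoppage time looks like "45'+2'", which is summed to 47.
    """
    return sum(int(n) for n in re.findall(r"\d+", time_str))


def dedupe_events(events, window=2):
    """Drop events that repeat an already kept event.

    Two events count as the same if type_of_event and player match and
    the later one falls within `window` minutes of the kept one.
    """
    kept = []
    for event in sorted(events, key=lambda e: parse_minute(e["time"])):
        minute = parse_minute(event["time"])
        is_repeat = any(
            k["type_of_event"] == event["type_of_event"]
            and k["player"] == event["player"]
            and minute - parse_minute(k["time"]) <= window
            for k in kept
        )
        if not is_repeat:
            kept.append(event)
    return kept


# Events copied from the truncated response above.
events = [
    {"id": 1666, "type_of_event": "goal", "player": "Ritsu Doan", "time": "48'"},
    {"id": 1668, "type_of_event": "goal", "player": "Ao Tanaka", "time": "51'"},
    {"id": 1667, "type_of_event": "goal", "player": "Ao Tanaka", "time": "53'"},
]
unique_ids = [e["id"] for e in dedupe_events(events)]
```

With the events listed in this issue, that drops 1667 and keeps 1666 and 1668. Since the score is not derived from goal events, the match score remains the safer thing to display; a filter like this only tidies up the event list.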