HarryShomer / Hockey-Scraper

Python Package for scraping NHL Play-by-Play and Shift data
http://hockey-data.harryshomer.com
GNU General Public License v3.0

Json pbp scraping gives incorrect xC/yC data. #38

Open JB13 opened 6 months ago

JB13 commented 6 months ago

I was pulling down data (using both scrape_seasons and scrape_games) and I noticed that ~30% of shots either had no xC/yC data, or just had it listed at one of the bullet points:

image

Looking into it a bit, it looks like the "eventId" values in the JSON are no longer guaranteed to be in order (see the JSON snippet below). In json_pbp.py, I removed the sorted_events logic and now get data in the "right" order:

image

Not sorting seems to mostly work. I still need to investigate cases where the HTML event count != the JSON event count. Sorting by seconds_elapsed doesn't work well when a stoppage and the faceoff that follows it share the same timestamp.

I might have time to try to find a more elegant fix for this (and maybe add a test that grabs a couple of plays from a game to confirm it's being parsed correctly in the future). But I wanted to write this down in case anyone else is looking at it.
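For anyone who wants to reproduce the ordering problem, here is a rough sketch of the kind of check such a test could make. It uses only the five plays from the snippet below, trimmed to the fields needed. The helper names (seconds_elapsed, is_chronological) are made up for illustration and are not part of hockey_scraper, and a single period is assumed.

```python
# Stand-ins for the five plays in the JSON snippet, in the order the feed delivers them.
PLAYS = [
    {"eventId": 102, "timeInPeriod": "00:00", "typeDescKey": "period-start"},
    {"eventId": 101, "timeInPeriod": "00:00", "typeDescKey": "faceoff"},
    {"eventId": 8, "timeInPeriod": "00:35", "typeDescKey": "stoppage"},
    {"eventId": 103, "timeInPeriod": "00:35", "typeDescKey": "faceoff"},
    {"eventId": 9, "timeInPeriod": "00:48", "typeDescKey": "hit"},
]


def seconds_elapsed(play):
    """Convert the "MM:SS" timeInPeriod string to seconds (hypothetical helper)."""
    minutes, seconds = play["timeInPeriod"].split(":")
    return int(minutes) * 60 + int(seconds)


def is_chronological(plays):
    """True if the clock never runs backwards in the list (single period assumed)."""
    times = [seconds_elapsed(p) for p in plays]
    return all(a <= b for a, b in zip(times, times[1:]))


def test_event_order():
    # The order delivered by the feed is chronological...
    assert is_chronological(PLAYS)
    # ...but sorting by eventId pulls the stoppage (8) and hit (9) to the front.
    by_event_id = sorted(PLAYS, key=lambda p: p["eventId"])
    assert [p["eventId"] for p in by_event_id] == [8, 9, 101, 102, 103]
    assert not is_chronological(by_event_id)
```

A real test would presumably run against a saved game file rather than hand-copied dicts, but the assertion idea is the same.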

"plays": [
        {
            "eventId": 102,
            "periodDescriptor": {
                "number": 1,
                "periodType": "REG"
            },
            "timeInPeriod": "00:00",
            "timeRemaining": "20:00",
            "situationCode": "1551",
            "homeTeamDefendingSide": "left",
            "typeCode": 520,
            "typeDescKey": "period-start",
            "sortOrder": 8
        },
        {
            "eventId": 101,
            "periodDescriptor": {
                "number": 1,
                "periodType": "REG"
            },
            "timeInPeriod": "00:00",
            "timeRemaining": "20:00",
            "situationCode": "1551",
            "homeTeamDefendingSide": "left",
            "typeCode": 502,
            "typeDescKey": "faceoff",
            "sortOrder": 9,
            "details": {
                "eventOwnerTeamId": 18,
                "losingPlayerId": 8478519,
                "winningPlayerId": 8475158,
                "xCoord": 0,
                "yCoord": 0,
                "zoneCode": "N"
            }
        },
        {
            "eventId": 8,
            "periodDescriptor": {
                "number": 1,
                "periodType": "REG"
            },
            "timeInPeriod": "00:35",
            "timeRemaining": "19:25",
            "situationCode": "1551",
            "homeTeamDefendingSide": "left",
            "typeCode": 516,
            "typeDescKey": "stoppage",
            "sortOrder": 15,
            "details": {
                "reason": "icing"
            }
        },
        {
            "eventId": 103,
            "periodDescriptor": {
                "number": 1,
                "periodType": "REG"
            },
            "timeInPeriod": "00:35",
            "timeRemaining": "19:25",
            "situationCode": "1551",
            "homeTeamDefendingSide": "left",
            "typeCode": 502,
            "typeDescKey": "faceoff",
            "sortOrder": 17,
            "details": {
                "eventOwnerTeamId": 14,
                "losingPlayerId": 8476925,
                "winningPlayerId": 8478519,
                "xCoord": -69,
                "yCoord": 22,
                "zoneCode": "D"
            }
        },
        {
            "eventId": 9,
            "periodDescriptor": {
                "number": 1,
                "periodType": "REG"
            },
            "timeInPeriod": "00:48",
            "timeRemaining": "19:12",
            "situationCode": "1551",
            "homeTeamDefendingSide": "left",
            "typeCode": 503,
            "typeDescKey": "hit",
            "sortOrder": 20,
            "details": {
                "xCoord": 64,
                "yCoord": 42,
                "zoneCode": "D",
                "eventOwnerTeamId": 18,
                "hittingPlayerId": 8474568,
                "hitteePlayerId": 8476453
            }
        },
HarryShomer commented 6 months ago

Looking into it a bit, it looks like the "eventId" values in the JSON are no longer guaranteed to be in order (see the JSON snippet below).

@JB13 Thanks for the heads up. That's a bummer.

Looking at the JSON you provided, I wonder what "sortOrder" represents. It seems to increase with each subsequent event. That might work, though I have no idea what the actual value represents.
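
A minimal sketch of that idea, assuming "sortOrder" really does increase with every play the way it does in the snippet above (its actual meaning is unconfirmed). order_plays is a hypothetical helper, not something that exists in json_pbp.py:

```python
def order_plays(plays):
    """Order plays by "sortOrder" when every play has it.

    Assumption: "sortOrder" increases with each play, as in the snippet.
    Python's sort is stable, so any ties keep the order the feed delivered.
    """
    if all("sortOrder" in play for play in plays):
        return sorted(plays, key=lambda play: play["sortOrder"])
    # Fallback (also an assumption): trust the feed order if the field is missing.
    return list(plays)


# The five plays from the snippet, reduced to the fields used here.
plays = [
    {"eventId": 102, "sortOrder": 8, "typeDescKey": "period-start"},
    {"eventId": 101, "sortOrder": 9, "typeDescKey": "faceoff"},
    {"eventId": 8, "sortOrder": 15, "typeDescKey": "stoppage"},
    {"eventId": 103, "sortOrder": 17, "typeDescKey": "faceoff"},
    {"eventId": 9, "sortOrder": 20, "typeDescKey": "hit"},
]

# Even if the input arrives scrambled, sorting on "sortOrder" restores feed order.
restored = order_plays(list(reversed(plays)))
print([play["typeDescKey"] for play in restored])
```

Whether this holds across periods or for every game would still need checking against full game files.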