Spencer-Weston commented 3 years ago

Creating a Data Set

Goal: Create a data set where each row is keyed by a unique game-team pair and holds team statistics aggregated from the player level.

The player (or unique player ID) will be used to join each player to their associated statistics.
In some form, the player-game combination needs to identify each player's team.
We must have win-loss information for each game.

Data sources: NBA Games Data - This dataset uniquely associates every game with every player and their team from 2004 onward. Relevant data -

game_details.csv, a player id associated with a game for every game and every player
games.csv, holds the game date and home/visitor team IDs
teams.csv, holds identifiers by team_id

Player Statistics - We can get career average statistics from this API endpoint

Returns value in .json format. Will need to be formatted into a dataframe.

Plan:

Raw Data

Download the NBA games data CSVs
Set up API endpoint, download career average statistics, store as .csv
Set up Season Scraper

Process Data

Run Season Scraper from 2004-2019 (NBA games data goes back to 2004; Pandemic affects data past 2019). Aggregate results into a single dataframe that holds every game and the result for the time period.
From NBA Games Data:
- join games.csv to teams.csv as A.csv (teams.csv has identifier needed to join with season scraper data)
- join A.csv to the data generated by season_scraper (the season scraper data lets us know which teams won or lost) as B.csv
- Join B.csv to game_details.csv as C.csv. C.csv holds the home team, away team, game result, and generates the game <--> player relationship
Using unique(players) from C.csv, extract career average statistics for each player. Parse the returned .json file into a dataframe. Save as .csv
Join player statistics to teams by game. (i.e. players a,b,c are associated with the Boston Celtics for game 74 of the 2009 year). This will be a lot of work, so it's hard to completely describe what this process will look like beforehand.
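The A/B/C joins above can be sketched with pandas merges. This is only a sketch under assumptions: the column names (GAME_ID, GAME_DATE, HOME_TEAM_ID, TEAM_ID, ABBREVIATION, HOME_ABBR, HOME_WIN) and the shape of the season scraper output are placeholders, not confirmed against the real files.

```python
import pandas as pd

def build_game_player_table(games, teams, results, game_details):
    """Chain the A -> B -> C joins. All column names are assumed."""
    # A: attach the home team's identifier (here an abbreviation) to each game.
    home = teams[["TEAM_ID", "ABBREVIATION"]].rename(
        columns={"TEAM_ID": "HOME_TEAM_ID", "ABBREVIATION": "HOME_ABBR"}
    )
    a = games.merge(home, on="HOME_TEAM_ID", how="left", validate="many_to_one")

    # B: attach the scraped result, assumed keyed by date and home team.
    b = a.merge(results, on=["GAME_DATE", "HOME_ABBR"], how="left",
                validate="one_to_one")

    # C: one row per player per game, carrying the game-level columns.
    c = game_details.merge(b, on="GAME_ID", how="left", validate="many_to_one")
    return c
```

The `validate` arguments make a merge fail loudly if a key that should be unique is duplicated, which is usually the first thing to go wrong in a chain of joins like this.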

Spencer-Weston commented 3 years ago

Attempt 2

Here, I'm fleshing out more how we will get our data together for the regression with clusters.

Time Frame

The hand-checking rule was changed for the 2004-2005 season. This is referred to as the 2005 season in some datasets and 2004-2005 in other datasets. We will use data through the 2018-2019 (or 2019) season.

Clustering:

This falls into two categories: Generating Clusters and Assigning Clusters. We first create our clusters. Then, we need a method for assigning players to clusters for use in the regression.

Generating Clusters

We will generate clusters from full season data provided by Weijia's player_stats.csv. Each row in the dataset will be a season of player stats.

Required Data Engineering:
- Extract data by time. Include seasons from 2004-2005 to 2018-2019.
  - In player_stats.ipynb, it's noted that there are nulls for several values. We will need to recheck for NA after we reduce our data to this time frame.
- TEAM_ABV = 'TOT', TEAM_ID = 0 - These values indicate that a player played for multiple teams in one season. When this is the case, we want to use the values provided in the TOT row, as it holds the player's statistics for the full season.
- If possible in a reasonable amount of time: Join player_stats.csv to the Player_Attributes table from the Kaggle dataset and extract the height and weight columns. For null heights and weights, use fillna() with the mean height and weight by position (Forward/Center/Guard).
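A rough sketch of the TOT handling and the by-position fill, assuming pandas. TEAM_ID comes from the notes above; PLAYER_ID, SEASON, POSITION, HEIGHT and WEIGHT are assumed column names that may not match the real tables.

```python
import pandas as pd

def keep_full_season_rows(stats):
    """Keep the TOT row (TEAM_ID == 0) for multi-team seasons,
    and the single team row otherwise."""
    is_tot = stats["TEAM_ID"].eq(0)
    # True for every row of a player-season that has a TOT row.
    has_tot = is_tot.groupby([stats["PLAYER_ID"], stats["SEASON"]]).transform("any")
    return stats[is_tot | ~has_tot]

def fill_by_position(df, cols=("HEIGHT", "WEIGHT"), position_col="POSITION"):
    """Fill nulls in each column with that column's mean within the position."""
    out = df.copy()
    for col in cols:
        group_mean = out.groupby(position_col)[col].transform("mean")
        out[col] = out[col].fillna(group_mean)
    return out
```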
Cluster Features:
- DROP: FGM, FG3M, FTM, REB, GP, MIN.
  - FGM, FG3M, FTM are the products (FGA, FG3A, FTA) * (FG_PCT, FG3_PCT, FT_PCT) respectively.
  - REB is the sum DREB + OREB.
  - GP and MIN are replaced by MPG, which is computed from them.
- COMPUTE: Minutes per game (MPG). MPG = MIN/GP (Minutes/games played).
- INCLUDE: MPG, GS, FGA, FG_PCT, FG3A, FG3_PCT, FTA, FT_PCT, OREB, DREB, AST, STL, BLK, TOV, PF, PTS
  - If possible, include height and weight
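The feature list above could be built along these lines. A minimal sketch, assuming the season rows are already in a pandas dataframe with the column names used in the list; treating GP = 0 as missing is my own choice, not something decided above.

```python
import pandas as pd

CLUSTER_FEATURES = [
    "MPG", "GS", "FGA", "FG_PCT", "FG3A", "FG3_PCT", "FTA", "FT_PCT",
    "OREB", "DREB", "AST", "STL", "BLK", "TOV", "PF", "PTS",
]

def build_cluster_features(season_stats):
    """Compute MPG = MIN / GP and keep only the clustering features."""
    out = season_stats.copy()
    # GP == 0 becomes NaN instead of a division by zero (assumed handling).
    out["MPG"] = out["MIN"] / out["GP"].where(out["GP"] > 0)
    # Selecting CLUSTER_FEATURES implicitly drops FGM, FG3M, FTM, REB, GP, MIN.
    return out[CLUSTER_FEATURES]
```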

From here, we just run a clustering model starting with K-means and perhaps move to a Gaussian Mixture Model.
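As a starting point, the K-means then GMM progression might look like the sketch below, assuming scikit-learn. Standardizing the features first is my assumption (the stats are on very different scales), and the number of clusters k is left as a parameter since nothing above fixes it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def fit_clusters(X, k, seed=0):
    """Fit K-means and a Gaussian Mixture Model on standardized features."""
    scaler = StandardScaler().fit(X)
    Z = scaler.transform(X)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(Z)
    return scaler, kmeans, gmm
```

New rows have to go through the same `scaler.transform` before `kmeans.predict` or `gmm.predict_proba`; the GMM gives soft memberships, which may be useful if hard assignments turn out to be unstable.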

Assigning Clusters

Assigning clusters is trickier than generating clusters. Consider an arbitrary Player_1. What data do we use to assign Player_1 to a cluster for Game_X in Season_Y? If we assign a cluster using Season_Y data, then Game_X has already been played and is included in that season's statistics. The prior season works for players that played in the prior year, but it won't work for rookies or players who were injured in the prior year. An average of the prior ~3 seasons works for injured players, but won't work for rookies.

Proposal: Use a running average of each player's prior 82 games. For rookies, use some arbitrary function to assign them a value, such as the 20th percentile, for each statistic in each game until they've played 82 total games and their data is entirely generated by their own performance. This will allow players to change clusters if their performance becomes significantly different as time goes on.
- This can be accomplished with game_details.csv. It's not a trivial DE issue, but I can do the work for that.
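One way to read the proposal is to treat every not-yet-played game in the 82-game window as if it had the 20th-percentile value, so the placeholder fades out as real games accumulate. The sketch below implements that reading with pandas; PLAYER_ID and GAME_DATE are assumed column names, and taking the percentile over all player-games (rather than, say, rookies only) is an assumption.

```python
import pandas as pd

def prior_window_average(games, stat, window=82, rookie_q=0.20):
    """Average of each player's previous `window` games for `stat`,
    padding missing history with the `rookie_q` quantile of the stat."""
    games = games.sort_values(["PLAYER_ID", "GAME_DATE"]).reset_index(drop=True)
    fill = games[stat].quantile(rookie_q)

    # Shift by one so the current game never contributes to its own average.
    prev = games.groupby("PLAYER_ID")[stat].shift(1)
    key = games["PLAYER_ID"]

    def roll(s):
        return s.rolling(window, min_periods=1).sum()

    prior_sum = prev.fillna(0.0).groupby(key).transform(roll)
    prior_n = prev.notna().astype(float).groupby(key).transform(roll)

    # Games missing from the window are filled with the placeholder value.
    games[f"{stat}_PRIOR{window}"] = (prior_sum + (window - prior_n) * fill) / window
    return games
```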

Supervised methods

Our model is Y = \beta_1 TeamStats + \beta_2 PlayerClusters. We will start with a regression then move towards more complicated models (random forests, neural nets, etc.).

Data

The Game table from the Kaggle dataset will work for this. As with players, data for a game on date X needs to be aggregated from games before date X. Also, as with players, we run into an issue where we don't have data for the first game of a season. I propose that the data for the first game = the data from the previous season. Then, we incrementally decrease the weight of the prior season over 10 games. The 11th game of the season is composed of data from the prior 10 games and none of the previous season.

example: game 1 = previous season data (PSD), game 2 = 90% PSD + 10% this season's data (TSD), game 3 = 80% PSD + 20% TSD, game 4 = 70% PSD + 30% TSD, . . ., game 10 = 10% PSD + 90% TSD, game 11 = 0% PSD + 100% TSD
- In an ideal scenario, we'd use some sort of Bayesian model to update the data over time. We don't have time for that, so this should be a reasonable proxy. The R^2 between statistics at the start of a season and at the end of the season is over 0.5 after fewer than 10 games for most statistics.
The features in this data should be the same as the player clusters, or as close as possible.
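The prior-season weighting schedule can be written as a small function. This is a sketch of the linear ramp in the example; the function names and the `ramp_games` parameter are made up for illustration.

```python
def prior_season_weight(game_number, ramp_games=10):
    """Weight on previous season data (PSD) for the given game of the season.

    Game 1 uses 100% PSD; the weight drops by 1/ramp_games per game and
    reaches 0 at game ramp_games + 1.
    """
    if game_number < 1:
        raise ValueError("game_number starts at 1")
    return max(0.0, 1.0 - (game_number - 1) / ramp_games)

def blended_stat(game_number, prior_season_value, this_season_value, ramp_games=10):
    """Blend PSD with this season's data (TSD) from games played so far."""
    w = prior_season_weight(game_number, ramp_games)
    if this_season_value is None:
        # No games played yet this season, so only PSD is available.
        return prior_season_value
    return w * prior_season_value + (1.0 - w) * this_season_value
```

For example, `blended_stat(4, 100.0, 110.0)` weights the prior season at 70% and the first three games of this season at 30%.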

Spencer-Weston commented 3 years ago

1. Matt pushes notebook to the repo.
2. Spencer uses notebook to assign clusters to players in moving_average_player_stats.csv
  - Get a count of cluster membership by team by game into .csv
3. Weijia joins .csv from 2 to team_stats.csv

From there, we should be good to model.
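For the cluster membership count by team by game, something like the following should work with pandas. GAME_ID, TEAM_ID and CLUSTER are assumed column names for the per-player assignments, and the CLUSTER_n output columns are just a naming suggestion.

```python
import pandas as pd

def cluster_counts(assignments, n_clusters):
    """Count players per cluster for every (game, team) pair."""
    counts = (
        assignments.groupby(["GAME_ID", "TEAM_ID", "CLUSTER"])
        .size()
        .unstack("CLUSTER", fill_value=0)
    )
    # Make sure every cluster gets a column, even if it is empty in this data.
    counts = counts.reindex(columns=range(n_clusters), fill_value=0)
    counts.columns = [f"CLUSTER_{c}" for c in counts.columns]
    return counts.reset_index()
```

The result has one row per game-team pair, so it can be written out with `to_csv` and joined to team_stats.csv on the game and team keys.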

Spencer-Weston / W207_Final_NBA_Prediction

Data Pipeline #1
