Open Spencer-Weston opened 3 years ago
Here, I'm fleshing out in more detail how we will assemble our data for the regression with clusters.
The hand-checking rule was changed for the 2004-2005 season. This is referred to as the 2005 season in some datasets and 2004-2005 in other datasets. We will use data up to the 2018-2019 (or 2019) season.
The cluster work falls into two parts: Generating Clusters and Assigning Clusters. We first create our clusters. Then, we need a method for assigning players to clusters for use in the regression.
We will generate clusters from full season data provided by Weijia's player_stats.csv. Each row in the dataset will be a season of player stats.
Restrict the data to the 2004-2005 to 2018-2019 seasons.
Handle NA values after we reduce our data to this time frame.
TEAM_ABV = 'TOT', TEAM_ID = 0 - These values indicate that a player played for multiple teams in one season. When this is the case, we want to use the values provided in the TOT row, as it holds the player's statistics for the full season.
Use the Player_Attributes table from the Kaggle dataset. Join Player_Attributes to player_stats.csv. Extract the height and weight columns. For null heights and weights, use fillna() with the mean height and weight by position (Forward/Center/Guard).
From here, we just run a clustering model, starting with K-means and perhaps moving to a Gaussian Mixture Model.
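The preparation steps above could be sketched roughly as follows with pandas. This is only a sketch under assumptions: TEAM_ABV comes from the notes above, but PLAYER_ID, SEASON, POSITION, HEIGHT and WEIGHT are hypothetical column names, and the helper name is made up for illustration. Note that the position means here are taken over player-season rows, not over unique players.

```python
import pandas as pd


def prepare_cluster_input(stats: pd.DataFrame, attrs: pd.DataFrame) -> pd.DataFrame:
    """Return one row per player-season with height and weight filled in.

    Assumed (hypothetical) columns:
      stats: PLAYER_ID, SEASON, TEAM_ABV, plus stat columns
      attrs: PLAYER_ID, POSITION, HEIGHT, WEIGHT
    """
    # Flag player-seasons that contain a TOT row (player changed teams).
    is_tot = stats["TEAM_ABV"].eq("TOT")
    has_tot = is_tot.groupby([stats["PLAYER_ID"], stats["SEASON"]]).transform("any")

    # Keep the TOT row where one exists, otherwise the single team row.
    season_rows = stats[is_tot | ~has_tot]

    # Attach position, height and weight from the attributes table.
    merged = season_rows.merge(
        attrs[["PLAYER_ID", "POSITION", "HEIGHT", "WEIGHT"]],
        on="PLAYER_ID",
        how="left",
    )

    # Fill missing height/weight with the mean for the player's position.
    for col in ["HEIGHT", "WEIGHT"]:
        position_mean = merged.groupby("POSITION")[col].transform("mean")
        merged[col] = merged[col].fillna(position_mean)
    return merged
```

The returned frame would then be the input to whichever clustering model we settle on.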
Assigning clusters is trickier than generating clusters. Consider an arbitrary Player_1. What data do we use to assign Player_1 to a cluster for Game_X in Season_Y? If we assign a cluster using Season_Y data, then Game_X has already been played and is included in that season's statistics, so the assignment would use information from the game we are trying to model. The prior season works for players who played in the prior year, but it won't work for rookies or players who were injured in the prior year. An average of the prior ~3 seasons works for injured players, but won't work for rookies.
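To make the trade-off concrete, here is a minimal sketch of the "average of the prior ~3 seasons" option. The function name and the data layout (a dict keyed by season end year) are assumptions for illustration. It only ever reads seasons before Season_Y, and it returns None for rookies, which still need a separate rule.

```python
def prior_seasons_average(history, season, window=3):
    """Average a player's stats over up to `window` seasons before `season`.

    history: dict mapping season end year (e.g. 2019) to a dict of
             stat name -> value (hypothetical layout). Every season is
             assumed to carry the same stat names.
    Returns None when the player has no data in the window (e.g. rookies).
    """
    # Only seasons strictly before the target season are eligible,
    # so Game_X never contributes to its own cluster assignment.
    prior = [history[y] for y in range(season - window, season) if y in history]
    if not prior:
        return None
    return {
        stat: sum(s[stat] for s in prior) / len(prior)
        for stat in prior[0]
    }
```

A player who missed one season to injury still gets an average from the remaining seasons in the window.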
Our model is Y = \beta_1 TeamStats + \beta_2 PlayerClusters. We will start with a regression then move towards more complicated models (random forests, neural nets, etc.).
The Game table from the Kaggle dataset will work for TeamStats. As with players, data for a game on date X needs to be aggregated from games before date X. Also as with players, we run into an issue where we don't have data for Game 0 of a season. I propose that the data for Game 0 is the data from the previous season. Then, we incrementally decrease the weight of the prior season over 10 games. The data for the 11th game of the season is composed of the prior 10 games and none of the previous season.
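One way the proposed down-weighting could look is sketched below. The linear schedule (the prior season loses one tenth of its weight per game played) is an assumption; the proposal only says the weight decreases incrementally over 10 games. The function and parameter names are hypothetical.

```python
def blended_team_stat(prev_season_avg, season_games, ramp=10):
    """Blend last season's average with this season's games so far.

    prev_season_avg: the team's average for the stat in the previous season.
    season_games:    values of the stat from this season's games played
                     BEFORE the game being modeled (may be empty).
    ramp:            number of games over which the prior season's
                     weight falls linearly to zero (assumed schedule).
    """
    played = len(season_games)
    if played == 0:
        # Game 0: nothing from this season yet, use the prior season only.
        return prev_season_avg

    prior_weight = max(0.0, 1.0 - played / ramp)
    current_avg = sum(season_games) / played
    return prior_weight * prev_season_avg + (1.0 - prior_weight) * current_avg
```

With `ramp=10`, the first game of a season uses only the previous season, and once 10 games have been played the previous season drops out entirely.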
From there, we should be good to model.
Creating a Data Set
Goal: Create a data set where each row is keyed by a unique game-team pair and holds team statistics aggregated from the player level.
Data sources: NBA Games Data - This dataset uniquely associates every game with every player and their team from 2004 onward. Relevant data -
Player Statistics - We can get career average statistics from this API endpoint
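As a rough illustration of the goal, the player-to-team aggregation might look like this in pandas. GAME_ID is a hypothetical column name, the helper name is made up, and summing is only one possible aggregation (averages or weighted values may suit some statistics better).

```python
import pandas as pd


def team_game_totals(player_games: pd.DataFrame, stat_cols: list) -> pd.DataFrame:
    """Collapse player-level rows into one row per (game, team).

    player_games: one row per player per game, with GAME_ID and TEAM_ID
                  columns (GAME_ID is an assumed name) plus stat columns.
    stat_cols:    the stat columns to aggregate, summed here.
    """
    return player_games.groupby(["GAME_ID", "TEAM_ID"], as_index=False)[stat_cols].sum()
```

Each output row is then uniquely identified by its GAME_ID and TEAM_ID.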
Plan:
Raw Data
Process Data