Spencer-Weston / W207_Final_NBA_Prediction

W207 Final project for predicting NBA games
MIT License

Data Pipeline #1

Open Spencer-Weston opened 2 years ago

Spencer-Weston commented 2 years ago

Creating a Data Set

Goal: Create a data set in which each row is keyed by a unique game-team pair and holds team statistics aggregated from the player level.

Data sources: NBA Games Data - This dataset uniquely associates every game with every player and their team from 2004 onward. Relevant data -

Player Statistics - We can get career average statistics from this API endpoint

Plan:

Raw Data

  1. Download the NBA games data CSVs
  2. Set up the API endpoint, download career average statistics, and store them as .csv
  3. Set up the Season Scraper

Process Data

  1. Run the Season Scraper for 2004-2019 (NBA games data goes back to 2004; the pandemic affects data after 2019). Aggregate the results into a single dataframe that holds every game and its result for the time period.
  2. From NBA Games Data:
    • Join games.csv to teams.csv as A.csv (teams.csv has the identifier needed to join with the season scraper data)
    • Join A.csv to the data generated by season_scraper as B.csv (the season scraper data tells us which teams won or lost)
    • Join B.csv to game_details.csv as C.csv. C.csv holds the home team, away team, and game result, and establishes the game <--> player relationship
  3. Using unique(players) from C.csv, extract career average statistics for each player. Parse the returned .json file into a dataframe. Save as .csv
  4. Join player statistics to teams by game (e.g. players a, b, c are associated with the Boston Celtics for game 74 of the 2009 season). This will be a lot of work, so it's hard to describe the process completely beforehand.
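The join chain in step 2 can be sketched with pandas as below. This is only a rough illustration on toy data: every column name (GAME_ID, HOME_TEAM_ID, HOME_ABBR, HOME_WIN, and so on) and the shape of the season scraper output are assumptions, not the real schema of the files.

```python
import pandas as pd

# Toy stand-ins for the real files. All column names are assumed for illustration.
games = pd.DataFrame({
    "GAME_ID": [1, 2],
    "GAME_DATE": ["2009-01-01", "2009-01-03"],
    "HOME_TEAM_ID": [10, 20],
    "VISITOR_TEAM_ID": [20, 10],
})
teams = pd.DataFrame({"TEAM_ID": [10, 20], "ABBREVIATION": ["BOS", "LAL"]})
scraped = pd.DataFrame({  # hypothetical season_scraper output
    "GAME_DATE": ["2009-01-01", "2009-01-03"],
    "HOME_ABBR": ["BOS", "LAL"],
    "HOME_WIN": [1, 0],
})
game_details = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2],
    "TEAM_ID": [10, 20, 20, 10],
    "PLAYER_ID": [100, 200, 200, 100],
})

# A: attach the identifier that the scraper data is keyed on.
home = teams.rename(columns={"TEAM_ID": "HOME_TEAM_ID", "ABBREVIATION": "HOME_ABBR"})
a = games.merge(home, on="HOME_TEAM_ID", how="left", validate="many_to_one")

# B: attach the game result from the scraper.
b = a.merge(scraped, on=["GAME_DATE", "HOME_ABBR"], how="left", validate="one_to_one")

# C: fan out to one row per (game, player), giving the game <--> player relationship.
c = b.merge(game_details, on="GAME_ID", how="left", validate="one_to_many")

# Step 3 starts from the unique players found in C.
players = c["PLAYER_ID"].unique()
```

The `validate` argument makes each merge fail loudly if a key that should be unique is duplicated, which is a cheap guard against silently multiplying rows.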
Spencer-Weston commented 2 years ago

Attempt 2

Here, I'm fleshing out in more detail how we will assemble our data for the regression with clusters.

Time Frame

The hand-checking rule changed for the 2004-2005 season, which some datasets label 2005 and others label 2004-2005. We will use data through the 2018-2019 season (labeled 2019 under the single-year convention).

Clustering:

This falls into two categories: Generating Clusters and Assigning Clusters. We first create our clusters. Then, we need a method for assigning players to clusters for use in the regression.

Generating Clusters

We will generate clusters from full season data provided by Weijia's player_stats.csv. Each row in the dataset will be a season of player stats.

From here, we just run a clustering model starting with K-means and perhaps move to a Gaussian Mixture Model.
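A minimal sketch of that progression, assuming scikit-learn and using synthetic player-season rows in place of player_stats.csv. The three feature columns, the two player groups, and the choice of two clusters are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic player-season rows: [points, rebounds, assists] per game (assumed features).
guards = rng.normal([10, 2, 6], 1.0, size=(30, 3))
bigs = rng.normal([12, 10, 1], 1.0, size=(30, 3))
X = np.vstack([guards, bigs])

# Scale first so no single statistic dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# K-means gives each player-season exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
hard_labels = kmeans.labels_

# A Gaussian Mixture Model gives a membership probability per cluster instead.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_scaled)
soft_labels = gmm.predict_proba(X_scaled)
```

The practical difference for the regression is that K-means yields hard cluster counts per team, while the mixture model would allow fractional memberships to be summed per team.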

Assigning Clusters

Assigning clusters is trickier than generating them. Consider an arbitrary Player_1. What data do we use to assign Player_1 to a cluster for Game_X in Season_Y? If we assign a cluster using Season_Y data, Game_X is already included in that season's statistics, so the game we want to predict leaks into the predictor. The prior season works for players who played in the prior year, but not for rookies or players who were injured that year. An average of the prior ~3 seasons works for injured players, but still not for rookies.
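One way to express the "prior seasons only" rule is sketched below. The function name, the dictionary layout, and the decision to return None for rookies are assumptions for illustration; the rookie case is left unresolved here, just as it is in the discussion above.

```python
def prior_season_features(history, season, window=3):
    """Average a player's stats over the seasons strictly before `season`.

    history: {season_year: [stat, ...]} for one player (hypothetical layout).
    Only the `window` seasons preceding `season` are considered, so nothing
    from `season` itself can leak into the features.
    """
    prior = [history[s] for s in range(season - window, season) if s in history]
    if not prior:
        return None  # rookie or long absence: no usable history
    n = len(prior)
    return [sum(column) / n for column in zip(*prior)]


# A player who missed 2008 through injury still gets features for 2010.
history = {2007: [10.0, 4.0], 2009: [20.0, 6.0]}
features_2010 = prior_season_features(history, 2010)
features_rookie = prior_season_features(history, 2007)
```

Averaging whatever seasons exist in the window covers both the healthy and the injured-player cases with a single rule.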

Supervised methods

Our model is Y = \beta_1 TeamStats + \beta_2 PlayerClusters, with a separate coefficient vector for each feature group. We will start with a regression, then move toward more complicated models (random forests, neural nets, etc.).

Data

The Game table from the Kaggle dataset will work for this. As with players, data for a game on date X must be aggregated only from games before date X. Also as with players, we have no current-season data for game 0 of a season. I propose that the data for game 0 equal the data from the previous season, and that we then decrease the weight of the prior season incrementally over 10 games. The 11th game of the season is then described by the prior 10 games alone, with no weight on the previous season.
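The proposed weighting can be sketched as follows. The linear decay (one tenth per game played) and the rolling 10-game window after the ramp are assumptions; the proposal only fixes the endpoints, full weight on the prior season at game 0 and none by the 11th game.

```python
def blended_stat(prior_season_avg, current_games, ramp=10):
    """Pre-game estimate of a team stat from prior-season and current-season data.

    prior_season_avg: the team's average for the stat last season.
    current_games: this season's per-game values, oldest first, for games
                   played strictly before the game being predicted.
    ramp: number of games over which the prior season's weight decays to zero.
    """
    played = len(current_games)
    if played == 0:
        return prior_season_avg  # game 0: only last season is available
    recent = current_games[-ramp:]  # assumed rolling window of the last `ramp` games
    current_avg = sum(recent) / len(recent)
    w_prior = max(0.0, 1.0 - played / ramp)  # assumed linear decay
    return w_prior * prior_season_avg + (1.0 - w_prior) * current_avg


opening_night = blended_stat(100.0, [])         # all weight on last season
after_five = blended_stat(100.0, [110.0] * 5)   # half and half
eleventh_game = blended_stat(100.0, [110.0] * 10)  # current season only
```

For example, a team that averaged 100 points last season and 110 over its first five games this season gets a pre-game estimate of 105 for its sixth game.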

Spencer-Weston commented 2 years ago

  1. Matt pushes the notebook to the repo.
  2. Spencer uses the notebook to assign clusters to players in moving_average_player_stats.csv
    • Get a count of cluster membership by team by game and write it to a .csv
  3. Weijia joins the .csv from step 2 to team_stats.csv

From there, we should be good to model.
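Steps 2 and 3 could look roughly like this in pandas. The toy frames stand in for the cluster-assigned player rows and team_stats.csv, and every column name (GAME_ID, TEAM_ID, CLUSTER, PTS_AVG) is an assumption for illustration.

```python
import pandas as pd

# One row per player per game, with the cluster already assigned (toy data).
assigned = pd.DataFrame({
    "GAME_ID": [1, 1, 1, 1, 2, 2],
    "TEAM_ID": [10, 10, 20, 20, 10, 10],
    "PLAYER_ID": [100, 101, 200, 201, 100, 102],
    "CLUSTER": [0, 1, 1, 1, 0, 0],
})

# Step 2: count cluster membership by team by game, one column per cluster.
counts = (
    assigned.groupby(["GAME_ID", "TEAM_ID", "CLUSTER"])
    .size()
    .unstack("CLUSTER", fill_value=0)
    .add_prefix("cluster_")
    .reset_index()
)
counts.columns.name = None

# Step 3: join the counts onto the team-level stats (stand-in for team_stats.csv).
team_stats = pd.DataFrame({
    "GAME_ID": [1, 1, 2],
    "TEAM_ID": [10, 20, 10],
    "PTS_AVG": [101.5, 98.0, 103.2],
})
model_table = team_stats.merge(counts, on=["GAME_ID", "TEAM_ID"], how="left")
```

Each row of `model_table` then carries both the team statistics and the cluster counts, which matches the two feature groups in the proposed model.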