erilu / web-scraping-NBA-statistics

Use Python to scrape ESPN for stats on all players in the NBA. Obtain and organize data, calculate statistics, and model using urllib, re, pandas, and scikit-learn.
https://erilu.github.io/web-scraping-NBA-statistics/

Web scraping in Python: gathering and analyzing NBA player stats

Erick Lu

March 31, 2020 - Jupyter Notebook

Introduction

In this project, I use Python to “scrape” ESPN for stats on all the players in the NBA, clean and organize the data into a data science-friendly format, and calculate some interesting statistics. Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data for download.

As an example, I will be scraping data from the rosters of each team in the NBA for information such as player age, height, weight, and salary. I will also loop through each individual player's stats page and extract career averages such as points per game, free throw percentages, and more (current as of March 2020).

We can use this data to answer questions such as:

I've exported the data to a nicely organized csv file, accessible in the GitHub repo for this project, in case you would like to analyze it yourself. You can also run the python script scrape_nba_statistics.py to re-scrape ESPN for up-to-date data.

In the following sections, I will describe how to loop through ESPN page sources using urllib, extract information using re (regular expressions), organize player statistics in pandas DataFrames, and perform some simple modeling using scikit-learn.

Scrape roster information for each NBA team

We will first take a look at the structure of the website and figure out which web pages we need to scrape information from. The teams page at https://www.espn.com/nba/teams looks like the following:

ESPN_teams_webpage.png

This looks very promising. All the teams are listed on this page, which means that they can easily be extracted from the page source. Let’s take a look at the page source to see if we can find URLs for each team's roster:

ESPN_teams_source.png

It looks like the URLs for each of the teams' rosters are listed in the page source in the following format: https://www.espn.com/nba/team/roster/_/name/team/team-name, as shown in the highlighted portion of the image above. Given that these all follow the same format, we can use regular expressions to pull out a list of all the team names from the page source, and then construct the roster URLs using the format above. Start by importing the urllib and re packages in Python:

import re
import urllib.request
from time import sleep

Now, let’s create a function that will extract all the team names from https://www.espn.com/nba/teams and construct roster URLs for each of the teams:

# This function finds the URLs for each of the rosters in the NBA using regexes.
def build_team_urls():
    # Open the ESPN teams webpage and extract the abbreviation and name of each team.
    f = urllib.request.urlopen('https://www.espn.com/nba/teams')
    teams_source = f.read().decode('utf-8')
    # Raw string so that regex escapes such as \. and \w are not treated as string escapes.
    teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', teams_source))
    # Using the abbreviation and name of each team, build the URL of its roster.
    roster_urls = {}
    for abbreviation, team_name in teams.items():
        # Each roster webpage follows this general pattern.
        roster_urls[team_name] = 'https://www.espn.com/nba/team/roster/_/name/' + abbreviation + '/' + team_name
    return roster_urls
rosters = build_team_urls()
rosters
{'atlanta-hawks': 'https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks',
 'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'charlotte-hornets': 'https://www.espn.com/nba/team/roster/_/name/cha/charlotte-hornets',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'dallas-mavericks': 'https://www.espn.com/nba/team/roster/_/name/dal/dallas-mavericks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'golden-state-warriors': 'https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors',
 'houston-rockets': 'https://www.espn.com/nba/team/roster/_/name/hou/houston-rockets',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'la-clippers': 'https://www.espn.com/nba/team/roster/_/name/lac/la-clippers',
 'los-angeles-lakers': 'https://www.espn.com/nba/team/roster/_/name/lal/los-angeles-lakers',
 'memphis-grizzlies': 'https://www.espn.com/nba/team/roster/_/name/mem/memphis-grizzlies',
 'miami-heat': 'https://www.espn.com/nba/team/roster/_/name/mia/miami-heat',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'minnesota-timberwolves': 'https://www.espn.com/nba/team/roster/_/name/min/minnesota-timberwolves',
 'new-orleans-pelicans': 'https://www.espn.com/nba/team/roster/_/name/no/new-orleans-pelicans',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'oklahoma-city-thunder': 'https://www.espn.com/nba/team/roster/_/name/okc/oklahoma-city-thunder',
 'orlando-magic': 'https://www.espn.com/nba/team/roster/_/name/orl/orlando-magic',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'phoenix-suns': 'https://www.espn.com/nba/team/roster/_/name/phx/phoenix-suns',
 'portland-trail-blazers': 'https://www.espn.com/nba/team/roster/_/name/por/portland-trail-blazers',
 'sacramento-kings': 'https://www.espn.com/nba/team/roster/_/name/sac/sacramento-kings',
 'san-antonio-spurs': 'https://www.espn.com/nba/team/roster/_/name/sa/san-antonio-spurs',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'utah-jazz': 'https://www.espn.com/nba/team/roster/_/name/utah/utah-jazz',
 'washington-wizards': 'https://www.espn.com/nba/team/roster/_/name/wsh/washington-wizards'}

The function build_team_urls() returns a dictionary that matches team names with their corresponding roster URL. Given this information, we can systematically loop through all of the rosters and use regular expressions to extract player information for each team.
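Because build_team_urls() depends on the live ESPN markup, it can help to check the regular expression against a small hard-coded string first. The sketch below does that. The `sample_source` snippet is made up for illustration and only imitates the link format described above, and `roster_url_map` is a hypothetical helper name.

```python
import re

# Made-up snippet that imitates the team links described above (an assumption,
# not a copy of ESPN's actual page source).
sample_source = (
    '"href":"https://www.espn.com/nba/team/_/name/gs/golden-state-warriors",'
    '"href":"https://www.espn.com/nba/team/_/name/bos/boston-celtics",'
)

TEAM_REGEX = r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",'
ROSTER_PREFIX = "https://www.espn.com/nba/team/roster/_/name/"


def roster_url_map(page_source):
    """Map each team name found in page_source to its roster URL."""
    pairs = re.findall(TEAM_REGEX, page_source)  # [(abbreviation, team-name), ...]
    return {name: ROSTER_PREFIX + abbrev + "/" + name for abbrev, name in pairs}


sample_rosters = roster_url_map(sample_source)
print(sample_rosters)
```

If ESPN changes its markup and the pattern stops matching, a check like this on a freshly saved copy of the page source should make the failure obvious before any roster pages are requested.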

In order to figure out how to scrape the rosters, let’s take a look at the Golden State Warriors' roster page as an example:

GSW_roster_webpage.png

Information for each player is nicely laid out in a table, meaning that the data is likely obtainable using regular expressions. Taking a look at the page source reveals that each player’s name and information are all provided in blocks of what appear to be JSON, highlighted below:

GSW_roster_source.png

Given the standardized format of the data for each player, this information is indeed extractable using regular expressions. First, we should read in the roster webpage using urllib.request.urlopen:

url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
f = urllib.request.urlopen(url)
roster_source = f.read().decode('utf-8')
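Some sites reject requests that carry urllib's default User-Agent, or stall without responding. Whether ESPN does either is not something this walkthrough establishes, so treat the following as an optional sketch: `build_request` and `fetch_page` are hypothetical helpers, and the header value, timeout, and delay are arbitrary choices.

```python
import urllib.request
from time import sleep

# Arbitrary illustrative value (an assumption, not a requirement of ESPN).
USER_AGENT = "Mozilla/5.0 (compatible; nba-stats-tutorial)"


def build_request(url):
    """Create a Request object that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_page(url, timeout=10, delay=0.5):
    """Download url, decode it as UTF-8, then pause briefly between requests."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        source = response.read().decode("utf-8")
    sleep(delay)  # be polite to the server when calling this in a loop
    return source
```

With these helpers, `roster_source = fetch_page(url)` would stand in for the urlopen and read calls above.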

Then, we construct the regex that will return information for each of the players on the roster webpage.

player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
player_regex
player_info = re.findall(player_regex, roster_source)
player_info[0:4]
[('Ky Bowman',
  '"uid":"s:40~l:46~a:4065635","guid":"d0ef63e951bb5f842b7357521697dc62","id":"4065635","height":"6\' 1\\"","weight":"187 lbs","age":22,"position":"PG","jersey":"12","salary":"$350,189","birthDate":"06/17/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/4065635.png","lastName":"Ky Bowman","experience":0,"college":"Boston College"'),
 ('Marquese Chriss',
  '"uid":"s:40~l:46~a:3907487","guid":"a320ecf1d6481b7518ddc1dc576c27b4","id":"3907487","height":"6\' 9\\"","weight":"240 lbs","age":22,"position":"C","jersey":"32","salary":"$654,469","birthDate":"07/02/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3907487.png","lastName":"Marquese Chriss","experience":3,"college":"Washington","birthPlace":"Sacramento, CA"'),
 ('Stephen Curry',
  '"uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975","height":"6\' 3\\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30","salary":"$40,231,758","birthDate":"03/14/88","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3975.png","lastName":"Stephen Curry","experience":10,"college":"Davidson","birthPlace":"Akron, OH"'),
 ('Draymond Green',
  '"uid":"s:40~l:46~a:6589","guid":"de360720e41625f28a6bb5ff82616cb1","id":"6589","height":"6\' 6\\"","weight":"230 lbs","age":30,"position":"PF","jersey":"23","salary":"$18,539,130","birthDate":"03/04/90","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/6589.png","lastName":"Draymond Green","experience":7,"college":"Michigan State","birthPlace":"Saginaw, MI"')]

As you can see, player_info is a list of tuples, in which each player name is paired with a set of information (height, weight, age, etc.) that is organized in json format. We can use the json package in Python to convert the information into a Python dictionary:

import json
draymond = json.loads("{"+player_info[3][1]+"}")
draymond
{'age': 30,
 'birthDate': '03/04/90',
 'birthPlace': 'Saginaw, MI',
 'college': 'Michigan State',
 'experience': 7,
 'guid': 'de360720e41625f28a6bb5ff82616cb1',
 'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/6589.png',
 'height': '6\' 6"',
 'id': '6589',
 'jersey': '23',
 'lastName': 'Draymond Green',
 'position': 'PF',
 'salary': '$18,539,130',
 'uid': 's:40~l:46~a:6589',
 'weight': '230 lbs'}

In the example above, all of the pertinent information for Draymond Green is now stored into a Python dictionary named draymond. Let's use the snippets of code above to construct a function which loops through each player in a given roster and scrapes their information:

def get_player_info(roster_url):
    f = urllib.request.urlopen(roster_url)
    roster_source = f.read().decode('utf-8')
    # Pause briefly so that looping over all teams does not hammer the server.
    sleep(0.5)
    player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
    player_info = re.findall(player_regex, roster_source)
    player_dict = dict()
    for player in player_info:
        # player[0] is the name; player[1] is the JSON fragment with the remaining fields.
        player_dict[player[0]] = json.loads("{"+player[1]+"}")
    return player_dict
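One caveat: the name group `(\w+\s\w+)` in player_regex matches exactly two runs of word characters separated by a single whitespace character, so names containing a hyphen, apostrophe, period, or suffix would be skipped. The sketch below illustrates this on made-up JSON fragments (not real ESPN output) and shows a looser pattern, `([^"]+)`, as one possible alternative. Whether that pattern suits ESPN's actual markup is an assumption.

```python
import json
import re

# Made-up fragments that imitate the roster JSON shown above.
sample_source = (
    '{"name":"Stephen Curry","href":"https://www.espn.com/nba/player/_/id/1/a","age":32},'
    '{"name":"Karl-Anthony Towns","href":"https://www.espn.com/nba/player/_/id/2/b","age":24}'
)

strict_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
# Looser alternative: accept any run of characters other than a double quote.
loose_regex = r'\{"name":"([^"]+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'

strict_names = [name for name, _ in re.findall(strict_regex, sample_source)]
loose_players = {
    name: json.loads("{" + fields + "}")
    for name, fields in re.findall(loose_regex, sample_source)
}

print(strict_names)         # the hyphenated name is missing
print(list(loose_players))  # both names are captured
```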

We can now loop through each team in rosters and run get_player_info(), storing the output in a dictionary called all_players:

all_players = dict()
for team in rosters.keys():
    print("Gathering player info for team: " + team)
    all_players[team] = get_player_info(rosters[team])
Gathering player info for team: boston-celtics
Gathering player info for team: brooklyn-nets
Gathering player info for team: new-york-knicks
Gathering player info for team: philadelphia-76ers
Gathering player info for team: toronto-raptors
Gathering player info for team: chicago-bulls
Gathering player info for team: cleveland-cavaliers
Gathering player info for team: detroit-pistons
Gathering player info for team: indiana-pacers
Gathering player info for team: milwaukee-bucks
Gathering player info for team: atlanta-hawks
Gathering player info for team: charlotte-hornets
Gathering player info for team: miami-heat
Gathering player info for team: orlando-magic
Gathering player info for team: washington-wizards
Gathering player info for team: denver-nuggets
Gathering player info for team: minnesota-timberwolves
Gathering player info for team: oklahoma-city-thunder
Gathering player info for team: portland-trail-blazers
Gathering player info for team: utah-jazz
Gathering player info for team: golden-state-warriors
Gathering player info for team: la-clippers
Gathering player info for team: los-angeles-lakers
Gathering player info for team: phoenix-suns
Gathering player info for team: sacramento-kings
Gathering player info for team: dallas-mavericks
Gathering player info for team: houston-rockets
Gathering player info for team: memphis-grizzlies
Gathering player info for team: new-orleans-pelicans
Gathering player info for team: san-antonio-spurs

After running this code, the all_players dictionary should be a dictionary of dictionaries of dictionaries. This sounds complicated, but let's walk through what it looks like. The first level of keys should correspond to teams:

all_players.keys()
dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])

Within a team, the keys should correspond to player names. Let's zoom in on the LA Lakers:

all_players["los-angeles-lakers"].keys()
dict_keys(['Kostas Antetokounmpo', 'Avery Bradley', 'Devontae Cacok', 'Alex Caruso', 'Quinn Cook', 'Anthony Davis', 'Jared Dudley', 'Danny Green', 'Dwight Howard', 'LeBron James', 'Kyle Kuzma', 'JaVale McGee', 'Markieff Morris', 'Rajon Rondo', 'Dion Waiters'])

Now we can choose which player to look at. Let's choose LeBron James as an example:

all_players["los-angeles-lakers"]["LeBron James"]
{'age': 35,
 'birthDate': '12/30/84',
 'birthPlace': 'Akron, OH',
 'experience': 16,
 'guid': '1f6592b3ff53d3218dc56038d48c1786',
 'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/1966.png',
 'height': '6\' 9"',
 'id': '1966',
 'jersey': '23',
 'lastName': 'LeBron James',
 'position': 'SF',
 'salary': '$37,436,858',
 'uid': 's:40~l:46~a:1966',
 'weight': '250 lbs'}

A dictionary with information about LeBron James is returned. We can extract information even more precisely by specifying which field we are interested in. Let's get his salary:

all_players["los-angeles-lakers"]["LeBron James"]["salary"]
'$37,436,858'
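Not every player has every field, so chained square-bracket lookups can raise a KeyError. A minimal sketch of a more forgiving lookup with dict.get() follows. The `sample_players` data is a hand-typed excerpt in the same shape as all_players, and `lookup` is a hypothetical helper.

```python
# Hand-typed excerpt in the same shape as all_players (team -> player -> fields).
sample_players = {
    "los-angeles-lakers": {
        "LeBron James": {"age": 35, "salary": "$37,436,858"},
    },
    "golden-state-warriors": {
        "Chasson Randle": {"age": 27},  # no "salary" key for this player
    },
}


def lookup(players, team, player, field, default=None):
    """Return players[team][player][field], or default if any level is missing."""
    return players.get(team, {}).get(player, {}).get(field, default)


print(lookup(sample_players, "los-angeles-lakers", "LeBron James", "salary"))
print(lookup(sample_players, "golden-state-warriors", "Chasson Randle", "salary"))
```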

In order to make data analysis easier, we can re-format this dictionary into a pandas DataFrame. The function pd.DataFrame.from_dict() can turn a dictionary of dictionaries into a pandas DataFrame, as demonstrated below:

import pandas as pd
gsw = pd.DataFrame.from_dict(all_players["golden-state-warriors"], orient = "index")
gsw
uid guid id height weight age position jersey salary birthDate headshot lastName experience college birthPlace hand
Alen Smailagic s:40~l:46~a:4401415 6ed3f8924bfef2e70329ebd6a104ecae 4401415 6' 10" 215 lbs 19 PF 6 $898,310 08/18/00 https://a.espncdn.com/i/headshots/nba/players/... Alen Smailagic 0 NaN NaN NaN
Andrew Wiggins s:40~l:46~a:3059319 064c19d065276a21ca99fdfb296fe05d 3059319 6' 7" 197 lbs 25 SF 22 $27,504,630 02/23/95 https://a.espncdn.com/i/headshots/nba/players/... Andrew Wiggins 5 Kansas Thornhill, ON NaN
Chasson Randle s:40~l:46~a:2580898 71b7154a3d81842448b623ee3e65d586 2580898 6' 2" 185 lbs 27 PG 25 NaN 02/05/93 https://a.espncdn.com/i/headshots/nba/players/... Chasson Randle 2 Stanford Rock Island, IL NaN
Damion Lee s:40~l:46~a:2595209 41fafb6d47a66d8f79f94161918541a4 2595209 6' 5" 210 lbs 27 SG 1 $842,327 10/21/92 https://a.espncdn.com/i/headshots/nba/players/... Damion Lee 2 Louisville NaN L
Draymond Green s:40~l:46~a:6589 de360720e41625f28a6bb5ff82616cb1 6589 6' 6" 230 lbs 30 PF 23 $18,539,130 03/04/90 https://a.espncdn.com/i/headshots/nba/players/... Draymond Green 7 Michigan State Saginaw, MI NaN
Eric Paschall s:40~l:46~a:3133817 b67e5e0fa5cb209355845d165a49407e 3133817 6' 6" 255 lbs 23 PF 7 $898,310 11/04/96 https://a.espncdn.com/i/headshots/nba/players/... Eric Paschall 0 Villanova North Tarrytown, NY NaN
Jordan Poole s:40~l:46~a:4277956 4b0492b5a52f267fe84098ef6d2e2bdf 4277956 6' 4" 194 lbs 20 SG 3 $1,964,760 06/19/99 https://a.espncdn.com/i/headshots/nba/players/... Jordan Poole 0 Michigan Milwaukee, WI B
Kevon Looney s:40~l:46~a:3155535 10a8e77b877324c69966f0c4618caad6 3155535 6' 9" 222 lbs 24 PF 5 $4,464,226 02/06/96 https://a.espncdn.com/i/headshots/nba/players/... Kevon Looney 4 UCLA Milwaukee, WI NaN
Klay Thompson s:40~l:46~a:6475 3411530a7ab7e8dce4f165d59a559520 6475 6' 6" 215 lbs 30 SG 11 $32,742,000 02/08/90 https://a.espncdn.com/i/headshots/nba/players/... Klay Thompson 8 Washington State Los Angeles, CA NaN
Ky Bowman s:40~l:46~a:4065635 d0ef63e951bb5f842b7357521697dc62 4065635 6' 1" 187 lbs 22 PG 12 $350,189 06/17/97 https://a.espncdn.com/i/headshots/nba/players/... Ky Bowman 0 Boston College NaN NaN
Marquese Chriss s:40~l:46~a:3907487 a320ecf1d6481b7518ddc1dc576c27b4 3907487 6' 9" 240 lbs 22 C 32 $654,469 07/02/97 https://a.espncdn.com/i/headshots/nba/players/... Marquese Chriss 3 Washington Sacramento, CA NaN
Mychal Mulder s:40~l:46~a:3936298 f5a46c489e9aee6a1a74f67f9494132f 3936298 6' 4" 184 lbs 25 G 12 NaN 06/12/94 https://a.espncdn.com/i/headshots/nba/players/... Mychal Mulder 0 Kentucky Toronto, ON NaN
Stephen Curry s:40~l:46~a:3975 5dda51f150c966e12026400b73f34fad 3975 6' 3" 185 lbs 32 PG 30 $40,231,758 03/14/88 https://a.espncdn.com/i/headshots/nba/players/... Stephen Curry 10 Davidson Akron, OH NaN

In the DataFrame above, each parameter, such as 'age' and 'salary', is organized in a column, and each player is a row. This makes the data much easier to read and understand. It also fills in null values where data is missing. For example, Chasson Randle's salary is missing from the website, so NaN is automatically placed in the DataFrame.
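The NaN filling is easy to see on a toy input. In the sketch below, `toy_roster` is an invented two-player dictionary (the names and values are placeholders, not scraped data) in which one player lacks a salary entry.

```python
import pandas as pd

# Invented two-player roster; "Player B" deliberately has no "salary" key.
toy_roster = {
    "Player A": {"age": 30, "salary": "$1,000,000"},
    "Player B": {"age": 25},
}

# orient="index" makes each outer key a row and each inner key a column.
toy_df = pd.DataFrame.from_dict(toy_roster, orient="index")

print(toy_df)
print(toy_df["salary"].isna())  # True only for the player without a salary
```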

DataFrames allow us to quickly make calculations, sort players based on their stats, and compare stats between teams. To make a DataFrame containing data from all the teams, we will loop through each team in all_players, construct DataFrames, label them with a team column, and aggregate them into a single DataFrame called all_players_df.

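team_frames = []
# loop through each team, create a pandas DataFrame, and collect it in a list
for team in all_players.keys():
    team_df = pd.DataFrame.from_dict(all_players[team], orient = "index")
    team_df['team'] = team
    team_frames.append(team_df)
# DataFrame.append() was removed in pandas 2.0, so combine the frames with pd.concat().
# sort=True sorts the columns alphabetically when the team frames do not all share the same columns.
all_players_df = pd.concat(team_frames, sort=True)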
all_players_df.head(5)
age birthDate birthPlace college experience guid hand headshot height id jersey lastName position salary team uid weight
Brad Wanamaker 30 07/25/89 Philadelphia, PA Pittsburgh 1 5aad35bbbb760e3958107639266768ae NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 3" 6507 9 Brad Wanamaker PG $1,445,697 boston-celtics s:40~l:46~a:6507 210 lbs
Carsen Edwards 22 03/12/98 Houston, TX Purdue 0 4b8ebdfd01221567925035c1e0d0c337 NaN https://a.espncdn.com/i/headshots/nba/players/... 5' 11" 4066407 4 Carsen Edwards PG $1,228,026 boston-celtics s:40~l:46~a:4066407 200 lbs
Daniel Theis 27 04/04/92 Germany NaN 2 ce75206c087f83ace6f9a8e3efbd9671 NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 8" 2451037 27 Daniel Theis C $5,000,000 boston-celtics s:40~l:46~a:2451037 245 lbs
Enes Kanter 27 05/20/92 Switzerland Kentucky 8 1e039b407b3daa6eeac69432aa6413fd NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 10" 6447 11 Enes Kanter C $4,767,000 boston-celtics s:40~l:46~a:6447 250 lbs
Gordon Hayward 30 03/23/90 Indianapolis, IN Butler 9 56f675cb8f40a5aaee5f5747ec9099c5 NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 7" 4249 20 Gordon Hayward SF $32,700,690 boston-celtics s:40~l:46~a:4249 225 lbs

Now, all_players_df is a DataFrame with all the players in the NBA categorized by team. It contains player information such as age, salary, height, weight, etc. I'll export this data to a csv file, in case you readers out there want to read it in and play around with it yourself.

all_players_df.to_csv("NBA_roster_info_all_players_mar2020.csv")

Scrape player stats for career averages

We also want to scrape data corresponding to the performance of each player, in terms of points per game, field goal percentage, rebounds per game, etc. Our goal is to append this information to all_players_df so that we can compare player performance with traits such as height, salary, etc. We can find performance stats at each player's personal page on ESPN:

curry_stats_webpage.png

We'll want to extract the career stats in the bottom row, which can be found in the highlighted section of the source code below:

curry_stats_source.png

In order to extract the information above for each player in our DataFrame, we can construct URLs for player stats pages using the id column. Fortunately, the URL is standardized and very easy to construct. For example, using the id value of 3975 for Stephen Curry, the URL to open would be: https://www.espn.com/nba/player/stats/_/id/3975. Below is an example of extracting his career stats using regexes:

url = "https://www.espn.com/nba/player/stats/_/id/3975"
f = urllib.request.urlopen(url)
sleep(0.3)
player_source = f.read().decode('utf-8')
# extract career stats using this regex
stats_regex = r'\["Career","",(.*?)\]\},\{"ttl":"Regular Season Totals"'
career_info = re.findall(stats_regex, player_source)
print(career_info)
['"699","693","34.3","8.1-17.1","47.6","3.6-8.2","43.5","3.7-4.0","90.6","0.7","3.8","4.5","6.6","0.2","1.7","2.5","3.1","23.5"']

We observe that some of the stats are complex and contain non-numerical symbols such as "-". In the example above, the range "3.7-4.0" is for the column "FT", which stands for "Free Throws Made-Attempted Per Game". We should split this up into two categories, "Free Throws Made (FTM)" and "Free Throws Attempted (FTA)", and do the same for field goals and 3 pointers. To do so, we can split the string based on "-" and then un-nest the list. We also need to convert the strings to floating point values.

from itertools import chain
career_info = career_info[0].replace("\"", "").split(",")
career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
career_info = list(map(float,career_info))
print(career_info)
[699.0, 693.0, 34.3, 8.1, 17.1, 47.6, 3.6, 8.2, 43.5, 3.7, 4.0, 90.6, 0.7, 3.8, 4.5, 6.6, 0.2, 1.7, 2.5, 3.1, 23.5]
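Since the same three steps are repeated for every player, they could be wrapped in a small function. `parse_career_row` is a hypothetical name, and the sketch assumes that no stat is negative, because a leading minus sign would also be split on "-".

```python
from itertools import chain


def parse_career_row(raw_row):
    """Turn one quoted, comma-separated stats string into a flat list of floats."""
    fields = raw_row.replace('"', "").split(",")
    # "8.1-17.1" (made-attempted) becomes two separate values.
    flat = chain.from_iterable(field.split("-") for field in fields)
    return [float(value) for value in flat]


# Shortened example in the same format as the output above.
print(parse_career_row('"699","693","34.3","8.1-17.1","47.6"'))
```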

Now we can loop through each player in all_players_df, open their stats webpage, extract their career stats, and store the stats in a separate data frame called career_stats_df using the code below:

stat_columns = ["GP","GS","MIN","FGM", "FGA","FG%","3PTM","3PTA","3P%","FTM","FTA","FT%","OR","DR","REB","AST","BLK","STL","PF","TO","PTS"]
# Raw string so that regex escapes are not treated as string escapes.
stats_regex = r'\["Career","",(.*?)\]\},\{"ttl":"Regular Season Totals"'
career_stats = {}
for player_index in all_players_df.index:
    url = "https://www.espn.com/nba/player/stats/_/id/" + str(all_players_df.loc[player_index]['id'])
    f = urllib.request.urlopen(url)
    sleep(0.3)
    player_source = f.read().decode('utf-8')
    career_info = re.findall(stats_regex, player_source)
    try:
        # convert the stats to a list of floats, and store the entry under the player's name
        career_info = career_info[0].replace("\"", "").split(",")
        career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
        career_info = list(map(float,career_info))
        if len(career_info) != len(stat_columns):
            raise ValueError("unexpected number of stats")
        career_stats[player_index] = career_info
    except (IndexError, ValueError):
        # if no career stats were returned, the player was a rookie with no games played
        print(player_index + " has no info, ", end = "")
# DataFrame.append() was removed in pandas 2.0, so build the DataFrame once from the collected rows.
career_stats_df = pd.DataFrame.from_dict(career_stats, orient = "index", columns = stat_columns)

Some player webpages did not have career stats; these corresponded to rookies who had not played any games. This raised an error in the loop, so I used a try/except clause to skip those players and continue scraping the remaining ones. The stats are stored in the object career_stats_df:

career_stats_df.head(5)
GP GS MIN FGM FGA FG% 3PTM 3PTA 3P% FTM ... FT% OR DR REB AST BLK STL PF TO PTS
Brad Wanamaker 99.0 1.0 15.7 1.9 4.3 44.0 0.5 1.3 38.0 1.3 ... 91.7 0.2 1.4 1.7 2.2 0.1 0.6 1.6 0.9 5.6
Carsen Edwards 35.0 0.0 9.0 1.1 3.2 32.7 0.6 1.9 30.9 0.3 ... 84.6 0.2 1.1 1.2 0.6 0.1 0.3 0.9 0.4 3.0
Daniel Theis 187.0 62.0 17.2 2.6 4.7 55.4 0.4 1.1 34.0 1.1 ... 75.3 1.6 3.1 4.7 1.2 0.9 0.5 2.8 0.7 6.7
Enes Kanter 634.0 222.0 21.8 4.8 8.8 54.2 0.1 0.2 28.7 2.0 ... 77.6 2.9 4.7 7.6 0.9 0.5 0.4 2.2 1.5 11.6
Gordon Hayward 634.0 472.0 30.8 5.2 11.6 45.1 1.3 3.6 36.6 3.5 ... 82.2 0.7 3.6 4.4 3.5 0.4 1.0 1.7 2.0 15.3

5 rows × 21 columns

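The stats for each player are now organized in a neat DataFrame. Here is a legend for what each of the abbreviations means:

GP: games played
GS: games started
MIN: minutes per game
FGM, FGA: field goals made and attempted per game
FG%: field goal percentage
3PTM, 3PTA: 3-point field goals made and attempted per game
3P%: 3-point field goal percentage
FTM, FTA: free throws made and attempted per game
FT%: free throw percentage
OR: offensive rebounds per game
DR: defensive rebounds per game
REB: rebounds per game
AST: assists per game
BLK: blocks per game
STL: steals per game
PF: fouls per game
TO: turnovers per game
PTS: points per game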

I'll also export these stats to a csv file:

career_stats_df.to_csv("NBA_player_stats_all_mar2020.csv")

Joining and cleaning the data

We will now join career_stats_df with all_players_df, which will merge the content from both data frames based on rows that have the same index (player name). Players in all_players_df that are not included in career_stats_df will have NaN values for the joined columns.

all_stats_df = all_players_df.join(career_stats_df)
all_stats_df.head(5)
age birthDate birthPlace college experience guid hand headshot height id ... FT% OR DR REB AST BLK STL PF TO PTS
Brad Wanamaker 30 07/25/89 Philadelphia, PA Pittsburgh 1 5aad35bbbb760e3958107639266768ae NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 3" 6507 ... 91.7 0.2 1.4 1.7 2.2 0.1 0.6 1.6 0.9 5.6
Carsen Edwards 22 03/12/98 Houston, TX Purdue 0 4b8ebdfd01221567925035c1e0d0c337 NaN https://a.espncdn.com/i/headshots/nba/players/... 5' 11" 4066407 ... 84.6 0.2 1.1 1.2 0.6 0.1 0.3 0.9 0.4 3.0
Daniel Theis 27 04/04/92 Germany NaN 2 ce75206c087f83ace6f9a8e3efbd9671 NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 8" 2451037 ... 75.3 1.6 3.1 4.7 1.2 0.9 0.5 2.8 0.7 6.7
Enes Kanter 27 05/20/92 Switzerland Kentucky 8 1e039b407b3daa6eeac69432aa6413fd NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 10" 6447 ... 77.6 2.9 4.7 7.6 0.9 0.5 0.4 2.2 1.5 11.6
Gordon Hayward 30 03/23/90 Indianapolis, IN Butler 9 56f675cb8f40a5aaee5f5747ec9099c5 NaN https://a.espncdn.com/i/headshots/nba/players/... 6' 7" 4249 ... 82.2 0.7 3.6 4.4 3.5 0.4 1.0 1.7 2.0 15.3

5 rows × 38 columns

The performance stats have been added as columns on the right side of the DataFrame.
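Because join() aligns rows on the index, it relies on player names being unique; two players sharing a name would be matched to the same stats row. The sketch below uses invented frames to show both the NaN behaviour for players without stats and a quick uniqueness check that could be run on all_players_df before joining.

```python
import pandas as pd

# Invented data: "Player C" has roster info but no career stats.
info_df = pd.DataFrame(
    {"age": [30, 25, 19]}, index=["Player A", "Player B", "Player C"]
)
stats_df = pd.DataFrame({"PTS": [5.6, 3.0]}, index=["Player A", "Player B"])

# join() is a left join on the index by default, so all rows of info_df are kept.
joined = info_df.join(stats_df)
print(joined)

# Index-based joins assume unique labels; these checks flag repeated names.
print(info_df.index.is_unique)
print(info_df.index[info_df.index.duplicated()].tolist())
```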

We notice that some of the columns which should contain numerical data such as salary, height, and weight are instead considered strings. This is because they contain non-numerical characters (such as '$' and ',' for salary). To be able to compute statistics on these columns, we need to convert them to numeric values.

We can convert salaries to numeric values by removing all non-numeric characters and converting to int with a list comprehension (missing salaries remain NaN, so pandas stores the column as float64):

# before converting
all_stats_df['salary'].head(3)
Brad Wanamaker    $1,445,697
Carsen Edwards    $1,228,026
Daniel Theis      $5,000,000
Name: salary, dtype: object
all_stats_df['salary']=[int(re.sub(r'[^\d.]+', '', s)) if isinstance(s, str) else s for s in all_stats_df['salary'].values]
# after converting
all_stats_df['salary'].head(3)
Brad Wanamaker    1445697.0
Carsen Edwards    1228026.0
Daniel Theis      5000000.0
Name: salary, dtype: float64
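The same clean-up can also be written with pandas string methods instead of a list comprehension. This is an alternative sketch rather than the approach used above: `clean_salary` is a hypothetical helper, and it relies on Series.str.replace with regex=True and on pd.to_numeric.

```python
import numpy as np
import pandas as pd


def clean_salary(salary_column):
    """Strip '$' and ',' from salary strings and return a numeric Series."""
    digits_only = salary_column.str.replace(r"[^\d.]+", "", regex=True)
    return pd.to_numeric(digits_only)  # missing salaries stay NaN


salaries = pd.Series(
    ["$1,445,697", "$5,000,000", np.nan],
    index=["Brad Wanamaker", "Daniel Theis", "Chasson Randle"],
)
cleaned = clean_salary(salaries)
print(cleaned)
```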

Height is also provided in a non-numeric form, in feet plus inches (e.g. 6' 5"). We should convert this to a numeric form so that statistics can be calculated. To do so, we will write a small function that converts a string of feet plus inches into a numeric value of total inches, convert_height.

def convert_height(height):
    split_height = height.split(" ")
    feet = float(split_height[0].replace("\'",""))
    inches = float(split_height[1].replace("\"",""))
    return (feet*12 + inches)
# before conversion
all_stats_df['height'].head(3)
Brad Wanamaker     6' 3"
Carsen Edwards    5' 11"
Daniel Theis       6' 8"
Name: height, dtype: object
all_stats_df['height'] = [convert_height(x) for x in all_stats_df['height']]
# after conversion
all_stats_df['height'].head(3)
Brad Wanamaker    75.0
Carsen Edwards    71.0
Daniel Theis      80.0
Name: height, dtype: float64
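convert_height assumes that every value is a string of the form feet' inches". If a height were missing (NaN), height.split would raise an AttributeError. A more defensive variant, sketched here under the hypothetical name `height_to_inches`, uses a regular expression and returns NaN for anything it cannot parse.

```python
import math
import re

HEIGHT_REGEX = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")


def height_to_inches(height):
    """Convert a string like 6' 3" to total inches; return NaN if unparseable."""
    if not isinstance(height, str):
        return math.nan
    match = HEIGHT_REGEX.match(height)
    if match is None:
        return math.nan
    feet, inches = match.groups()
    return float(feet) * 12 + float(inches)


print(height_to_inches("6' 3\""))   # 75.0
print(height_to_inches("5' 11\""))  # 71.0
print(height_to_inches(None))       # nan
```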

Weight is also a non-numerical field, because of the units listed (e.g. 'weight': '230 lbs'). We will simply strip off the units for each entry by splitting the string in half with split(" ") and taking the left side of the split.

# before conversion
all_stats_df['weight'].head(3)
Brad Wanamaker    210 lbs
Carsen Edwards    200 lbs
Daniel Theis      245 lbs
Name: weight, dtype: object
all_stats_df['weight'] = [float(x.split(" ")[0]) for x in all_stats_df['weight']]
# after conversion
all_stats_df['weight'].head(3)
Brad Wanamaker    210.0
Carsen Edwards    200.0
Daniel Theis      245.0
Name: weight, dtype: float64

This should be the last of the values we have to convert to numeric. Now, we have a cleaned-up and joined dataset! I'll save the data to a csv file.

all_stats_df.to_csv("NBA_player_info_and_stats_joined_mar2020.csv")

If you want to read in the data at a later time, you can use read_csv() like so:

all_stats_df = pd.read_csv("NBA_player_info_and_stats_joined_mar2020.csv", index_col=0)

Calculating statistics

We can use the data we just compiled to calculate some statistics. Let's start by calculating average stats per team, using groupby() with mean() in pandas.

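# calculate means of the numeric columns and remove irrelevant columns for id and jersey number
# (numeric_only=True is needed in pandas 2.0+, where mean() no longer skips text columns automatically)
mean_df = all_stats_df.groupby('team').mean(numeric_only=True).drop(columns=['id','jersey'], errors='ignore')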
mean_df
age experience height salary weight GP GS MIN FGM FGA ... FT% OR DR REB AST BLK STL PF TO PTS
team
atlanta-hawks 25.538462 4.307692 79.000000 5.608001e+06 218.615385 314.153846 200.076923 22.215385 3.815385 8.146154 ... 74.423077 1.223077 3.184615 4.407692 2.323077 0.569231 0.669231 2.146154 1.430769 10.300000
boston-celtics 25.000000 2.500000 77.687500 7.228917e+06 224.062500 218.000000 121.687500 17.718750 2.787500 6.212500 ... 76.050000 0.712500 2.356250 3.081250 1.562500 0.356250 0.612500 1.631250 1.006250 7.662500
brooklyn-nets 25.882353 4.529412 78.529412 7.928121e+06 217.352941 310.823529 214.000000 20.905882 3.482353 7.652941 ... 71.711765 0.864706 2.970588 3.835294 2.047059 0.476471 0.641176 1.688235 1.288235 9.582353
charlotte-hornets 24.266667 2.733333 78.733333 6.772335e+06 216.066667 203.400000 103.466667 19.566667 2.626667 5.973333 ... 75.086667 0.933333 2.773333 3.680000 1.506667 0.440000 0.626667 1.746667 1.000000 7.013333
chicago-bulls 24.666667 2.533333 79.000000 5.392607e+06 217.266667 199.400000 105.666667 20.033333 3.100000 6.866667 ... 73.613333 0.793333 2.580000 3.360000 1.826667 0.326667 0.726667 1.773333 1.020000 8.180000
cleveland-cavaliers 24.866667 2.933333 78.333333 8.744085e+06 224.333333 253.615385 165.538462 20.376923 3.184615 6.953846 ... 62.484615 1.246154 3.069231 4.300000 1.638462 0.353846 0.546154 1.692308 1.200000 8.407692
dallas-mavericks 26.500000 3.250000 79.250000 7.593353e+06 220.333333 237.750000 105.666667 19.083333 3.233333 7.058333 ... 70.833333 0.791667 2.725000 3.533333 1.808333 0.466667 0.575000 1.566667 1.033333 8.933333
denver-nuggets 25.928571 4.285714 79.285714 8.798127e+06 224.142857 347.153846 185.307692 20.976923 3.423077 7.330769 ... 75.776923 1.061538 2.946154 3.984615 2.030769 0.469231 0.700000 1.846154 1.107692 9.092308
detroit-pistons 25.000000 3.411765 78.000000 6.505785e+06 208.529412 235.058824 132.294118 18.876471 2.911765 6.652941 ... 64.917647 0.711765 2.335294 3.029412 1.723529 0.341176 0.517647 1.635294 1.000000 7.900000
golden-state-warriors 25.076923 3.153846 77.538462 1.173546e+07 209.153846 244.692308 187.692308 24.076923 4.023077 9.100000 ... 77.800000 0.792308 2.853846 3.661538 2.400000 0.384615 0.784615 2.115385 1.430769 11.123077
houston-rockets 28.800000 7.266667 77.666667 7.617278e+06 212.933333 494.400000 332.266667 21.860000 3.353333 7.773333 ... 71.300000 0.820000 2.766667 3.600000 2.013333 0.406667 0.773333 1.953333 1.406667 9.706667
indiana-pacers 25.250000 3.500000 78.500000 7.942772e+06 214.500000 251.166667 116.250000 19.841667 3.241667 7.150000 ... 76.050000 0.666667 2.725000 3.416667 1.716667 0.491667 0.625000 1.750000 1.066667 8.633333
la-clippers 27.250000 5.312500 78.437500 7.520664e+06 217.500000 362.562500 185.875000 20.318750 3.337500 7.175000 ... 74.587500 1.018750 2.862500 3.893750 1.862500 0.406250 0.662500 1.856250 1.087500 9.112500
los-angeles-lakers 29.133333 7.733333 78.666667 6.905793e+06 222.133333 579.928571 425.071429 25.071429 4.528571 9.514286 ... 70.414286 1.121429 3.707143 4.807143 2.507143 0.678571 0.914286 1.957143 1.614286 11.985714
memphis-grizzlies 24.800000 2.866667 78.866667 6.021349e+06 219.066667 234.857143 100.000000 20.107143 3.200000 6.800000 ... 73.071429 0.935714 2.892857 3.821429 1.950000 0.421429 0.635714 1.692857 1.171429 8.328571
miami-heat 27.000000 4.937500 79.000000 8.439172e+06 222.187500 381.266667 210.200000 22.126667 3.326667 7.326667 ... 71.233333 0.840000 2.980000 3.833333 2.013333 0.313333 0.713333 1.840000 1.140000 9.180000
milwaukee-bucks 29.466667 7.666667 78.733333 7.836516e+06 227.200000 565.066667 371.266667 23.466667 3.820000 8.233333 ... 77.946667 0.973333 3.133333 4.120000 1.966667 0.513333 0.760000 1.953333 1.266667 10.413333
minnesota-timberwolves 23.571429 2.285714 77.857143 4.248132e+06 216.214286 172.500000 56.000000 16.978571 2.442857 5.721429 ... 74.914286 0.657143 2.128571 2.764286 1.457143 0.350000 0.585714 1.507143 0.850000 6.535714
new-orleans-pelicans 25.200000 3.400000 77.933333 7.031152e+06 221.000000 244.200000 150.066667 22.960000 3.766667 8.046667 ... 64.346667 1.040000 2.973333 4.020000 2.193333 0.506667 0.720000 2.020000 1.440000 9.946667
new-york-knicks 25.066667 3.666667 78.266667 6.224969e+06 217.933333 290.357143 137.785714 21.492857 3.342857 7.371429 ... 72.535714 1.035714 2.857143 3.914286 1.914286 0.464286 0.657143 1.892857 1.214286 8.757143
oklahoma-city-thunder 25.066667 3.866667 78.533333 9.111964e+06 217.000000 281.533333 187.466667 18.613333 2.560000 5.713333 ... 61.446667 0.793333 2.320000 3.106667 1.566667 0.400000 0.680000 1.673333 0.920000 6.973333
orlando-magic 24.833333 3.166667 79.750000 9.398300e+06 219.083333 234.416667 136.916667 19.333333 3.016667 6.750000 ... 71.383333 0.966667 2.866667 3.841667 1.458333 0.525000 0.625000 1.575000 0.900000 7.816667
philadelphia-76ers 25.714286 3.714286 78.642857 8.825191e+06 218.500000 263.214286 156.785714 20.685714 3.492857 7.550000 ... 67.207143 0.857143 3.164286 4.021429 2.107143 0.571429 0.671429 1.835714 1.314286 9.392857
phoenix-suns 24.142857 2.357143 78.857143 5.896355e+06 213.142857 191.071429 101.357143 19.621429 3.092857 6.892857 ... 82.428571 0.907143 2.678571 3.578571 2.007143 0.314286 0.657143 1.764286 1.157143 8.385714
portland-trail-blazers 25.400000 4.666667 80.000000 8.704394e+06 226.533333 336.066667 247.000000 20.186667 3.626667 7.993333 ... 70.940000 1.026667 3.040000 4.080000 1.640000 0.473333 0.620000 1.913333 1.240000 9.786667
sacramento-kings 26.153846 3.769231 78.000000 7.316023e+06 214.538462 298.153846 142.615385 19.369231 3.238462 7.061538 ... 63.053846 0.784615 2.530769 3.323077 1.584615 0.353846 0.607692 1.653846 0.992308 8.492308
san-antonio-spurs 26.200000 4.933333 78.600000 7.264785e+06 219.733333 371.866667 218.533333 18.746667 3.120000 7.026667 ... 66.606667 0.740000 2.493333 3.240000 1.753333 0.353333 0.553333 1.393333 0.946667 8.213333
toronto-raptors 25.937500 3.562500 78.437500 7.590898e+06 214.937500 268.562500 169.625000 17.556250 2.606250 5.893750 ... 76.187500 0.712500 2.487500 3.200000 1.618750 0.450000 0.593750 1.668750 0.912500 7.143750
utah-jazz 25.857143 3.785714 77.928571 8.142802e+06 219.928571 293.071429 170.214286 18.207143 3.057143 6.535714 ... 57.978571 0.800000 2.514286 3.328571 1.757143 0.371429 0.542857 1.457143 1.142857 8.185714
washington-wizards 25.133333 3.333333 78.733333 7.772450e+06 216.733333 231.333333 109.000000 19.233333 3.080000 6.620000 ... 74.406667 0.853333 2.493333 3.346667 1.973333 0.366667 0.606667 1.920000 1.113333 8.360000

30 rows × 26 columns

As you can see, the returned DataFrame is now indexed by team, and each numerical column holds the mean value for that team. To find the teams with the highest averages for a specific stat, we can use the sort_values() method. Let's find the top 5 teams with the highest average salary:

mean_df.sort_values('salary', ascending=False).head(5)
age experience height salary weight GP GS MIN FGM FGA ... FT% OR DR REB AST BLK STL PF TO PTS
team
golden-state-warriors 25.076923 3.153846 77.538462 1.173546e+07 209.153846 244.692308 187.692308 24.076923 4.023077 9.100000 ... 77.800000 0.792308 2.853846 3.661538 2.400000 0.384615 0.784615 2.115385 1.430769 11.123077
orlando-magic 24.833333 3.166667 79.750000 9.398300e+06 219.083333 234.416667 136.916667 19.333333 3.016667 6.750000 ... 71.383333 0.966667 2.866667 3.841667 1.458333 0.525000 0.625000 1.575000 0.900000 7.816667
oklahoma-city-thunder 25.066667 3.866667 78.533333 9.111964e+06 217.000000 281.533333 187.466667 18.613333 2.560000 5.713333 ... 61.446667 0.793333 2.320000 3.106667 1.566667 0.400000 0.680000 1.673333 0.920000 6.973333
philadelphia-76ers 25.714286 3.714286 78.642857 8.825191e+06 218.500000 263.214286 156.785714 20.685714 3.492857 7.550000 ... 67.207143 0.857143 3.164286 4.021429 2.107143 0.571429 0.671429 1.835714 1.314286 9.392857
denver-nuggets 25.928571 4.285714 79.285714 8.798127e+06 224.142857 347.153846 185.307692 20.976923 3.423077 7.330769 ... 75.776923 1.061538 2.946154 3.984615 2.030769 0.469231 0.700000 1.846154 1.107692 9.092308

5 rows × 26 columns
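The group-then-sort pattern used above can be tried on a tiny, made-up table. This is only a sketch: the names toy_df, toy_mean_df and toy_ranked and all of the numbers are hypothetical, and it assumes the per-team means come from a groupby on the 'team' column.

```python
import pandas as pd

# Toy data (made-up numbers, not scraped from ESPN).
toy_df = pd.DataFrame(
    {
        "team": ["team-a", "team-a", "team-b", "team-b", "team-c"],
        "position": ["G", "F", "C", "G", "F"],
        "salary": [10.0, 20.0, 5.0, 7.0, 30.0],
        "PTS": [12.0, 18.0, 6.0, 8.0, 25.0],
    }
)

# Average every numeric column per team; the team name becomes the index.
# numeric_only=True skips text columns such as 'position'.
toy_mean_df = toy_df.groupby("team").mean(numeric_only=True)

# Rank teams by average salary, highest first.
toy_ranked = toy_mean_df.sort_values("salary", ascending=False)
print(toy_ranked)
```

Passing ascending=True instead would list the teams with the lowest averages first.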

Looks like the highest average salary is paid by the Golden State Warriors. Similarly, we can find the top 10 highest paid players by sorting all_stats_df on salary, then pulling out the top entries for the 'salary' and 'team' columns:

all_stats_df.sort_values('salary', ascending=False)[['salary','team']].head(10)
salary team
Stephen Curry 40231758.0 golden-state-warriors
Russell Westbrook 38506482.0 houston-rockets
Chris Paul 38506482.0 oklahoma-city-thunder
Kevin Durant 38199000.0 brooklyn-nets
James Harden 38199000.0 houston-rockets
John Wall 38199000.0 washington-wizards
LeBron James 37436858.0 los-angeles-lakers
Kyle Lowry 34996296.0 toronto-raptors
Blake Griffin 34449964.0 detroit-pistons
Kemba Walker 32742000.0 boston-celtics

Stephen Curry is the highest paid player in the NBA with a whopping salary of $40,231,758, followed by Russell Westbrook and Chris Paul, who are tied at $38,506,482. We can continue to sift through the data this way for whatever piques our interest. Given how many different variables there are, we can write a small function to make things easier:

def top_n(df, category, n):
    return (df.sort_values(category, ascending=False)[[category,'team']].head(n))

This way, we can quickly identify the top n players for any given category in a DataFrame. Let's cycle through some stats of interest:

top_n(all_stats_df, 'PTS', 5)
PTS team
LeBron James 27.1 los-angeles-lakers
Kevin Durant 27.0 brooklyn-nets
James Harden 25.1 houston-rockets
Luka Doncic 24.4 dallas-mavericks
Joel Embiid 24.1 philadelphia-76ers

top_n(all_stats_df, 'REB', 5)
REB team
Andre Drummond 13.8 cleveland-cavaliers
Dwight Howard 12.3 los-angeles-lakers
Hassan Whiteside 11.8 portland-trail-blazers
Joel Embiid 11.5 philadelphia-76ers
Kevin Love 11.1 cleveland-cavaliers

top_n(all_stats_df, 'height', 5)
height team
Tacko Fall 89.0 boston-celtics
Boban Marjanovic 88.0 dallas-mavericks
Kristaps Porzingis 87.0 dallas-mavericks
Moses Brown 86.0 portland-trail-blazers
Bol Bol 86.0 denver-nuggets

top_n(all_stats_df, 'weight', 5)
weight team
Tacko Fall 311.0 boston-celtics
Jusuf Nurkic 290.0 portland-trail-blazers
Boban Marjanovic 290.0 dallas-mavericks
Nikola Jokic 284.0 denver-nuggets
Zion Williamson 284.0 new-orleans-pelicans

Interestingly, Tacko Fall of the Boston Celtics is both the tallest and the heaviest player in the NBA.
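An observation like this can also be checked programmatically rather than by eye. The sketch below runs the same top_n helper on a made-up table and uses idxmax() to look up the leader of each category. The player names, team names and numbers are hypothetical placeholders.

```python
import pandas as pd

def top_n(df, category, n):
    # Same helper as above: sort descending, keep the stat and the team.
    return df.sort_values(category, ascending=False)[[category, "team"]].head(n)

# Toy data indexed by player name (made-up values).
toy_players = pd.DataFrame(
    {
        "team": ["team-a", "team-b", "team-c"],
        "height": [80.0, 89.0, 75.0],
        "weight": [220.0, 311.0, 190.0],
    },
    index=["Player A", "Player B", "Player C"],
)

# idxmax() returns the index label (here, the player) holding the maximum.
leaders = {stat: toy_players[stat].idxmax() for stat in ["height", "weight"]}
print(leaders)
print(top_n(toy_players, "height", 2))
```

If the same name appears for both keys in leaders, one player tops both categories.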

To get a high-level overview of how the statistics correlate with one another, we can generate a correlation matrix using corr() and plot it with matplotlib.

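import matplotlib.pyplot as plt

# drop identifier columns and keep only numeric columns before computing pairwise correlations
corr_matrix = all_stats_df.drop(columns=['id','jersey']).select_dtypes(include='number').corr()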

f = plt.figure(figsize=(19, 15))
plt.matshow(corr_matrix, fignum=f.number)
plt.xticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14, rotation=45, ha = 'left')
plt.yticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

[Figure: Correlation Matrix]

We can learn a lot about how different statistics are associated with each other from this matrix, and also identify some interesting trends. For example:

We can narrow in on correlations of interest by sorting the correlation matrix. Let's try sorting by salary and identifying the top correlates:

corr_matrix.sort_values('salary', ascending=False)['salary'].head(10)
salary    1.000000
PTS       0.712635
FTM       0.707054
GS        0.703869
FGM       0.699154
FGA       0.686631
FTA       0.681934
MIN       0.663697
TO        0.648611
STL       0.602140
Name: salary, dtype: float64

As we suspected, points per game (PTS) is most highly correlated with salary, followed by other point-related stats such as free throws made (FTM). Games started (GS) is also highly correlated with salary, which makes sense since highly paid players are typically better and will be starters.
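Every variable correlates perfectly with itself, so the salary row always comes first at 1.0 when sorting this way. A small variation drops that row before sorting. The sketch below uses made-up numbers, and toy_stats, toy_corr and top_correlates are hypothetical names.

```python
import pandas as pd

# Toy data (made-up values) with one perfectly correlated column,
# one partially correlated column, and one perfectly anti-correlated column.
toy_stats = pd.DataFrame(
    {
        "salary": [1.0, 2.0, 3.0, 4.0],
        "PTS": [2.0, 4.0, 6.0, 8.0],
        "TO": [1.0, 3.0, 2.0, 4.0],
        "age": [4.0, 3.0, 2.0, 1.0],
    }
)

toy_corr = toy_stats.corr()

# Take the salary column, drop salary's correlation with itself,
# then sort the remaining correlates from highest to lowest.
top_correlates = toy_corr["salary"].drop("salary").sort_values(ascending=False)
print(top_correlates)
```

Sorting with ascending=True instead would surface the strongest negative correlates first.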

If we want to model how much more a player costs based on increases in points per game, an easy way is to use linear regression (OLS). To do so, we will use scikit-learn. The LinearRegression estimator cannot handle null values, so we will first remove players that don't have reported salaries or PTS values:

from sklearn.linear_model import LinearRegression
# remove rows with null values for regression
reg_df = all_stats_df[['salary', 'PTS']].dropna()

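Then, we will fit the model with the predictor variable (X) being PTS and the dependent variable (Y) being salary. We will set fit_intercept=False, which forces the fitted line through the origin so that a player averaging 0 PTS is predicted a salary of $0. The rationale is that players cannot be paid less than $0.00 or score fewer than 0 PTS: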

X = reg_df['PTS'].values.reshape(-1,1)
Y = reg_df['salary'].values.reshape(-1,1)
reg = LinearRegression(fit_intercept=False).fit(X,Y)
y_pred = reg.predict(X)
plt.figure(figsize=(12, 6))
plt.scatter(X, Y)
plt.plot(X, y_pred, color='red')
plt.xlabel("Points per game (Career)")
plt.ylabel("Salary (2020)")
plt.title('Salary vs PTS - simple linear regression', fontsize=16);

[Figure: Salary vs PTS - simple linear regression]

Consistent with the positive correlation we calculated previously, a regression line with a positive slope is fitted to the data. We can extract the slope of the line by getting the coefficient using .coef_:

print(reg.coef_)
[[947619.16030932]]

This was only meant to be a demonstration of what can be done with the data that we scraped. Better models can definitely be built, especially given the nature of the data. Just by looking at the fit above, we can see that the residuals will be heteroskedastic. There is also a small number of players with high career points per game but low salaries, in the bottom right corner of the plot, who are skewing the regression line.

Taking into account these caveats, the value of the slope is ~947619.16. This suggests that for every unit increase in points per game, the predicted salary paid to a player increases by $947,619.16! Looks like making that free throw really does count.
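With a single predictor and no intercept, the least-squares slope has a closed form: the sum of x*y divided by the sum of x squared. The sketch below computes it in plain Python on made-up numbers, to show what kind of quantity .coef_ reports in this setup. The function name and the toy values are hypothetical and are not taken from the scraped data.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b for the model y = b * x (no intercept term)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / sum_xx

# Toy data (made-up): points per game and salary in dollars.
toy_pts = [5.0, 10.0, 20.0]
toy_salary = [4_000_000.0, 11_000_000.0, 19_000_000.0]

b = slope_through_origin(toy_pts, toy_salary)

# Predicted salary is simply slope times points per game.
predicted = [b * x for x in toy_pts]
print(round(b, 2), [round(p, 2) for p in predicted])
```

Because the line is pinned to the origin, the slope is the only fitted parameter, and every prediction is that slope multiplied by PTS.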

Conclusion

Here, I used Python to scrape ESPN for statistics on all the players in the NBA using the urllib and re packages. Then, I used pandas and scikit-learn to organize the data and calculate some summary statistics.

I hope what you've learned from this project will help you out on your own web scraping quests. The techniques that I've outlined here should be broadly applicable for other websites. In general, webpages that link to subpages within the same site will construct their links in some sort of standardized pattern. If so, you can construct URLs for the subpages and loop through them as we have done here. Next time you find yourself flipping through a website and copy-pasting, consider trying to automate the process using Python!
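As a rough illustration of that last idea, the sketch below builds subpage URLs from a template and a list of identifiers, without making any network requests. The URL pattern, the example.com domain and the function name are hypothetical placeholders, so inspect the links on your target site to find its real pattern.

```python
# Hypothetical URL pattern; a real site will have its own scheme.
ROSTER_URL_TEMPLATE = "https://example.com/teams/{slug}/roster"

def build_roster_urls(team_slugs):
    """Map each team slug to the URL of its (hypothetical) roster page."""
    return {slug: ROSTER_URL_TEMPLATE.format(slug=slug) for slug in team_slugs}

urls = build_roster_urls(["boston-celtics", "utah-jazz"])

# Each URL could then be fetched and parsed in a loop.
for slug, url in urls.items():
    print(slug, url)
```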