BurntSushi / nfldb

A library to manage and update NFL data in a relational database.
The Unlicense
1.08k stars 264 forks source link

Data before 2009 #82

Open andr3w321 opened 9 years ago

andr3w321 commented 9 years ago

Are there are any scripts already written to get data before 2009? I see that data at http://www.nfl.com/ajax/scorestrip?season=2014&seasonType=REG&week=1 goes back to 1970 for regular season. Is there an api doc page for nfl.com somewhere? This only gives gsis_ids, datetimes and scores. I'd be interested in more stats even if they're not complete drive by drive stats.

Update: I see you can match the gsis id with gamebook data here: http://www.nflgsis.com/1999/Reg/15/796/Gamebook.xml

In a somewhat unrelated note is it normal for nflgame-update-players to require sudo? I installed with pip install nfldb --user and I get I do not have write access to "/usr/local/lib/python2.7/dist-packages/nflgame/players.json". when I try running it without.

BurntSushi commented 9 years ago

Are there are any scripts already written to get data before 2009?

Not that I'm aware of.

I see that data at http://www.nfl.com/ajax/scorestrip?season=2014&seasonType=REG&week=1 goes back to 1970 for regular season.

This is only the schedule.

Is there an api doc page for nfl.com somewhere?

NFL.com has no API. It has an undocumented JSON feed that goes back to 2009. I'm not aware of any documentation, although this part of nflgame maps statistical category identifiers to names.

Update: I see you can match the gsis id with gamebook data here: http://www.nflgsis.com/1999/Reg/15/796/Gamebook.xml

This contains very little structured data. Notably, the plays are encoded as human readable descriptions only. Parsing them to get the granularity of detail expected by nfldb is hard. (But I've wanted to try for a long time.)

In a somewhat unrelated note is it normal for nflgame-update-players to require sudo?

Only if you've installed nflgame to a system directory. It updates a JSON file that is stored inside the package directory.

I installed with pip install nfldb --user and I get I do not have write access to "/usr/local/lib/python2.7/dist-packages/nflgame/players.json". when I try running it without.

Not sure. On my system, --user installs to my home directory. You might try a virtual environment:

$ virtualenv nfl
$ source nfl/bin/activate
$ pip install nfldb
$ nflgame-update-players
andr3w321 commented 9 years ago

Thanks for the reply and great project. I initially mistakenly installed the project with sudo pip install nfldb but when I removed it with sudo pip uninstall nfldb the lib files were not removed. After manually deleting the files and reinstalling locally it's working fine without sudo

The NFL.com json feed has a bit more than schedule. It lists scores too. I'll write up something to download and import them.

BurntSushi commented 9 years ago

You shouldn't have to write anything. I think nflgame can handle them: https://github.com/BurntSushi/nflgame/blob/master/nflgame/update_sched.py

andr3w321 commented 9 years ago

Wish I'd read this earlier. Just got done downloading them all. I'll post the download script and link to tarball of them anyways. The script did skip a few that I had to download manually later. I'm assuming due to server errors so I wouldn't 100% trust it to get them all with out double checking later.

https://drive.google.com/file/d/0BwoaqcZEOIvoLXpuY3FCelVKV2M/view?usp=sharing

import urllib, os
import time

download_folder = "nfl-schedules"
root_url = "http://www.nfl.com/ajax/scorestrip?"

if not os.path.exists(download_folder):
    os.makedirs(download_folder)

def download_xml(season, week, season_type):
    # download xml file schedule from NFL.com
    url = root_url + "season=" + str(season) + "&seasonType=" + season_type + "&week=" + str(week)
    filename = download_folder + "/" + str(season) + "-" + str(week) + "-" + season_type + ".xml"
    urllib.urlretrieve(url, filename)

    # delete file if it contains no games
    file = open(filename, 'r')
    if not "gms" in file.read():
        os.remove(filename)
    file.close()

    # be nice and sleep 2 seconds between requests
    time.sleep(2)

for season in range(1970,2015):
    season_type = "REG"
    for week in range(1,18):
        download_xml(season, week, season_type)
    season_type = "PRE"
    for week in range(1,5):
        download_xml(season, week, season_type)
    season_type = "POST"
    for week in range(15,23):
        download_xml(season, week, season_type)
andr3w321 commented 9 years ago

I was able to get the data into the schedule.json file, but I'm having a hard time figuring out how to get it in the postgres db? Where is the code that imports the json file? I created a modified update_sched.py file here https://github.com/andr3w321/nflgame/blob/master/nflgame/update_sched.py with a function to import the folder I linked to above python update_sched.py --build-old ../../nfl-schedules I added home_score and away_score to lines 106 and 107 to the json so I'm not sure where else to update these field names.

BurntSushi commented 9 years ago

https://github.com/BurntSushi/nfldb/blob/master/nfldb/update.py#L296-L308

andr3w321 commented 9 years ago

Without some help I think I'm going to give up attempting to add a pull request and just create a csv and work directly with sql to get the data in the db.

I edited the file https://github.com/andr3w321/nflgame/blob/master/nflgame/update_sched.py to create the new function that successfully adds the schedules for the old games after running python update_sched.py --build-old ../../nfl-schedules and then ~/.local/bin/nfldb-update However, there's a few issues

  1. It's slow. The function doing the update https://github.com/BurntSushi/nfldb/blob/master/nfldb/update.py#L178-221 is slow with a batch size of 5 games when adding 9,625 new games.
  2. The following team names are missing: ['BOS', 'Boston', 'Patriots'], ['LA', 'Los Angeles', 'RaidersorRams'], ['PHO', 'Phoenix', 'Cardinals'], ['RAI', 'LAorOakland', 'Raiders'], ['RAM', 'LAorSTL', 'Rams'] I'm not sure of the fields with or in them obviously. I tried adding these teams to https://github.com/BurntSushi/nfldb/blob/master/nfldb/team.py but when I do and run nfldb-update it errors out with psycopg2.IntegrityError: insert or update on table "game" violates foreign key constraint "game_home_team_fkey" DETAIL: Key (home_team)=(LA) is not present in table "team". As of now it works okay, just every time the xml file references BOS, LA, PHO, RAI, or RAM it will input UNK in the home or away team DB entry.
  3. I still can't figure out how to get the scores into the database. I thought by adding https://github.com/andr3w321/nflgame/blob/master/nflgame/update_sched.py#L109-110 which matches https://github.com/BurntSushi/nfldb/blob/master/nfldb/types.py#L2096 and https://github.com/BurntSushi/nfldb/blob/master/nfldb/types.py#L2104 it would take care of it, but all the scores show up as 0 for the new games added.

I uploaded my schedule.json file here: https://drive.google.com/file/d/0BwoaqcZEOIvoeXdHRDV0SlJPbWc/view?usp=sharing

BurntSushi commented 9 years ago

is slow with a batch size of 5 games when adding 9,625 new games.

Increase the batch size? The batch size should be increased when doing large bulk loads. It is kept small by default since most updates are small and incremental. Larger batch sizes require more memory.

I'm not sure of the fields with or in them obviously. I tried adding these teams to https://github.com/BurntSushi/nfldb/blob/master/nfldb/team.py but when I do and run nfldb-update it errors out with psycopg2.IntegrityError: insert or update on table "game" violates foreign key constraint "game_home_team_fkey" DETAIL: Key (home_team)=(LA) is not present in table "team". As of now it works okay, just every time the xml file references BOS, LA, PHO, RAI, or RAM it will input UNK in the home or away team DB entry.

A foreign key constraint means the teams aren't in the DB. You'll probably need to add a migration in nfldb/db.py.

My goodness am I glad we have FK constraints. They've prevented innumerable bugs from ever happening!

I still can't figure out how to get the scores into the database.

Not sure.

Can you submit a PR anyway? I might be able to just build on what you've done and finish it. (It might be a little though.)

andr3w321 commented 9 years ago

Thanks for your help. I wrote some code to look at the team names and mascots. The only city name in question is RAI, which looks like it refers to the Los Angeles Raiders http://www.pro-football-reference.com/teams/rai/ The problem with just adding the unknown teams listed above is it will describe the 1983 Baltimore Ravens which were actually the Colts so to get it really done correctly would require adding the below teams.txt table to the db. I submitted a pull request with the update_sched.py --build-old method I added.

print_teams_table.py

import xml.dom.minidom as xml
import os
import re 

nfl_schedules_path = './nfl-schedules'

def sort_nicely( l ): 
  """ Sort the given list in the way that humans expect. 
  """ 
  convert = lambda text: int(text) if text.isdigit() else text 
  alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
  l.sort( key=alphanum_key ) 

def get_filenames(dir_path, starts_with, ends_with):
    """
    Returns a list of all filenames in a dir
    that starts with a given string 
    and ends with a given string
    eg: get_filenames("./nfl-schedules", "1970", ".xml")
    """
    filenames = []
    try:
        for file in os.listdir(dir_path):
            if file.startswith(starts_with) and file.endswith(ends_with):
                filenames.append(file)
    except OSError:
        sys.stderr.write("could not load %s\n" % dir_path)
    return filenames

def add_sym_mascot_year(sym, mascot, year, teams):
    team_idx = -1
    for i in range(0, len(teams)):
        if teams[i][0] == sym and teams[i][1] == mascot:
            team_idx = i
            break
    if team_idx == -1:
        teams.append([sym.encode('utf-8'), mascot.encode('utf-8'), [year]])
    else:
        if year not in teams[team_idx][2]:
            teams[team_idx][2].append(year)
    return teams

xml_filenames = get_filenames(nfl_schedules_path, "", ".xml")
sort_nicely(xml_filenames)
xml_filenames.reverse()
teams = []
for xml_file in xml_filenames:
    year,week,stype = xml_file.split(".xml")[0].split("-")
    year = int(year)
    week = int(week)

    xml_filename = nfl_schedules_path + "/" + str(year) + "-" + str(week) + "-" + stype + ".xml"
    try:
        dom = xml.parse(open(xml_filename))
    except IOError:
        sys.stderr.write("could not load %s\n" % xml_filename)

    for g in dom.getElementsByTagName("g"):
        home_sym = g.getAttribute('h')
        home_mas = g.getAttribute('hnn')
        away_sym = g.getAttribute('v')
        away_mas = g.getAttribute('vnn')

        teams = add_sym_mascot_year(home_sym, home_mas, year, teams)
        teams = add_sym_mascot_year(away_sym, away_mas, year, teams)
teams.sort()
for team in teams:
    print team

python print_teams_table.py > teams.txt

['ARI', 'cardinals', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994]]
['ATL', 'falcons', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['BAL', 'colts', [1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['BAL', 'ravens', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996]]
['BOS', 'patriots', [1970]]
['BUF', 'bills', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['CAR', 'panthers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995]]
['CHI', 'bears', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['CIN', 'bengals', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['CLE', 'browns', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['DAL', 'cowboys', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['DEN', 'broncos', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['DET', 'lions', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['GB', 'packers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['HOU', 'oilers', [1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['HOU', 'texans', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002]]
['IND', 'colts', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984]]
['JAC', 'jaguars', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995]]
['KC', 'chiefs', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['LA', 'rams', [1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['MIA', 'dolphins', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['MIN', 'vikings', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['NE', 'patriots', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971]]
['NO', 'saints', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['NYG', 'giants', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['NYJ', 'jets', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['OAK', 'raiders', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['PHI', 'eagles', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['PHO', 'cardinals', [1993, 1992, 1991, 1990, 1989, 1988]]
['PIT', 'steelers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['RAI', 'raiders', [1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982]]
['RAM', 'rams', [1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982]]
['SD', 'chargers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['SEA', 'seahawks', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976]]
['SF', '49ers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['STL', 'cardinals', [1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
['STL', 'rams', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995]]
['TB', 'buccaneers', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976]]
['TEN', 'oilers', [1998, 1997]]
['TEN', 'titans', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999]]
['WAS', 'redskins', [2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970]]
andr3w321 commented 9 years ago

For a quick migration ignoring years this should work

INSERT INTO team (team_id, city, name) VALUES
('BOS', 'Boston', 'Patriots'),
('LA', 'Los Angeles', 'Rams'),
('PHO', 'Phoenix', 'Cardinals'),
('RAI', 'Los Angeles', 'Raiders'),
('RAM', 'Los Angeles', 'Rams');

Don't forget to add the same to https://github.com/BurntSushi/nfldb/blob/master/nfldb/team.py

andr3w321 commented 9 years ago

Okay I finally got the scores loaded in. It's far from an elegant solution, but it works. Some mascot names will still be incorrect. I added a folder with my scripts here: https://github.com/andr3w321/nflgame/tree/master/build-old-scripts The steps to get all the old game scores in the database are

  1. Download the nfl-schedules xml files. They're in my fork or the tar I linked to earlier.
  2. Add the old teams to the database with psql -U nfldb nfldb < add-old-team-names.sql and add the teams to your team.py file. Mine is located at ~/.local/lib/python2.7/site-packages/nfldb/team.py
  3. Update the schedule.json file by running python update_sched.py --build-old ../build-old-scripts/nfl-schedules
  4. Optionally, update the batch size in your schedule update script or just let step 5 run overnight. Mine is located at /.local/lib/python2.7/site-packages/nfldb/update.py
  5. Run nfldb-update Mine is located at ~/.local/bin/nfldb-update
  6. If you did step 4 it's a good idea to change it back to a batch size of 5
  7. Update the scores with psql -U nfldb nfldb < nfl-update-old-scores.sql
  8. See if it worked by printing out the average score per year python print-avg-ppg-per-year.py You should get the following output:
Year: 1970 # of Games: 182 Avg Points Per Game: 38.52
Year: 1971 # of Games: 182 Avg Points Per Game: 38.73
Year: 1972 # of Games: 182 Avg Points Per Game: 40.51
Year: 1973 # of Games: 182 Avg Points Per Game: 38.91
Year: 1974 # of Games: 182 Avg Points Per Game: 36.36
Year: 1975 # of Games: 182 Avg Points Per Game: 41.18
Year: 1976 # of Games: 196 Avg Points Per Game: 38.31
Year: 1977 # of Games: 196 Avg Points Per Game: 34.35
Year: 1978 # of Games: 224 Avg Points Per Game: 36.67
Year: 1979 # of Games: 224 Avg Points Per Game: 40.13
Year: 1980 # of Games: 224 Avg Points Per Game: 40.97
Year: 1981 # of Games: 224 Avg Points Per Game: 41.35
Year: 1982 # of Games: 98 Avg Points Per Game: 41.18
Year: 1983 # of Games: 224 Avg Points Per Game: 43.66
Year: 1984 # of Games: 224 Avg Points Per Game: 42.42
Year: 1985 # of Games: 224 Avg Points Per Game: 43.06
Year: 1986 # of Games: 224 Avg Points Per Game: 41.04
Year: 1987 # of Games: 210 Avg Points Per Game: 43.20
Year: 1988 # of Games: 224 Avg Points Per Game: 40.51
Year: 1989 # of Games: 224 Avg Points Per Game: 41.21
Year: 1990 # of Games: 224 Avg Points Per Game: 40.25
Year: 1991 # of Games: 224 Avg Points Per Game: 37.97
Year: 1992 # of Games: 224 Avg Points Per Game: 37.46
Year: 1993 # of Games: 210 Avg Points Per Game: 37.09
Year: 1994 # of Games: 224 Avg Points Per Game: 40.51
Year: 1995 # of Games: 240 Avg Points Per Game: 42.98
Year: 1996 # of Games: 240 Avg Points Per Game: 40.85
Year: 1997 # of Games: 240 Avg Points Per Game: 41.49
Year: 1998 # of Games: 240 Avg Points Per Game: 42.56
Year: 1999 # of Games: 248 Avg Points Per Game: 41.63
Year: 2000 # of Games: 248 Avg Points Per Game: 41.35
Year: 2001 # of Games: 248 Avg Points Per Game: 40.42
Year: 2002 # of Games: 256 Avg Points Per Game: 43.35
Year: 2003 # of Games: 256 Avg Points Per Game: 41.66
Year: 2004 # of Games: 256 Avg Points Per Game: 42.97
Year: 2005 # of Games: 256 Avg Points Per Game: 41.23
Year: 2006 # of Games: 256 Avg Points Per Game: 41.32
Year: 2007 # of Games: 256 Avg Points Per Game: 43.38
Year: 2008 # of Games: 256 Avg Points Per Game: 44.06
Year: 2009 # of Games: 256 Avg Points Per Game: 42.93
Year: 2010 # of Games: 256 Avg Points Per Game: 44.07
Year: 2011 # of Games: 256 Avg Points Per Game: 44.36
Year: 2012 # of Games: 256 Avg Points Per Game: 45.51
Year: 2013 # of Games: 256 Avg Points Per Game: 46.82
Year: 2014 # of Games: 256 Avg Points Per Game: 45.18
andr3w321 commented 9 years ago

FYI for anyone who does this I would recommend doing on a new database or after you do it renaming nfldb as it kind of breaks nfldb-update which will try to update all 9000+ games and takes much longer than usual.

jasonbio commented 9 years ago

Hey, I'm working on a project with NFL stats at http://statstrac.com/. Just wanted to leave a comment that if you do import old score data and don't want to rename or make a new nfldb, just revert your update_sched.py back to the original, then rebuild schedule.json with python update_sched.py --rebuild. Finally, edit update games_scheduled in site-packages/nfldb/update.py to: (adding select for game.season_year and ignoring if older than 2009)

def games_scheduled(cursor):
"""
Returns a list of GSIS identifiers corresponding to games that
have schedule data in the database but don't have any drives or
plays in the database. In the typical case, this corresponds to
games that haven't started yet.

The list is sorted in the order in which the games will be played.
"""
scheduled = []
cursor.execute('''
    SELECT DISTINCT game.gsis_id, game.start_time, game.season_year
    FROM game
    LEFT JOIN drive
    ON game.gsis_id = drive.gsis_id
    WHERE drive.drive_id IS NULL
''')
for row in cursor.fetchall():
    # This condition guards against unnecessarily processing games
    # that have only schedule data but aren't even close to starting yet.
    # Namely, if a game doesn't have any drives, then there's nothing to
    # bulk insert.
    #
    # We start looking at games when it's 15 minutes before game time.
    # Eventually, the game will start, and the first bits of drive/play
    # data will be bulk inserted. On the next database update, the game
    # will move to the `games_in_progress` list and updated incrementally.
    #
    # So what's the point of bulk inserting? It's useful when updates are
    # done infrequently (like the initial load of the database or say,
    # once a week).
    if seconds_delta(row['start_time'] - nfldb.now()) < 900 and row['season_year'] >= 2009:
        scheduled.append(row['gsis_id'])
return sorted(scheduled, key=int)

and that should do it :) Now you can safely run nfldb-update without worrying about an insanely long execution time and all the imported scores/legacy team names being overwritten. It's just nice to have everything in one DB

andr3w321 commented 9 years ago

Thanks for the fix. The only problem I found is that whenever I upgrade nfldb update.py gets overritten so I wrote a shell script that I run after running pip installl nfldb --upgrade --user that automatically makes the changes.

#!/bin/bash
sed -i "s/SELECT\ DISTINCT\ game.gsis_id,\ game.start_time$/SELECT\ DISTINCT\ game.gsis_id,\ game.start_time,\ game.season_year/g" ~/.local/lib/python2.7/site-packages/nfldb/update.py
sed -i "s/if\ seconds_delta(row\['start_time']\ -\ nfldb.now())\ <\ 900:/if\ seconds_delta(row\['start_time']\ -\ nfldb.now())\ <\ 900\ and\ row\['season_year']\ >=\ 2009:/g" ~/.local/lib/python2.7/site-packages/nfldb/update.py
sansbacon commented 7 years ago

If you want data before 2009, you can purchase it from armchairanalysis for $49 - they go back to 2000. It looks like the owner used VBA to parse the HTML pages from the pre-gamecenter JSON days.