BurntSushi / nfldb

A library to manage and update NFL data in a relational database.
The Unlicense

Link to images #44

Open iliketowel opened 10 years ago

iliketowel commented 10 years ago

I'm not sure if this data is available in the database (or if there is even a 'profile' table), but I notice that all the players have a link to their profile page (profile_url): http://www.nfl.com/player/playername/playerid/profile. I want to pull in the image that's connected to each player (for Colin Kaepernick, for example, http://static.nfl.com/static/content/public/static/img/getty/headshot/K/A/E/KAE371576.jpg). Is there something that isn't currently brought in that would have that information? I was hoping to use this in my dashboard.

ochawkeye commented 10 years ago

play_players have a profile_id associated with them that you could use to piece together the URL.

import nfldb

url_start = 'http://www.nfl.com/players/profile?id='

db = nfldb.connect()
q = nfldb.Query(db) 
q.game(season_year=2014, season_type='Preseason', week=0)
for pp in q.limit(20).as_aggregate():
    print('%s: %s%s' % (pp.player, url_start, pp.player.profile_id))
Brian Moorman (BUF, P): http://www.nfl.com/players/profile?id=2502195
Josh Brown (NYG, K): http://www.nfl.com/players/profile?id=2505459
Eli Manning (NYG, QB): http://www.nfl.com/players/profile?id=2505996
Antrel Rolle (NYG, SS): http://www.nfl.com/players/profile?id=2506347
Mario Williams (BUF, DE): http://www.nfl.com/players/profile?id=2495982
Daniel Fells (NYG, TE): http://www.nfl.com/players/profile?id=2506619
Steve Weatherford (NYG, P): http://www.nfl.com/players/profile?id=2506821
Fred Jackson (BUF, RB): http://www.nfl.com/players/profile?id=2506871
Manny Lawson (BUF, DE): http://www.nfl.com/players/profile?id=2495885
Mathias Kiwanuka (NYG, DE): http://www.nfl.com/players/profile?id=2495879
Kyle Williams (BUF, DT): http://www.nfl.com/players/profile?id=2506931
Dan Carpenter (BUF, K): http://www.nfl.com/players/profile?id=2507401
Keith Rivers (BUF, OLB): http://www.nfl.com/players/profile?id=302
Dominique Rodgers-Cromartie (NYG, CB): http://www.nfl.com/players/profile?id=306
Mario Manningham (NYG, WR): http://www.nfl.com/players/profile?id=1030
Quintin Demps (NYG, FS): http://www.nfl.com/players/profile?id=1974
Zack Bowman (NYG, CB): http://www.nfl.com/players/profile?id=2507484
Kellen Davis (NYG, TE): http://www.nfl.com/players/profile?id=2507486
Landon Cohen (BUF, DE): http://www.nfl.com/players/profile?id=4499
Peyton Hillis (NYG, RB): http://www.nfl.com/players/profile?id=1980

I do not know how they decide the folder structure of the actual photos, though, so I think you'd have to scrape those yourself.

iliketowel commented 10 years ago

Yeah, it's interesting: the profile IDs are different from the photo IDs they are attached to. I would be able to bring the image over directly if the ID in the image URL were the same. As it stands, I can bring over the page, but I have no way of tying the image to it.

BurntSushi commented 10 years ago

I agree this would be a nice addition. It's an easy but tedious change that requires:

  1. Modifying nflgame.update_players to scrape the image URL.
  2. Regenerating the full player database from scratch.
  3. Updating the nflgame.Player class to add a new instance variable.
  4. Updating nfldb to support the new nflgame field. (Includes adding it to nfldb.Player class and adding a new database column in nfldb/db.py.)

You're welcome to take a crack at it, otherwise you have two choices:

  1. Use the profile URL to scrape the images yourself.
  2. Wait for me to implement it. (It isn't on my path of things I want to do so I don't know when I'll do it.)

I should caution you: unlike statistical data, images are copyrighted content (as are logos and video footage). Therefore, it isn't a good idea to show them on a public web site. If it's just for your personal private use, then you're OK.

iliketowel commented 10 years ago

Thanks, I may give it a shot. One question though, is this something I would have to pull from the url, or would this also exist somewhere in the GSIS data?

BurntSushi commented 10 years ago

I don't know what the "GSIS data" is. Do you mean the NFL.com gamecenter feed? There is virtually no player data in the JSON feed other than an abbreviated name (e.g., T.Brady) and a player gsis id. All other player meta data is scraped. The image URL would also have to be scraped.

BurntSushi commented 10 years ago

@iliketowel I would very much encourage you to take a crack at it. I will happily mentor you through it. The easiest way is to log on to IRC/FreeNode at #nflgame and mention my nick burntsushi. During the week, I'm usually on in the evening, 5-8/9pm EST. The weekend is hit or miss (this one is bad, next is better).

Otherwise, we could do it over the issue tracker or via email.

iliketowel commented 10 years ago

I'm going to give it a shot. I'll let you know when I run into issues. But it probably won't be until tomorrow at the earliest.

ochawkeye commented 10 years ago

The photo URL isn't actually as cryptic as I once believed. Included in the source of the player profile page, residing right alongside the GSIS ID, is an ESB ID. This is the ID that is used to generate the photo URL.

For Dominique Rodgers-Cromartie, the following can be found:

    <!--
    Player Info
    ESB ID: ROD616216
    GSIS ID: 00-0026156
     -->

The photo URL can be built like this:

import nfldb
db = nfldb.connect()

def photo_url(player, esb_id):
    photo_url_start = 'http://static.nfl.com/static/content/public/static/img/getty/headshot/'
    a, b, c = player.last_name[:3].upper()
    return '%s%s/%s/%s/%s.jpg' % (photo_url_start, a, b, c, esb_id)

player, _ = nfldb.player_search(db, 'Dominique Rodgers-Cromartie')
print('%s: %s' % (player.full_name, photo_url(player, 'ROD616216')))

Testing this out sight unseen has been successful for a number of players, though your mileage may vary.

For example, Tom Brady's profile page (pp.player.profile_url = http://www.nfl.com/player/tombrady/2504211/profile) takes us to a page where we can see:

    <!--
    Player Info
    ESB ID: BRA371156
    GSIS ID: 00-0019596
     -->

Grabbing that ESB ID and punching it into the above gives us:

Tom Brady: http://static.nfl.com/static/content/public/static/img/getty/headshot/B/R/A/BRA371156.jpg

which takes us right to Tom Terrific's beautiful mug.

BurntSushi commented 10 years ago

It would probably be worthwhile to extract both pieces, in case there are some images that don't follow the pattern.

I've also seen the ESB id used in other places (I think the XML gamebook files).

iliketowel commented 10 years ago

The ESB ID is equivalent to the actual Profile_ID on the nfl.com profile pages, and the link to the images with the ESB ID is actually quite simple. The image is always at http://static.nfl.com/static/content/public/static/img/getty/headshot/(1st letter of last name)/(2nd letter of last name)/(3rd letter of last name)/ESBID.jpg

What I've been struggling with is how to either pull that ID from the data the way that you pull the rest of the information in the script, or how to add the ESBID into the script.
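The URL pattern described above can be sketched as a small helper. This is only a sketch: `headshot_url` and `HEADSHOT_BASE` are hypothetical names, and it assumes the three directory letters always match the first three characters of the ESB ID. That holds for the IDs quoted in this thread (KAE371576, ROD616216, BRA371156) but has not been checked for every player.

```python
HEADSHOT_BASE = 'http://static.nfl.com/static/content/public/static/img/getty/headshot'


def headshot_url(esb_id):
    """Build a headshot URL from an ESB ID such as 'BRA371156'.

    Assumption: the three directory letters are the first three
    characters of the ESB ID itself.
    """
    if not esb_id or len(esb_id) < 3:
        raise ValueError('ESB ID too short: %r' % (esb_id,))
    a, b, c = esb_id[:3].upper()
    return '%s/%s/%s/%s/%s.jpg' % (HEADSHOT_BASE, a, b, c, esb_id)


print(headshot_url('BRA371156'))
```

Deriving the letters from the ID avoids needing the player's last name as a second input. If some images turn out not to follow the pattern, storing the scraped image URL as well would be the safer route.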

BurntSushi commented 10 years ago

@ochawkeye URK! Beware. Both of those functions issue a new request. You really don't want to do that. Ideally you'd issue one request and retrieve all information possible.

@iliketowel I will try to write something up that will guide you. In the meantime, forget about nflgame. Instead, pick a profile page that has an image, read the documentation for beautifulsoup4, and try to write a Python program that extracts the image URL from it. You should need to use nflgame at all for this:

import bs4
import requests

html = requests.get('profile_url').read()
soup = bs4.soup(html)

# do stuff with soup (see beautifulsoup4 doco for examples)

(That won't work verbatim. I'm just sketching out pseudo code and I probably got the function names wrong.)

ochawkeye commented 10 years ago

redacted :) Another example of knowing only enough to be dangerous!

iliketowel commented 10 years ago

You should need to use nflgame at all for this:

I'm just confirming here, you mean "Should Not", right?

BurntSushi commented 10 years ago

Whoops, sorry, yes you're right. Start without nflgame. Of course, we'll eventually get it back into nflgame (and nfldb) proper, but it will be simpler and less dangerous this way. :-)

(If you work on nflgame-update-players directly, then it isn't hard to end up launching thousands of requests to NFL.com. This is OK when you intend to do it, but doing it a lot accidentally is probably not a good idea.)
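To illustrate the "one request per page" point, here is a minimal sketch of a cache that fetches each profile URL at most once. `PageCache` and `fake_fetch` are hypothetical names and not part of nflgame; the stand-in fetcher returns canned text so the example runs without touching NFL.com.

```python
class PageCache(object):
    """Fetch each URL at most once and reuse the body afterwards."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: url -> page content
        self._pages = {}         # url -> cached content
        self.requests_made = 0   # number of real fetches so far

    def get(self, url):
        if url not in self._pages:
            self.requests_made += 1
            self._pages[url] = self._fetch(url)
        return self._pages[url]


# Stand-in fetcher so the example runs offline; a real one would do HTTP.
def fake_fetch(url):
    return '<!-- ESB ID: BRA371156 GSIS ID: 00-0019596 -->'


cache = PageCache(fake_fetch)
url = 'http://www.nfl.com/player/tombrady/2504211/profile'
first = cache.get(url)    # triggers the one real fetch
second = cache.get(url)   # served from the cache
print(cache.requests_made)
```

With something like this in place, code that wants the GSIS ID and code that wants the ESB ID can both ask for the same page without doubling the traffic.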

ochawkeye commented 10 years ago

@ochawkeye URK! Beware. Both of those functions issue a new request. You really don't want to do that. Ideally you'd issue one request and retrieve all information possible.

I know I'm over my skis here, but my new function to collect both GSIS ID and ESB ID.

def gsis_and_esb_ids(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None, None
    gid, esb = None, None
    m = re.search(r'GSIS\s+ID:\s+([0-9-]+)', content)
    n = re.search(r'ESB\s+ID:\s+([A-Z][A-Z][A-Z][0-9]+)', content)
    if m is not None:
        gid = m.group(1).strip()
    if n is not None:
        esb = n.group(1).strip()
    # Check for None before len(); a missing ID would otherwise raise TypeError.
    if gid is not None and len(gid) != 10:
        gid = None
    if esb is not None and len(esb) != 9:
        esb = None
    return gid, esb

def run():
...
    if len(purls) > 0:
        eprint('Fetching GSIS and ESB identifiers for players not in nflgame...')

        def fetch(purl):
            gid, esb = gsis_and_esb_ids(purl)
            return purl, gid, esb
        for i, (purl, gid, esb) in enumerate(pool.imap(fetch, purls), 1):
            progress(i, len(purls))

BurntSushi commented 10 years ago

That looks pretty reasonable, although I'd probably use a looser regex:

ESB\s+ID:\s+([A-Z0-9]+)

In my experience, NFL.com isn't always terribly consistent with their identifiers...
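A small sketch of what the looser pattern buys. The "irregular" ID below is made up purely for illustration (nothing in this thread shows a real ID of that shape), and `first_group` is a hypothetical helper.

```python
import re

STRICT = re.compile(r'ESB\s+ID:\s+([A-Z][A-Z][A-Z][0-9]+)')
LOOSE = re.compile(r'ESB\s+ID:\s+([A-Z0-9]+)')


def first_group(pattern, text):
    """Return the captured ID, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None


regular = 'ESB ID: ROD616216'
# Made-up ID with a digit in the prefix, only to show the difference.
irregular = 'ESB ID: R2D616216'

print(first_group(STRICT, regular), first_group(LOOSE, regular))
print(first_group(STRICT, irregular), first_group(LOOSE, irregular))
```

Both patterns capture the regular ID. Only the loose one captures the irregular one, so an unexpected format is stored rather than silently dropped.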

iliketowel commented 10 years ago

import bs4
import requests

html = requests.get('profile_url').read()
soup = bs4.soup(html)

# do stuff with soup (see beautifulsoup4 doco for examples)

I'm clearly doing something wrong. Because I get an error as soon as I try to do

import requests
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    import requests
ImportError: No module named requests

I installed beautifulsoup4 when I installed nfldb, but is there some other sort of install I need to do separately?

BurntSushi commented 10 years ago

When there is an ImportError, it means that Python cannot find a module with the name that you tried to import. Typically, this means you have not installed that module. (On occasion, it means your environment is misconfigured.)

In this scenario, it's likely that you simply haven't installed requests. In the Python world, we use a tool called pip to install and manage Python modules. pip is by default configured to install packages from PyPI. You can search for packages there: https://pypi.python.org/pypi --- Try searching for requests.

Each search result is a package you can install. The package name is what you can use to install it with pip. This search is instructive because there are several related results and the first result, drequests, is not the right one. Instead, you need to look at the description and see if it makes sense with respect to what you're trying to do. In this case, the description for drequests says that it is a web application framework. Are we building a web app? Nope. Next. OK, now we see requests and it says it is "Python HTTP for Humans." Not a terribly great description, but we are using it to download web pages, which works over the HTTP protocol. Plus, the package name requests matches the module name we want to import, requests. (This is not always true!!!!)

So once we think we know the package we want, it's time to install it, just like you installed nfldb:

pip install requests

And then you should be able to run python -c "import requests" successfully.
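If you'd rather check from inside Python, the following sketch reports whether a module can be found before you try to import it. It assumes Python 3.4 or newer (for `importlib.util.find_spec`) and only handles top-level module names; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `import module_name` would find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ('json', 'requests', 'bs4'):
    status = 'available' if is_installed(name) else 'missing'
    print('%s: %s' % (name, status))
```

Note that the name you check is the module name, which is not always the package name you give to pip: the bs4 module comes from the beautifulsoup4 package.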

iliketowel commented 10 years ago

So, I'm still on the first part. I got as far as this:

from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    request = requests.get(url).content
    return BeautifulSoup(request)

url = "http://www.nfl.com/player/ejmanuel/2539228/profile"
soup = get_soup(url)
bimg = re.compile('.http://static.nfl.com/static/content/public/static/img/getty/headshot')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)

Which prints the link:

<img height="90" onerror="if (this.src != 'http://i.nflcdn.com/static/site/img/sr_pic0.gif') {this.src='http://i.nflcdn.com/static/site/img/sr_pic0.gif'}" src="http://static.nfl.com/static/content/public/static/img/getty/headshot/M/A/N/MAN738705.jpg" width="65"/>

But, I don't know how to pull out only the "MAN738705" (or 738705)?

BurntSushi commented 10 years ago

I think if you print link['src'], it will show you the URL. Then you can pull the ID out with a regex:

import re

s = "http://static.nfl.com/static/content/public/static/img/getty/headshot/M/A/N/MAN738705.jpg"
m = re.search(r'([^/]+)\.[^/]+$', s)
print(m.group(1))

Output: MAN738705.
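For comparison, here is a sketch that goes from the raw `<img>` markup to the ID using only the standard library. It swaps in `html.parser` for beautifulsoup4 and `urllib.parse`/`posixpath` for the regex, assumes Python 3, and uses hypothetical names (`HeadshotFinder`, `esb_id_from_url`). The sample markup is a trimmed copy of the tag printed earlier in the thread.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadshotFinder(HTMLParser):
    """Collect the src of every <img> whose URL contains '/headshot/'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == 'img':
            src = dict(attrs).get('src') or ''
            if '/headshot/' in src:
                self.urls.append(src)


def esb_id_from_url(url):
    """Return the file name without its extension, e.g. 'MAN738705'."""
    filename = posixpath.basename(urlparse(url).path)
    return posixpath.splitext(filename)[0]


html = ('<img height="90" width="65" src="http://static.nfl.com/static/content/'
        'public/static/img/getty/headshot/M/A/N/MAN738705.jpg"/>')
finder = HeadshotFinder()
finder.feed(html)
print([esb_id_from_url(u) for u in finder.urls])
```

Splitting the path instead of matching it with a regex means there is no pattern to get subtly wrong. Either approach should give the same ID for URLs shaped like the one above.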

iliketowel commented 10 years ago

Okay, so, I'm not clear on the next step. I have this ability to create a static pull of the ID, but how do I do that dynamically for all of the players? I suspect it's something about the website, but I'm not sure what.

BurntSushi commented 10 years ago

@iliketowel That piece you thankfully don't need to worry about. The nflgame/update_players.py script will actually do it for you.

The next step is to take the code you used to extract the ID from the HTML and merge it into nflgame/update_players.py. My guess is that you'll want to modify the gsis_id function so that it gets more than the GSIS ID from the page. For example, here's the current code:

def gsis_id(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None
    m = re.search('GSIS\s+ID:\s+([0-9-]+)', content)
    if m is None:
        return None
    gid = m.group(1).strip()
    if len(gid) != 10:  # Can't be valid...
        return None
    return gid

Here's what you might want to do: (notice the name change of the function!)

def nfl_ids_for_player(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None
    m = re.search('GSIS\s+ID:\s+([0-9-]+)', content)
    if m is None:
        return None
    gid = m.group(1).strip()
    if len(gid) != 10:  # Can't be valid...
        return None

    # Your code goes here...
    soup = ...
    esb_id = ...
    return {'gsis_id': gid, 'esb_id': esb_id}

So at this point, I started going deeper (because you have to change the places where gsis_id is called to deal with the new return value), but I quickly realized that it is probably not a good use of your time. The update_players.py script is grossly overcomplicated because it goes to dramatic lengths to keep the number of requests to NFL.com to a minimum. (During the season, running the script often results in no requests at all!)

If you could do the above and submit a pull request to the nflgame repository (not nfldb), then I think I'll be able to handle the rest. :-)
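One way the placeholder in nfl_ids_for_player might be filled in is sketched below. To keep it runnable without hitting NFL.com, the parsing is split into a hypothetical `nfl_ids_from_content` function that takes the page text directly. It also assumes the ESB ID is read from the "Player Info" comment with the looser regex suggested earlier, rather than from the image tag.

```python
import re


def nfl_ids_from_content(content):
    """Pull both IDs out of a profile page's HTML.

    Returns None when no plausible GSIS ID is found (mirroring gsis_id),
    otherwise a dict whose 'esb_id' may be None.
    """
    m = re.search(r'GSIS\s+ID:\s+([0-9-]+)', content)
    if m is None:
        return None
    gid = m.group(1).strip()
    if len(gid) != 10:  # same sanity check as the existing gsis_id
        return None

    m = re.search(r'ESB\s+ID:\s+([A-Z0-9]+)', content)
    esb_id = m.group(1) if m is not None else None
    return {'gsis_id': gid, 'esb_id': esb_id}


sample = """<!--
Player Info
ESB ID: BRA371156
GSIS ID: 00-0019596
 -->"""
print(nfl_ids_from_content(sample))
```

Inside nflgame, the request made by new_http() would supply `content`. Callers of the old gsis_id would then need to read `['gsis_id']` from the returned dict instead of using the return value directly.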

seanlhiggins commented 7 years ago

@iliketowel @BurntSushi I'm not sure where this ended up, or if it went offline or whut, but I'm in the market for just this thing.

I know it's over 2 years old, but I'd be happy to help contribute where possible to get something working. For personal use, of course.

seanlhiggins commented 7 years ago

FWIW I've been using this data just locally in a Postgres DB and found a pretty straightforward way to inject the ESB IDs using some modification of the above code and psycopg2. From that I can just apply a generic URL to have the avatars render wherever I query it. I'm not sure anyone's interested in my janky Python code, but the above references were super helpful in getting it working.