iliketowel opened this issue 10 years ago

I'm not sure if this data is available in the database (or if there is even a 'profile' table). But I notice that all the players have a link to their webpage (`profile_url`: http://www.nfl.com/player/playername/playerid/profile). I wanted to pull in the image that's connected for all players (http://static.nfl.com/static/content/public/static/img/getty/headshot/K/A/E/KAE371576.jpg for Colin Kaepernick, for example). I'm curious if there is something that isn't currently brought in that would have that information. I was hoping to use this in my dashboard.
`play_players` have a `profile_id` associated with them, from which you can piece together the URL:
```python
import nfldb

url_start = 'http://www.nfl.com/players/profile?id='

db = nfldb.connect()
q = nfldb.Query(db)
q.game(season_year=2014, season_type='Preseason', week=0)
for pp in q.limit(20).as_aggregate():
    print('%s: %s%s' % (pp.player, url_start, pp.player.profile_id))
```
```
Brian Moorman (BUF, P): http://www.nfl.com/players/profile?id=2502195
Josh Brown (NYG, K): http://www.nfl.com/players/profile?id=2505459
Eli Manning (NYG, QB): http://www.nfl.com/players/profile?id=2505996
Antrel Rolle (NYG, SS): http://www.nfl.com/players/profile?id=2506347
Mario Williams (BUF, DE): http://www.nfl.com/players/profile?id=2495982
Daniel Fells (NYG, TE): http://www.nfl.com/players/profile?id=2506619
Steve Weatherford (NYG, P): http://www.nfl.com/players/profile?id=2506821
Fred Jackson (BUF, RB): http://www.nfl.com/players/profile?id=2506871
Manny Lawson (BUF, DE): http://www.nfl.com/players/profile?id=2495885
Mathias Kiwanuka (NYG, DE): http://www.nfl.com/players/profile?id=2495879
Kyle Williams (BUF, DT): http://www.nfl.com/players/profile?id=2506931
Dan Carpenter (BUF, K): http://www.nfl.com/players/profile?id=2507401
Keith Rivers (BUF, OLB): http://www.nfl.com/players/profile?id=302
Dominique Rodgers-Cromartie (NYG, CB): http://www.nfl.com/players/profile?id=306
Mario Manningham (NYG, WR): http://www.nfl.com/players/profile?id=1030
Quintin Demps (NYG, FS): http://www.nfl.com/players/profile?id=1974
Zack Bowman (NYG, CB): http://www.nfl.com/players/profile?id=2507484
Kellen Davis (NYG, TE): http://www.nfl.com/players/profile?id=2507486
Landon Cohen (BUF, DE): http://www.nfl.com/players/profile?id=4499
Peyton Hillis (NYG, RB): http://www.nfl.com/players/profile?id=1980
```
I do not know how they decide the folder structure of the actual photos, though, so I think you'd have to scrape those yourself.
Yeah, it's interesting: the profile IDs are different from the photo IDs they are attached to. I would be able to bring it over directly if the ID in the image were the same. As it stands, I can bring over the page, but I have no way of tying the image to the player.
I agree this would be a nice addition. It's an easy but tedious change that requires:

1. `nflgame.update_players` to scrape the image URL.
2. The `nflgame.Player` class to add a new instance variable.
3. `nfldb` to support the new `nflgame` field. (Includes adding it to the `nfldb.Player` class and adding a new database column in `nfldb/db.py`; a rough sketch follows below.)

You're welcome to take a crack at it, otherwise you have two choices:
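To give a sense of what the database side of step 3 might involve, here is a minimal sketch; the migration function name, its version number, and the column name are all assumptions for illustration, not nfldb's actual code:

```python
# Purely illustrative: nfldb versions its schema in nfldb/db.py, so a new
# column would be added in a migration. The name _migrate_N and the column
# name profile_img_url are assumptions, not the real code.
def _migrate_N(c):
    # Add a nullable column so existing rows remain valid until backfilled.
    c.execute('''
        ALTER TABLE player ADD COLUMN profile_img_url text NULL
    ''')
```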
I should caution you: unlike statistical data, images are copyrighted content (as are logos and video footage). Therefore, it isn't a good idea to show them on a public web site. If it's just for your personal private use, then you're OK.
Thanks, I may give it a shot. One question, though: is this something I would have to pull from the URL, or would this also exist somewhere in the GSIS data?
I don't know what the "GSIS data" is. Do you mean the NFL.com gamecenter feed? There is virtually no player data in the JSON feed other than an abbreviated name (e.g., `T.Brady`) and a player GSIS ID. All other player metadata is scraped. The image URL would also have to be scraped.
@iliketowel I would very much encourage you to take a crack at it. I will happily mentor you through it. The easiest way is to log on to IRC/FreeNode at `#nflgame` and mention my nick, `burntsushi`. During the week, I'm usually on in the evening, 5-8/9pm EST. The weekend is hit or miss (this one is bad, next is better).
Otherwise, we could do it over the issue tracker or via email.
I'm going to give it a shot. I'll let you know when I run into issues, but it probably won't be until tomorrow at the earliest.
The photo URL isn't actually as cryptic as I once believed. Included in the source of the player profile page, residing right alongside the `GSIS ID`, is an `ESB ID`. This is the ID that is used to generate the photo URL.
For Dominique Rodgers-Cromartie, the following can be found:
```
<!--
Player Info
ESB ID: ROD616216
GSIS ID: 00-0026156
-->
```
The photo URL can then be constructed like so:
```python
import nfldb

db = nfldb.connect()

def photo_url(player, esb_id):
    photo_url_start = 'http://static.nfl.com/static/content/public/static/img/getty/headshot/'
    # The path uses the first three letters of the player's last name.
    a, b, c = player.last_name[:3].upper()
    return '%s%s/%s/%s/%s.jpg' % (photo_url_start, a, b, c, esb_id)

player, _ = nfldb.player_search(db, 'Dominique Rodgers-Cromartie')
print('%s: %s' % (player.full_name, photo_url(player, 'ROD616216')))
```
Testing this out sight unseen has been successful for a number of players, though your mileage may vary.
For example, Tom Brady's profile page (`pp.player.profile_url` = http://www.nfl.com/player/tombrady/2504211/profile) takes us to a page where we can see:
```
<!--
Player Info
ESB ID: BRA371156
GSIS ID: 00-0019596
-->
```
Grabbing that `ESB ID` and punching it into the above gives us:

```
Tom Brady: http://static.nfl.com/static/content/public/static/img/getty/headshot/B/R/A/BRA371156.jpg
```
which takes us right to Tom Terrific's beautiful mug.
It would probably be worthwhile to extract both pieces, in case there are some images that don't follow the pattern.
I've also seen the ESB ID used in other places (I think the XML gamebook files).
The ESB ID is equivalent to the actual profile ID on the NFL.com profile pages, and the link to the images with the ESB ID is actually quite simple: the image is always http://static.nfl.com/static/content/public/static/img/getty/headshot/(1st letter of last name)/(2nd letter of last name)/(3rd letter of last name)/ESBID.jpg.
What I've been struggling with is how to either pull that ID from the data the way the rest of the information is pulled in the script, or how to add the ESB ID into the script.
@ochawkeye URK! Beware. Both of those functions issue a new request. You really don't want to do that. Ideally you'd issue one request and retrieve all information possible.
@iliketowel I will try to write something up that will guide you. In the meantime, forget about nflgame. Instead, pick a profile page that has an image, read the documentation for `beautifulsoup4` and try to write a Python program that extracts the image URL from it. You should need to use `nflgame` at all for this:
```python
import bs4
import requests

# 'profile_url' is a placeholder; substitute a real player profile URL.
html = requests.get('profile_url').text
soup = bs4.BeautifulSoup(html)
# do stuff with soup (see the beautifulsoup4 docs for examples)
```
(That won't work verbatim; 'profile_url' is just a placeholder. I'm only sketching things out here.)
redacted :) Another example of knowing only enough to be dangerous!
> You should need to use `nflgame` at all for this:

I'm just confirming here: you mean "should not", right?
Whoops, sorry, yes you're right. Start without `nflgame`. Of course, we'll eventually get it back into `nflgame` (and `nfldb`) proper, but it will be simpler and less dangerous this way. :-)
(If you work on `nflgame-update-players` directly, then it isn't hard to end up launching thousands of requests to NFL.com. This is OK when you intend to do it, but doing it a lot accidentally is probably not a good idea.)
> @ochawkeye URK! Beware. Both of those functions issue a new request. You really don't want to do that. Ideally you'd issue one request and retrieve all information possible.
I know I'm over my skis here, but here's my new function to collect both the `GSIS ID` and `ESB ID`:
```python
def gsis_and_esb_ids(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None, None
    gid, esb = None, None
    m = re.search('GSIS\s+ID:\s+([0-9-]+)', content)
    n = re.search('ESB\s+ID:\s+([A-Z][A-Z][A-Z][0-9]+)', content)
    if m is not None:
        gid = m.group(1).strip()
    if n is not None:
        esb = n.group(1).strip()
    # Guard against a missed match before length-checking.
    if gid is not None and len(gid) != 10:
        gid = None
    if esb is not None and len(esb) != 9:
        esb = None
    return gid, esb


def run():
    ...
    if len(purls) > 0:
        eprint('Fetching GSIS and ESB identifiers for players not in nflgame...')

        def fetch(purl):
            gid, esb = gsis_and_esb_ids(purl)
            return purl, gid, esb

        for i, (purl, gid, esb) in enumerate(pool.imap(fetch, purls), 1):
            progress(i, len(purls))
```
That looks pretty reasonable, although I'd probably use a looser regex: `ESB\s+ID:\s+([A-Z0-9]+)`. In my experience, NFL.com isn't always terribly consistent with their identifiers...
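For concreteness, here is a quick demonstration of the looser pattern against the HTML comment format shown earlier in the thread (the sample content is copied from the Rodgers-Cromartie example above):

```python
import re

# Sample of the HTML comment found in a profile page's source.
content = '''
<!--
Player Info
ESB ID: ROD616216
GSIS ID: 00-0026156
-->
'''
m = re.search(r'ESB\s+ID:\s+([A-Z0-9]+)', content)
print(m.group(1))  # ROD616216
```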
> ```python
> import bs4
> import requests
>
> html = requests.get('profile_url').text
> soup = bs4.BeautifulSoup(html)
> # do stuff with soup (see the beautifulsoup4 docs for examples)
> ```
I'm clearly doing something wrong, because I get an error as soon as I try to do `import requests`:

```
>>> import requests
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    import requests
ImportError: No module named requests
```
I installed beautifulsoup4 when I installed nfldb, but is there some other sort of install I need to do separately?
When there is an `ImportError`, it means that Python cannot find a module with the name that you tried to import. Typically, this means you have not installed that module. (On occasion, it means your environment is misconfigured.)

In this scenario, it's likely that you simply haven't installed `requests`. In the Python world, we use a tool called `pip` to install and manage Python modules. `pip` is by default configured to install packages from PyPI. You can search for packages there: https://pypi.python.org/pypi --- try searching for `requests`.

Each search result is a package you can install. The package name is what you can use to install it with `pip`. This search is instructive because there are several related results, and the first result, `drequests`, is not the right one. Instead, you need to look at the description and see if it makes sense with respect to what you're trying to do. In this case, the description for `drequests` says that it is a web application framework. Are we building a web app? Nope. Next. OK, now we see `requests` and it says it is "Python HTTP for Humans." Not a terribly great description, but we are using it to download web pages, which works over the HTTP protocol. Plus, the package name `requests` matches the module name we want to import, `requests`. (This is not always true! For example, the `beautifulsoup4` package installs a module named `bs4`.)

So once we think we know the package we want, it's time to install it, just like you installed `nfldb`:
```
pip install requests
```
And then you should be able to run `python -c "import requests"` successfully.
So, I'm still on the first part. I got as far as this:

```python
from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    request = requests.get(url).content
    return BeautifulSoup(request)

url = "http://www.nfl.com/player/ejmanuel/2539228/profile"
soup = get_soup(url)
bimg = re.compile('.http://static.nfl.com/static/content/public/static/img/getty/headshot')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print link
```
Which prints the link:

```
<img height="90" onerror="if (this.src != 'http://i.nflcdn.com/static/site/img/sr_pic0.gif') {this.src='http://i.nflcdn.com/static/site/img/sr_pic0.gif'}" src="http://static.nfl.com/static/content/public/static/img/getty/headshot/M/A/N/MAN738705.jpg" width="65"/>
```

But I don't know how to pull out only the "MAN738705" (or 738705).
I think if you print `link['src']`, it will show you the URL. Then you can pull it out with a regex:

```python
import re

s = "http://static.nfl.com/static/content/public/static/img/getty/headshot/M/A/N/MAN738705.jpg"
m = re.search(r'([^/]+)\.[^/]+$', s)
print m.group(1)
```

Output: `MAN738705`.
Okay, so I'm not clear on the next step. I can do a static pull of the ID, but how do I do that dynamically for all of the players? I suspect it's something about the website, but I'm not sure what.
@iliketowel That piece you thankfully don't need to worry about; `nflgame/update_players.py` will actually do it for you.

The next step is to take the code you used to extract the ID from the HTML and merge it into `nflgame/update_players.py`. My guess is that you'll want to modify the `gsis_id` function so that it gets more than the GSIS ID from the page. For example, here's the current code:
```python
def gsis_id(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None
    m = re.search('GSIS\s+ID:\s+([0-9-]+)', content)
    if m is None:
        return None
    gid = m.group(1).strip()
    if len(gid) != 10:  # Can't be valid...
        return None
    return gid
```
Here's what you might want to do (notice the name change of the function!):
```python
def nfl_ids_for_player(profile_url):
    resp, content = new_http().request(profile_url, 'GET')
    if resp['status'] != '200':
        return None
    m = re.search('GSIS\s+ID:\s+([0-9-]+)', content)
    if m is None:
        return None
    gid = m.group(1).strip()
    if len(gid) != 10:  # Can't be valid...
        return None
    # Your code goes here...
    soup = ...
    esb_id = ...
    return {'gsis_id': gid, 'esb_id': esb_id}
```
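As a hedged sketch of what could replace the elided lines: one option is to skip BeautifulSoup and reuse the looser regex suggested earlier in the thread. The helper name below is purely illustrative:

```python
import re

def esb_id_from_content(content):
    # Illustrative helper (not part of nflgame): extract the ESB ID from
    # the profile page HTML using the looser regex from above.
    n = re.search(r'ESB\s+ID:\s+([A-Z0-9]+)', content)
    return n.group(1).strip() if n is not None else None
```

With that, `esb_id = esb_id_from_content(content)` would slot in where the ellipsis is.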
So at this point, I started going deeper (because you have to change the places where `gsis_id` is called to deal with the new return value), but I quickly realized that it is probably not a good use of your time. The `update_players.py` script is grossly over-complicated because it goes to dramatic lengths to keep the number of requests to NFL.com to a minimum. (During the season, running the script often results in no requests at all!)

If you could do the above and submit a pull request to the `nflgame` repository (not `nfldb`), then I think I'll be able to handle the rest. :-)
@iliketowel @BurntSushi I'm not sure where this ended up, or if it went offline or what, but I'm in the market for just this thing.
I know it's over 2 years old, but I'd be happy to help contribute where possible to get something working. For personal use, of course.
FWIW, I've been using this data locally in a Postgres DB and found a pretty straightforward way to inject the ESB IDs using some modification of the above code and psycopg2. From that, I can just apply a generic URL to have the avatars render wherever I query them. I'm not sure anyone's interested in my janky Python code, but the above references were super helpful in getting it working.
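A rough sketch of what that backfill might look like; this is not the commenter's actual code, and the `esb_id` column plus the `(gsis_id, esb_id)` pairs are assumed to come from the scraping steps above:

```python
import psycopg2

# Hypothetical backfill: write scraped ESB IDs into a custom esb_id column
# on nfldb's player table (the column is assumed to have been added
# beforehand). nfldb keys players by their GSIS ID (player_id).
conn = psycopg2.connect('dbname=nfldb user=nfldb')
pairs = [('00-0026156', 'ROD616216')]  # (gsis_id, esb_id) pairs from scraping
with conn:
    with conn.cursor() as cur:
        cur.executemany(
            'UPDATE player SET esb_id = %s WHERE player_id = %s',
            [(esb, gid) for gid, esb in pairs])
```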