andrewgiessel / basketballcrawler

DEFUNCT - This is a python module to scrape basketball-reference.com and convert various stats into usable data structures for analysis

buildPlayerDictionary is failing (out-of-date?) #2

Closed brian-k closed 6 years ago

brian-k commented 10 years ago

Building the player dictionary does not seem to produce a complete dictionary: the function captures only 155 players. My hunch is that basketball-reference changed its page structure, which breaks the scraping in buildPlayerDictionary().

After running the code:

import warnings
warnings.filterwarnings('ignore')
import basketballCrawler as bc
import pandas
playerJson = bc.buildPlayerDictionary()
bc.savePlayerDictionary(playerJson, "/path/to/playerJson.bk.json")

The resulting JSON is:

{
"Mike James": {
"overview_url_content": "Mike James NBA & ABA Stats | Basketball-Reference.comvar sr_gzipEnabled = false; var sr_js_loader = new Array();\n(function () {var sr_css_file = 'http://d2ft4b0ve1aur1.cloudfront.net/css-416/sr-bbr-min.css';if (sr_gzipEnabled) {sr_css_file = 'http://d2ft4b0ve1aur1.cloudfront.net/css-416/sr-bbr-min-gz.css';}var head = document.getElementsByTagName(\"head\")[0];if (head) {var scriptStyles = document.createElement(\"link\");scriptStyles.rel = \"stylesheet\";scriptStyles.type = \"text/css\";scriptStyles.href = sr_css_file;head.appendChild(scriptStyles);}}());\n\n/* * JS Redirection Mobile\n*\n* Developed and presumably Copyright by\n* Sebastiano Armeli-Battana (@sebarmeli) - http://www.sebastianoarmelibattana.com\n* Release under the MIT licence */\nif(!window.SA){window.SA={}}SA.redirection_mobile=function(f){var j=function(g){var b=new Date();b.setTime(b.getTime()+g);return b};var k=function(b){if(!b){return}var i=document.location.search,p=i&&i.substring(1).split(\"&\"),l=0,n=p.length;for(;l<n;l++){var g=p[l],m=g&&g.substring(0,g.indexOf(\"=\"));if(m===b){return g.substring(g.indexOf(\"=\")+1,g.length)}}};var d=navigator.userAgent.toLowerCase(),fls=\"false\",tr=\"true\",host=document.location.host,f=f||{},mobile_protocol=f.mobile_scheme?f.mobile_scheme+\":\":document.location.protocol,mobile_prefix=f.mobile_prefix||\"m\",mobile_url=f.mobile_url,cookie_hours=f.cookie_hours||1,redirection_param=f.redirection_paramName||\"mobile\",redirection_param_long=f.redirection_paramName_long||\"mobile_long\",mobile_host=mobile_url||mobile_prefix+\".\",queryValue=k(redirection_param),queryValue_long=k(redirection_param_long),isUAMobile=false;if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows 
(ce|phone)|xda|xiino|mobile.+firefox/i.test(d)||/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt |kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g 
|nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i.test(d.substr(0,4))){isUAMobile=true}if(queryValue_long===fls){if(window.localStorage){window.localStorage.setItem(redirection_param_long,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*24*31*cookie_hours).toUTCString()}}else{if((queryValue_long===tr)||(queryValue===tr)){if(window.localStorage){window.localStorage.setItem(redirection_param_long,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours).toUTCString()}if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours*7*24*31).toUTCString()}}else{if(document.referrer.indexOf(mobile_host)>=0||queryValue===fls){if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*cookie_hours).toUTCString()}}}}var a=(window.sessionStorage)?(window.sessionStorage.getItem(redirection_param)===fls):false,e=(window.localStorage)?(window.localStorage.getItem(redirection_param_long)===fls):false,c=document.cookie?(document.cookie.indexOf(redirection_param)>=0):false;if(isUAMobile&&!(c||a||e)){document.location.href=mobile_protocol+\"//\"+mobile_host}};SA.redirection_mobile({mobile_url:\"m.bkref.com/m?p=XXplayersXXjXXjamesmi01.html\",redirection_paramName:\"mobile\",redirection_paramName_long:\"mobile_long\",cookie_hours:6});\nvar googletag = googletag || {};googletag.cmd = googletag.cmd || [];(function() {var gads = document.createElement('script');gads.async = true;gads.type = 'text/javascript';var useSSL = 'https:' == document.location.protocol;gads.src = (useSSL ? 
'https:' : 'http:') + '//www.googletagservices.com/tag/js/gpt.js';var node = document.getElementsByTagName('script')[0];node.parentNode.insertBefore(gads, node);})();   \nSports-Reference:\n Baseball \u00b7\n Basketball\n(college) \u00b7\n Football\n(college) \u00b7\n Hockey \u00b7\n Olympics \u00b7\n S-R Blog \u00b7\n Question or Comment?\n\n\n\n\n\n\u00a0\n\nLOGIN\nLOGOUT\nSPONSOR\nAD FREE\n\u00a0\n\n\n\n\n\n\n\n\n\nTips\n\n\n\n\n\n\n\n\n\nplay indexbox scoresplayersteamsseasonscoachesleadersawardsplayoffsdraftolympicsmore [+]Mobile Site You Are Here\u00a0>\u00a0BBR Home\u00a0>\u00a0Players\u00a0>\u00a0J\u00a0>\u00a0Mike JamesNews: s-r blog:2014-15 NBA Schedules Added \n\n\n\n\ngoogletag.cmd.push(function() {\ngoogletag.defineSlot('/5702/yb_Sports_Reference', [[300,250]], 'restofsite_300x250_atf1').addService(googletag.pubads());\n\ngoogletag.enableServices();\n});\n\n\n\ngoogletag.cmd.push(function() { googletag.display('restofsite_300x250_atf1'); });\n\n\n\nMike James\n\n\n\n\nMichael Lamont James (Pit Bull)\u00a0\u25aa\u00a0Twitter: @mikejames7\n\nPosition: Point Guard\u00a0\u25aa\u00a0Shoots: RightHeight: 6-2\u00a0\u25aa\u00a0Weight: 188 lbs.\nBorn: June 23, 1975 in Copaigue, New York\nHigh School: Amityville Memorial in Amityville, New York\nCollege: Duquesne University\nNBA Debut: December 23, 2001\nExperience: 12 yearsD-League: 9 G, 20.6 PPG, 3.1 RPG, 4.4 APG (Full Record)\n\n12\n\n\u20097\u2009\n\n13\n\n\u20097\u2009\n\n13\n\n13\n\n13\n\n13\n\n\u20097\u2009\n\n\u20095\u2009\n\n\u20095\u2009\n\n13\n\n\u20098\u2009\n\n13\n\n\nOther SR Links: College Basketball at Sports-Reference.com\n\n\n\ngoogletag.cmd.push(function() {\ngoogletag.defineSlot('/5702/yb_Sports_Reference', [728,90], 'restofsite_728x90_atf1').addService(googletag.pubads());\n\ngoogletag.enableServices();\n});\n\n\n\ngoogletag.cmd.push(function() { googletag.display('restofsite_728x90_atf1'); });\n\n\n\n\nPromote your website or business by sponsoring this page for $40 on 
Basketball-Reference.com.  Your message will replace this ad.\n\n\n\n\n\nMike James\nReg. Seas.\n\nTotals\nPer Game\nPer 36 Minutes\nAdvanced\n\n\nPlayoffs\n\nTotals\nPer Game\nPer 36 Minutes\nAdvanced\n\n\nOther\n\nPlayer News\nSim Scores\nCollege\nLeaderboard\nTransactions\nSalaries\nContract\n\n\nGame Logs\n\n2013-14\n2012-13\n2011-12\n2009-10\n2008-09\n2007-08\n2006-07\n2005-06\n2004-05\n2003-04\n2002-03\n2001-02\n\n\nSplits\n\n2013-14\n2012-13\n2011-12\n2009-10\n2008-09\n2007-08\n2006-07\n2005-06\n2004-05\n2003-04\n2002-03\n2001-02\nCareer\n\n\nShooting\n\n2013-14\n2012-13\n2011-12\n2009-10\n2008-09\n2007-08\n2006-07\n2005-06\n2004-05\n2003-04\n2002-03\n2001-02\n\n\nLineups\n\n2013-14\n2012-13\n2011-12\n2009-10\n2008-09\n2007-08\n2006-07\n2005-06\n2004-05\n2003-04\n2002-03\n2001-02\n\n\nOn/Off\n\n2013-14\n2012-13\n2011-12\n2009-10\n2008-09\n2007-08\n2006-07\n2005-06\n2004-05\n2003-04\n2002-03\n2001-02\nCareer\n\n\nFinders\n\nGame Finder\nStreak Finder\nShot Finder\nEvent Finder\nLineup Finder\nPlus/Minus Finder\n\n\n\n\n\nPlayer News\nAdd Your Blog Posts Here\n                           \u00b7 Player News Archive\n                           \u00b7 Player News RSS Feed\n \u00b7 Hide Stories\n\n\n\n9/23 HoopsRumors: Where 2013/14 10-Day Signees Are Today:  More players signed 10-day contracts last season than in any of...\n9/21 HoopsRumors: Trade Retrospective: Vince Carter To Nets:  In the wake of the blockbuster deal that sent Kevin Love to the...\n8/20 HoopsRumors: Undrafted Players In The NBA:  August and September are the months when undrafted players step...\n8/15 HoopsRumors: The Oldest NBA Players Currently Under Contract:  They say 40 is the new 30 and there are a number of NBA free agents...\n8/14 The Smoking Cuban: Perceptions and Expectations of Mavericks:  Think back to the start of the 2013-2014 season. How did you approach...\n\n\n\n\nTotals\n\n\n\n\n\n\n\nSeason\n", 
"overview_url": "http://www.basketball-reference.com/players/j/jamesmi01.html", 
"gamelog_data": null, 
"gamelog_url_list": [
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2014/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2013/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2012/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2010/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2009/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2008/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2007/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2006/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2005/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2004/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2003/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2002/"
]
}, 
"Wayne Ellington": {
"overview_url_content": "Wayne Ellington NBA & ABA Stats | Basketball-Reference.comvar sr_gzipEnabled = false; var sr_js_loader = new Array();\n(function () {var sr_css_file = 'http://d2ft4b0ve1aur1.cloudfront.net/css-416/sr-bbr-min.css';if (sr_gzipEnabled) {sr_css_file = 'http://d2ft4b0ve1aur1.cloudfront.net/css-416/sr-bbr-min-gz.css';}var head = document.getElementsByTagName(\"head\")[0];if (head) {var scriptStyles = document.createElement(\"link\");scriptStyles.rel = \"stylesheet\";scriptStyles.type = \"text/css\";scriptStyles.href = sr_css_file;head.appendChild(scriptStyles);}}());\n\n/* * JS Redirection Mobile\n*\n* Developed and presumably Copyright by\n* Sebastiano Armeli-Battana (@sebarmeli) - http://www.sebastianoarmelibattana.com\n* Release under the MIT licence */\nif(!window.SA){window.SA={}}SA.redirection_mobile=function(f){var j=function(g){var b=new Date();b.setTime(b.getTime()+g);return b};var k=function(b){if(!b){return}var i=document.location.search,p=i&&i.substring(1).split(\"&\"),l=0,n=p.length;for(;l<n;l++){var g=p[l],m=g&&g.substring(0,g.indexOf(\"=\"));if(m===b){return g.substring(g.indexOf(\"=\")+1,g.length)}}};var d=navigator.userAgent.toLowerCase(),fls=\"false\",tr=\"true\",host=document.location.host,f=f||{},mobile_protocol=f.mobile_scheme?f.mobile_scheme+\":\":document.location.protocol,mobile_prefix=f.mobile_prefix||\"m\",mobile_url=f.mobile_url,cookie_hours=f.cookie_hours||1,redirection_param=f.redirection_paramName||\"mobile\",redirection_param_long=f.redirection_paramName_long||\"mobile_long\",mobile_host=mobile_url||mobile_prefix+\".\",queryValue=k(redirection_param),queryValue_long=k(redirection_param_long),isUAMobile=false;if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows 
(ce|phone)|xda|xiino|mobile.+firefox/i.test(d)||/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt |kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g 
|nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i.test(d.substr(0,4))){isUAMobile=true}if(queryValue_long===fls){if(window.localStorage){window.localStorage.setItem(redirection_param_long,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*24*31*cookie_hours).toUTCString()}}else{if((queryValue_long===tr)||(queryValue===tr)){if(window.localStorage){window.localStorage.setItem(redirection_param_long,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours).toUTCString()}if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours*7*24*31).toUTCString()}}else{if(document.referrer.indexOf(mobile_host)>=0||queryValue===fls){if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*cookie_hours).toUTCString()}}}}var a=(window.sessionStorage)?(window.sessionStorage.getItem(redirection_param)===fls):false,e=(window.localStorage)?(window.localStorage.getItem(redirection_param_long)===fls):false,c=document.cookie?(document.cookie.indexOf(redirection_param)>=0):false;if(isUAMobile&&!(c||a||e)){document.location.href=mobile_protocol+\"//\"+mobile_host}};SA.redirection_mobile({mobile_url:\"m.bkref.com/m?p=XXplayersXXeXXellinwa01.html\",redirection_paramName:\"mobile\",redirection_paramName_long:\"mobile_long\",cookie_hours:6});\nvar googletag = googletag || {};googletag.cmd = googletag.cmd || [];(function() {var gads = document.createElement('script');gads.async = true;gads.type = 'text/javascript';var useSSL = 'https:' == document.location.protocol;gads.src = (useSSL ? 
'https:' : 'http:') + '//www.googletagservices.com/tag/js/gpt.js';var node = document.getElementsByTagName('script')[0];node.parentNode.insertBefore(gads, node);})();   \nSports-Reference:\n Baseball \u00b7\n Basketball\n(college) \u00b7\n Football\n(college) \u00b7\n Hockey \u00b7\n Olympics \u00b7\n S-R Blog \u00b7\n Question or Comment?\n\n\n\n\n\n\u00a0\n\nLOGIN\nLOGOUT\nSPONSOR\nAD FREE\n\u00a0\n\n\n\n\n\n\n\n\n\nTips\n\n\n\n\n\n\n\n\n\nplay indexbox scoresplayersteamsseasonscoachesleadersawardsplayoffsdraftolympicsmore [+]Mobile Site You Are Here\u00a0>\u00a0BBR Home\u00a0>\u00a0Players\u00a0>\u00a0E\u00a0>\u00a0Wayne EllingtonNews: s-r blog:2014-15 NBA Schedules Added \n\n\n\n\ngoogletag.cmd.push(function() {\ngoogletag.defineSlot('/5702/yb_Sports_Reference', [[300,250]], 'restofsite_300x250_atf1').addService(googletag.pubads());\n\ngoogletag.enableServices();\n});\n\n\n\ngoogletag.cmd.push(function() { googletag.display('restofsite_300x250_atf1'); });\n\n\n-->\n\ngoogletag.cmd.push(function() {\ngoogletag.defineSlot('/5702/yb_Sports_Reference', [[300,250]], 'restofsite_300x250_atf1').addService(googletag.pubads());\n\ngoogletag.enableServices();\n});\n\n\n\ngoogletag.cmd.push(function() { googletag.display('restofsite_300x250_atf1'); });\n\n\n\n\n/* * JS Redirection Mobile\n*\n* Developed and presumably Copyright by\n* Sebastiano Armeli-Battana (@sebarmeli) - http://www.sebastianoarmelibattana.com\n* Release under the MIT licence */\nif(!window.SA){window.SA={}}SA.redirection_mobile=function(f){var j=function(g){var b=new Date();b.setTime(b.getTime()+g);return b};var k=function(b){if(!b){return}var i=document.location.search,p=i&&i.substring(1).split(\"&\"),l=0,n=p.length;for(;l<n;l++){var g=p[l],m=g&&g.substring(0,g.indexOf(\"=\"));if(m===b){return g.substring(g.indexOf(\"=\")+1,g.length)}}};var 
d=navigator.userAgent.toLowerCase(),fls=\"false\",tr=\"true\",host=document.location.host,f=f||{},mobile_protocol=f.mobile_scheme?f.mobile_scheme+\":\":document.location.protocol,mobile_prefix=f.mobile_prefix||\"m\",mobile_url=f.mobile_url,cookie_hours=f.cookie_hours||1,redirection_param=f.redirection_paramName||\"mobile\",redirection_param_long=f.redirection_paramName_long||\"mobile_long\",mobile_host=mobile_url||mobile_prefix+\".\",queryValue=k(redirection_param),queryValue_long=k(redirection_param_long),isUAMobile=false;if(/(android|bb\\d+|meego).+mobile|avantgo|bada\\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\\.(browser|link)|vodafone|wap|windows (ce|phone)|xda|xiino|mobile.+firefox/i.test(d)||/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\\-(n|u)|c55\\/|capi|ccwa|cdm\\-|cell|chtm|cldc|cmd\\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\\-s|devi|dica|dmob|do(c|p)o|ds(12|\\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\\-|_)|g1 u|g560|gene|gf\\-5|g\\-mo|go(\\.w|od)|gr(ad|un)|haie|hcit|hd\\-(m|p|t)|hei\\-|hi(pt|ta)|hp( i|ip)|hs\\-c|ht(c(\\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\\-(20|go|ma)|i230|iac( |\\-|\\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\\/)|klon|kpt |kwc\\-|kyo(c|k)|le(no|xi)|lg( g|\\/(k|l|u)|50|54|\\-[a-w])|libw|lynx|m1\\-w|m3ga|m50\\/|ma(te|ui|xo)|mc(01|21|ca)|m\\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\\-| |o|v)|zz)|mt(50|p1|v 
)|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\\-2|po(ck|rt|se)|prox|psio|pt\\-g|qa\\-a|qc(07|12|21|32|60|\\-[2-7]|i\\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\\-|oo|p\\-)|sdk\\/|se(c(\\-|0|1)|47|mc|nd|ri)|sgh\\-|shar|sie(\\-|m)|sk\\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\\-|v\\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\\-|tdg\\-|tel(i|m)|tim\\-|t\\-mo|to(pl|sh)|ts(70|m\\-|m3|m5)|tx\\-9|up(\\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|yas\\-|your|zeto|zte\\-/i.test(d.substr(0,4))){isUAMobile=true}if(queryValue_long===fls){if(window.localStorage){window.localStorage.setItem(redirection_param_long,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*24*31*cookie_hours).toUTCString()}}else{if((queryValue_long===tr)||(queryValue===tr)){if(window.localStorage){window.localStorage.setItem(redirection_param_long,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours).toUTCString()}if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,tr)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(-3600*1000*cookie_hours*7*24*31).toUTCString()}}else{if(document.referrer.indexOf(mobile_host)>=0||queryValue===fls){if(window.sessionStorage){window.sessionStorage.setItem(redirection_param,fls)}else{document.cookie=redirection_param+\"=\"+h+\";expires=\"+j(3600*1000*cookie_hours).toUTCString()}}}}var 
a=(window.sessionStorage)?(window.sessionStorage.getItem(redirection_param)===fls):false,e=(window.localStorage)?(window.localStorage.getItem(redirection_param_long)===fls):false,c=document.cookie?(document.cookie.indexOf(redirection_param)>=0):false;if(isUAMobile&&!(c||a||e)){document.location.href=mobile_protocol+\"//\"+mobile_host}};SA.redirection_mobile({mobile_url:\"m.bkref.com/m?p=XXplayersXXeXXellinwa01.html\",redirection_paramName:\"mobile\",redirection_paramName_long:\"mobile_long\",cookie_hours:6});\nvar googletag = googletag || {};googletag.cmd = googletag.cmd || [];(function() {var gads = document.createElement('script');gads.async = true;gads.type = 'text/javascript';var useSSL = 'https:' == document.location.protocol;gads.src = (useSSL ? 'https:' : 'http:') + '//www.googletagservices.com/tag/js/gpt.js';var node = document.getElementsByTagName('script')[0];node.parentNode.insertBefore(gads, node);})();   \nSports-Reference:\n Baseball \u00b7\n Basketball\n(college) \u00b7\n Football\n(college) \u00b7\n Hockey \u00b7\n Olympics \u00b7\n S-R Blog \u00b7\n Question or Comment?\n\n\n\n\n\n\u00a0\n\nLOGIN\nLOGOUT\nSPONSOR\nAD FREE\n\u00a0\n\n\n\n\n\n\n\n\n\nTips\n\n\n\n\n\n\n\n\n\nplay indexbox scoresplayersteamsseasonscoachesleadersawardsplayoffsdraftolympicsmore [+]Mobile Site You Are Here\u00a0>\u00a0BBR Home\u00a0>\u00a0Players\u00a0>\u00a0E\u00a0>\u00a0Wayne EllingtonNews: s-r blog:2014-15 NBA Schedules Added \n\n\n\n\ngoogletag.cmd.push(function() {\ngoogletag.defineSlot('/5702/yb_Sports_Reference', [[300,250]], 'restofsite_300x250_atf1').addService(googletag.pubads());\n\ngoogletag.enableServices();\n});\n\n\n\ngoogletag.cmd.push(function() { googletag.display('restofsite_300x250_atf1'); });\n\n\n", 
"overview_url": "http://www.basketball-reference.com/players/e/ellinwa01.html", 
"gamelog_data": null, 
"gamelog_url_list": [
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2014/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2013/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2012/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2010/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2009/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2008/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2007/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2006/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2005/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2004/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2003/", 
"http://www.basketball-reference.com/players/j/jamesmi01/gamelog/2002/"
]
...

I can submit the entire JSON that I captured. Additionally, it only took a few minutes (not 10-15) for buildPlayerDictionary() to run from AWS.
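
A quick sanity check on the saved file can make this kind of report more concrete. The sketch below assumes the dictionary has the shape shown in the excerpt above (player name mapped to overview_url and gamelog_url_list); the helper names player_id and check_player_dictionary are hypothetical and not part of basketballCrawler. It counts the players and flags any entry whose game-log URLs carry a different player ID than its overview URL. In the excerpt, the Wayne Ellington entry lists jamesmi01 game-log URLs, so it would be flagged; whether that comes from the crawler or from how the output was pasted is not clear from the thread.

```python
import json
import re


def player_id(url):
    """Return the ID segment of a player URL, e.g. 'jamesmi01', or None."""
    match = re.search(r"/players/[a-z]/([^/.]+)", url)
    return match.group(1) if match else None


def check_player_dictionary(players):
    """Return (player_count, names whose game-log URLs belong to another player)."""
    mismatched = []
    for name, info in players.items():
        own_id = player_id(info["overview_url"])
        # gamelog_url_list may be missing or null, so fall back to an empty list.
        gamelog_ids = {player_id(u) for u in info.get("gamelog_url_list") or []}
        if gamelog_ids - {own_id}:
            mismatched.append(name)
    return len(players), mismatched


# Usage, assuming the file was written by bc.savePlayerDictionary:
#     with open("/path/to/playerJson.bk.json") as handle:
#         count, mismatched = check_player_dictionary(json.load(handle))
#     print(count, mismatched)
```

A count far below the expected number of players, or a non-empty mismatch list, is a hint that the scrape did not finish cleanly.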

titipata commented 9 years ago

I think the scraper scrapes too fast, so it eventually got blocked by the site. That would explain why you got only 155 players and why it stopped after only a few minutes. You can use the lynx browser to test whether your Amazon EC2 instance is blocked by the site.
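
If rate limiting is the cause, spacing out the requests and recording refusals would make it visible. This is a minimal sketch using only the standard library, not the way basketballCrawler fetches pages; fetch_page, fetch_politely, and the 3-second default delay are assumptions chosen for illustration. HTTP 403 and 429 responses typically indicate that a site is refusing or throttling a client.

```python
import time
import urllib.request
from urllib.error import HTTPError


def fetch_page(url):
    """Download one page and return its text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, fetch=fetch_page, delay_seconds=3.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    Returns (pages, blocked): pages maps each URL to its content, and
    blocked lists (url, status_code) for requests that raised an HTTP error.
    """
    pages = {}
    blocked = []
    for index, url in enumerate(urls):
        if index:
            # Pause before every request except the first.
            sleep(delay_seconds)
        try:
            pages[url] = fetch(url)
        except HTTPError as error:
            blocked.append((url, error.code))
    return pages, blocked
```

A run that ends with many entries in blocked would support the blocking theory, while an empty list would point back at a change in the page structure.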

andrewgiessel commented 9 years ago

@titipata Have you re-built the json yourself recently? Can you confirm that it works? I can give it a shot tonight, but haven't rebuilt the json for a long time.

@brian-k @titipata A great PR would be to update the json file in the repo.

It wouldn't surprise me if the website changed. Scrapers are fragile...

titipata commented 9 years ago

@andrewgiessel I rebuilt the JSON file and it works. I got around 600 players in the JSON file.

As for the JSON file, I'm not sure it's a good idea to update it on GitHub, since it's pretty big and would make cloning the repository heavy. Is there a place to put the latest JSON file? We could host it elsewhere and let people download it with a Python command, e.g. bs.download(/path/to/directory).

andrewgiessel commented 9 years ago

Hm, good point about the size. But we already have a ~12 MB file in the repo. I'm not suggesting we keep it up to date, but it is a bit old. Maybe we should delete the file from the repo, put the newest version in a public gist, and have code that pulls it down.
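
A rough sketch of what that pull-down code could look like, using only the standard library. The function name download_player_json, the players.json file name, and the URL in the usage comment are all hypothetical; no such gist or function exists in the repo as far as this thread shows.

```python
import json
import shutil
import urllib.request
from pathlib import Path


def download_player_json(url, dest_dir, filename="players.json"):
    """Download the player JSON once, cache it in dest_dir, and return it parsed."""
    dest = Path(dest_dir) / filename
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Stream the response to disk so a large file is not held in memory.
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    with open(dest, encoding="utf-8") as handle:
        return json.load(handle)


# Usage with a placeholder URL (not a real location):
#     players = download_player_json("https://example.com/players.json", "/path/to/directory")
```

Caching the file locally means the large download happens only once per machine, and the repository itself stays small.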

(Sent by email in reply to Titipat Achakulvisut's comment of Mon, Oct 5, 2015, 2:41 PM, which appears in full above.)