erilu / web-scraping-NBA-statistics

Use Python to scrape ESPN for stats on all players in the NBA. Obtain and organize data, calculate statistics, and model using urllib, re, pandas, and scikit-learn.
https://erilu.github.io/web-scraping-NBA-statistics/

Something wrong with regex? #1

Closed · Johnzav888 closed this issue 3 years ago

Johnzav888 commented 3 years ago

Hi erilu,

First of all, thanks for the very nice and useful notebook! Really appreciate it...

Now, unfortunately, I cannot make it work. Specifically, when I run the re.findall() function, it runs without errors but the resulting dict comes back empty, which probably means it couldn't find anything on the website with the given regex. I checked the website and noticed some differences from what you show in the highlighted screenshot.

Maybe something changed on the website's end and this regex doesn't work anymore? Or, of course, it could be that I'm a total noob here...

Can you help me with that, please?

Thanks, John

erilu commented 3 years ago

Hi John,

Thanks for reaching out, and sorry for the really delayed response. I missed the issue notification.

I just ran each instance of re.findall() that I used in the notebook, and found the regex that didn't work. It was the player_regex for scraping player information from each roster page. As you suspected, the website was updated and now this regex is out of date. Nothing is retrieved by re.findall(), leaving the player_info dict empty.
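Here is a minimal sketch of that check (illustrative only, not the notebook's exact code: the team URL is just an example, and the old pattern is reconstructed as a raw string, differing from the fix below only in the URL scheme):

```python
import re
import urllib.request

# Example roster page; any current ESPN team roster shows the same behavior.
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
# Some sites reject urllib's default user agent, so send a browser-like one.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Old pattern: only accepts "http://" player links, which no longer appear.
old_player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
print(re.findall(old_player_regex, html))  # [] -> player_info ends up empty
```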

To fix this, use this updated regex:

player_regex = ('\{\"name\"\:\"(\w+\s\w+)\",\"href\"\:\"https?\://www\.espn\.com/nba/player/.*?\",(.*?)\}')

Looking closely, the new page source uses "https" instead of "http" for each player's webpage. To update the regex, I made it recognize either "http" or "https" by using https?; the ? makes the letter s optional. I'll update this repo's Python script with this change as well, so you can pull the updated version if you're still interested in trying it out.
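
For example, running re.findall() with the updated pattern against the same kind of roster page should now return matches (again just a sketch; the URL is illustrative, and the pattern is written as a raw string, equivalent to the one above):

```python
import re
import urllib.request

# Same illustrative roster page as in the sketch above.
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Updated pattern; "https?" accepts both "http" and "https" links.
player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'

# Each match is a (name, remaining fields) tuple, which the notebook then
# uses to populate the player_info dict.
players = re.findall(player_regex, html)
print(players[:3])
```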

Thank you for pointing this out! Your issue highlights the importance of crafting regexes that are more robust to website updates.

Best, Erick