erilu / web-scraping-NBA-statistics

Use Python to scrape ESPN for stats on all players in the NBA. Obtain and organize data, calculate statistics, and model using urllib, re, pandas, and scikit-learn.
https://erilu.github.io/web-scraping-NBA-statistics/

Something wrong with regex? #1

Closed · Johnzav888 closed this issue 3 years ago

Johnzav888 commented 3 years ago

Hi erilu,

First of all, thanks for the very nice and useful notebook! Really appreciate it...

Now, unfortunately, I cannot make it work. Specifically, when I run the re.findall() function, it runs without errors but the resulting dict comes back empty, which probably means it couldn't find anything on the website with the given regex. I checked the website and noticed some differences from what you show in the highlighted screenshot.

Maybe something changed on the website's end and this regex doesn't work anymore? Or, of course, it could be that I'm a total noob here...

Can you help me with that, please?

Thanks, John

erilu commented 3 years ago

Hi John,

Thanks for reaching out, and sorry for the really delayed response. I missed the issue notification.

I just ran each instance of re.findall() that I used in the notebook, and found the regex that didn't work. It was the player_regex for scraping player information from each roster page. As you suspected, the website was updated and now this regex is out of date. Nothing is retrieved by re.findall(), leaving the player_info dict empty.
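Here is a minimal sketch of that check (illustrative only, not the notebook's exact code: the team URL is just an example, and the old pattern is reconstructed as a raw string, differing from the fix below only in the URL scheme):

```python
import re
import urllib.request

# Example roster page; any current ESPN team roster shows the same behavior.
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
# Some sites reject urllib's default user agent, so send a browser-like one.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Old pattern: only accepts "http://" player links, which no longer appear.
old_player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
print(re.findall(old_player_regex, html))  # [] -> player_info ends up empty
```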

To fix this, use this updated regex:

player_regex = ('\{\"name\"\:\"(\w+\s\w+)\",\"href\"\:\"https?\://www\.espn\.com/nba/player/.*?\",(.*?)\}')

Looking closely, the new page source uses "https" instead of "http" for each player's webpage. To update the regex, I made it recognize either "http" or "https" by using https?; the ? makes the letter s optional. I'll update this repo's Python script with this change as well, so you can pull the updated version if you're still interested in trying it out.
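
For example, running re.findall() with the updated pattern against the same kind of roster page should now return matches (again just a sketch; the URL is illustrative, and the pattern is written as a raw string, equivalent to the one above):

```python
import re
import urllib.request

# Same illustrative roster page as in the sketch above.
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Updated pattern; "https?" accepts both "http" and "https" links.
player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'

# Each match is a (name, remaining fields) tuple, which the notebook then
# uses to populate the player_info dict.
players = re.findall(player_regex, html)
print(players[:3])
```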

Thank you for pointing this out! Your issue highlights the importance of crafting regexes that are more robust to website updates.

Best, Erick