dcstats / CBBpy

A Python-based web scraper for NCAA basketball.
MIT License

Scraping always pauses and doesn't finish #51

Open jnmiller opened 4 months ago

jnmiller commented 4 months ago

Every time I try to scrape a season (men's), the process gets stuck and hangs. Ctrl-C always gives the same stack trace:

Getting data for season 2022
No games on 11/08/21:   4%|███▌                                                                              | 8 of 182 days scraped in 3.3 sec
Scraping 184 games on 11/09/21:   4%|███                                                                   | 8 of 182 days scraped in 204.8 sec
Traceback (most recent call last):
  File "<SNIP>/./scrape.py", line 30, in <module>
    infos, box_scores, pbps = scraper.get_games_season(season, info=True, box=False, pbp=False)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/mens_scraper.py", line 80, in get_games_season
    return _get_games_season(season, "mens", info, box, pbp)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 233, in _get_games_season
    info = _get_games_range(
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 186, in _get_games_range
    result = Parallel(n_jobs=cpus)(
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1707, in _retrieve
    time.sleep(0.01)
KeyboardInterrupt

Is the source site just detecting the scraping and blocking my IP address? Or is something else going on?

I can sometimes successfully scrape a very short date range (like a weekend) but immediately after a success, it stops working and hangs again.

dcstats commented 4 months ago

Will look into this, thanks!

jnmiller commented 4 months ago

It's sure looking like a bot detector - starting fresh (no attempts in the last 12-24h), it will scrape 100-250 games, then stop. I removed the joblib parallel loop to make it sequential, then ran the debugger and eventually got a request returning a 503. When I open that url in a browser, it also shows an error. But when I browse some other pages and try that page again later, it starts working both in the browser and the scraper (presumably it identifies my IP address as a human browsing again?).

Some mitigations might be retrying with a backoff after a 503, rate-limiting requests, or at least surfacing the error instead of hanging.

I could possibly contribute if time allows. In the meantime is this data downloadable in bulk anywhere (at least 2010-2024 seasons)? I've looked and haven't yet found a free source with that whole time span and including pbp.
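One such mitigation - exponential backoff on a 503 - could be sketched roughly like this. This is illustrative only: the `fetch` callable is injected so the sketch isn't tied to CBBpy's internals, and all names here are hypothetical.

```python
import random
import time

def get_with_retry(fetch, url, max_tries=5, base_delay=2.0):
    """Call fetch(url) -> (status, body) until it succeeds, backing off
    exponentially on a 503 (the status seen when the site throttles)."""
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        # exponential backoff with jitter: ~2s, ~4s, ~8s, ... before retrying
        time.sleep(base_delay * (2 ** attempt + random.random()))
    raise RuntimeError(f"still throttled after {max_tries} tries")
```

The key point is that a 503 becomes a pause-and-retry instead of an indefinite hang.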

dcstats commented 4 months ago

@jnmiller interesting... the scraper uses rotating headers that have helped with the bot detection to the point where I've never had it block any of my scrapes. I haven't had the chance to run it since you raised this issue, so it's definitely possible that they've added more robust bot detection, but I don't see any issues raised on the cousin package for R (ncaahoopR), so I'm thinking this might be something different. let me try scraping a season when I get a second, but in the meantime I do have some data I can send you. what's your email?
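For context, header rotation along the lines described above might look like the following minimal sketch. The User-Agent strings are illustrative examples, not CBBpy's actual pool.

```python
import random

# A small pool of desktop User-Agent strings; rotating among them per
# request is a common way scrapers avoid naive bot detection.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

def random_headers():
    """Build a fresh header dict with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```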

jnmiller commented 4 months ago

Thanks, that would be great! G-mail: jarednmiller

dcstats commented 4 months ago

@jnmiller sent. I scraped the 23-24 season last night without issue, so I'm not sure what could be causing this. I'll still add some of these mitigations, but I'll have to do some more digging to figure out why it pops up selectively

Mstolte02 commented 4 months ago

I am having the same issue unfortunately. Any chance you'd have data from 2017 to 2023 handy?

dcstats commented 4 months ago

@Mstolte02 @jnmiller could you both tell me what versions of python as well as the packages cbbpy, pandas, numpy, python-dateutil, pytz, tqdm, lxml, joblib, beautifulsoup4, and requests you're using? want to see if I can replicate this issue

@Mstolte02 what's your email? I can send you data

Mstolte02 commented 4 months ago

@.***


dcstats commented 4 months ago

@Mstolte02 github obfuscates email addresses - send me an email (you can find mine at the bottom of CBBpy's README) and I'll reply with the data

dgilmore33 commented 4 months ago

I'm on date 11/13/23 - looks like it just takes a long time.

@dcstats could you open & assign me the issue of speeding up the method? I could use multithreading and a rate limiter. Once you do, I'll email you at the CBBpy email.

Thanks for making this repo! Looking forward to working together :)
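The rate-limiter idea mentioned above could be as simple as enforcing a minimum interval between requests. A minimal sketch (hypothetical class, not part of CBBpy):

```python
import time

class RateLimiter:
    """Minimum-interval rate limiter: wait() blocks until at least
    `interval` seconds have passed since the previous call."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each request caps the request rate regardless of how many days are being scraped.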

dcstats commented 4 months ago

@dgilmore33 could you tell me what versions of python and the required packages you're using? I want to replicate this issue first, because locally I'm able to scrape entire seasons in around 30 minutes

dgilmore33 commented 4 months ago

@dcstats honestly I don't have an "issue", I'm used to long times to load data. I'll live.

Also, the more I think about it, better to keep a full season scrape at the current timeframe

Python version: 3.9.6

Packages (pip):

conda==23.7.4

dcstats commented 4 months ago

@dgilmore33 how long is scraping taking for you? if it's anything longer than 30 seconds per day, I think it's worth looking into speeding it up. I could also do something as simple as increasing the number of concurrently running jobs. I'm using multiprocessing, but you mentioned multithreading - would multithreading be better for this than multiprocessing?

crdarlin commented 4 months ago

Have you updated to the latest version? This might be the issue that was corrected with one of the recent bug fixes related to the requests package update.


dgilmore33 commented 4 months ago

@crdarlin I'll update requests now, thx for the tip

@dcstats multithreading hasn't provided a performance boost over multiprocessing in my experience, so I wouldn't expect it to here. I forked the repo so I can just change the # of workers.

Ultimately, I got my game_data for the regular season, so I should be fine updating it day-by-day until the end of the tourney. Thanks for the RE's!
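For reference on the threads-vs-processes question above: a stdlib sketch of thread-based parallelism for I/O-bound per-day scraping, where `scrape_day` is a hypothetical stand-in for CBBpy's per-day worker.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_day(day: str) -> str:
    # placeholder for the real per-day scrape; in practice this work is
    # network-bound (HTTP requests), not CPU-bound
    return f"scraped {day}"

days = ["11/08/21", "11/09/21", "11/10/21"]

# For I/O-bound work, threads are often competitive with processes:
# the GIL is released while waiting on the network, and threads skip
# the per-task process start-up and pickling costs that
# multiprocessing pays.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_day, days))
```

Whether this beats the current joblib multiprocessing setup would need benchmarking against a real season scrape.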

dcstats commented 4 months ago

For now, I'm gonna mark this as an issue for a future release so I can push some other fixes. @jnmiller if you're still experiencing hanging on the latest version of CBBpy, let me know what versions of python and the required packages you're using so I can try to replicate.