NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

Proxy allocation to Selenium instances #87

Open TheFifthFreedom opened 9 years ago

TheFifthFreedom commented 9 years ago

I've noticed that in the loop that creates ScrapeWorkerFactory instances in core.py, there's an inner loop over every proxy in the given file (if one chooses to use one), which ends up creating far more browser instances than the num_workers limit in your config would suggest:

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    for worker in range(num_workers):
                        num_worker += 1
                        workers.put(
                            ScrapeWorkerFactory(
                                mode=method,
                                proxy=proxy,
                                search_engine=search_engine,
                                session=session,
                                db_lock=db_lock,
                                cache_lock=cache_lock,
                                scraper_search=scraper_search,
                                captcha_lock=captcha_lock,
                                progress_queue=q,
                                browser_num=num_worker
                            )
                        )

Not only is this not the behavior we want, it might also end up crashing your machine if you have a set of, say, 100 proxies. I believe one solution would be to remove the proxy loop entirely and pick a proxy on each iteration over num_workers (a quick back-of-the-envelope comparison follows the snippet below):

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for worker in range(num_workers):
                    num_worker += 1
                    proxy_to_use = proxies[worker % len(proxies)]
                    workers.put(
                        ScrapeWorkerFactory(
                            mode=method,
                            proxy=proxy_to_use,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )
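To make the difference concrete, here is a minimal standalone sketch with hypothetical numbers (3 search engines, 100 proxies, num_workers = 5; no GoogleScraper imports), showing how many browser instances each version would spawn:

    # Hypothetical numbers for illustration: 3 search engines, 100 proxies,
    # and num_workers = 5 in the config.
    search_engines = ['google', 'bing', 'yandex']
    proxies = ['proxy{}'.format(i) for i in range(100)]
    num_workers = 5

    # Current code: one browser per (search engine, proxy, worker slot).
    browsers_current = len(search_engines) * len(proxies) * num_workers
    print(browsers_current)   # 1500

    # Proposed code: one browser per (search engine, worker slot),
    # with proxies handed out round-robin via worker % len(proxies).
    assignments = [(engine, proxies[worker % len(proxies)])
                   for engine in search_engines
                   for worker in range(num_workers)]
    print(len(assignments))   # 15

With the round-robin pick, the proxy list only determines which proxy each worker gets, not how many workers are created.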

What do you think @NikolaiT ?

fassn commented 7 years ago

It's a bit of a late answer, but I just put my nose into this project.

I also do not really understand how @NikolaiT intended the num_workers setting to be used; as it stands, the threads are not being helpful in Selenium mode. I reworked the loop and the code that follows so that the threads no longer open all the browser windows at the same time, but only as many at once as specified by num_workers (a standalone sketch of this batching pattern is at the end of this comment).

            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    # for worker in range(num_workers):

                    num_worker += 1
                    workers.put(
                        ScrapeWorkerFactory(
                            config,
                            cache_manager=cache_manager,
                            mode=method,
                            proxy=proxy,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )

            # here we look for suitable workers
            # for all jobs created.
            for job in scrape_jobs:
                while True:
                    worker = workers.get()
                    if worker.is_suitabe(job):
                        worker.add_job(job)
                        workers.put(worker)
                        break

            threads = []

            while not workers.empty():
                worker = workers.get()
                thread = worker.get_worker()
                if thread:
                    threads.append(thread)

            # this is the old code 
            # for t in threads:
            #     t.join()
            # changed for the following:

            # start and join the threads in batches of num_workers, so that
            # at most num_workers browser windows are open at the same time.
            num_thread = 0

            while num_thread < len(threads):

                for t in threads[num_thread:num_thread + num_workers]:
                    t.start()

                for t in threads[num_thread:num_thread + num_workers]:
                    t.join()

                num_thread += num_workers

            # after threads are done, stop the progress queue.

It's already working much better in my opinion.
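For reference, here is a minimal, self-contained sketch of the same batching idea, with a hypothetical fake_scrape function standing in for the real Selenium workers produced by ScrapeWorkerFactory (names and timings are made up for illustration):

    import threading
    import time

    def fake_scrape(browser_num):
        # stand-in for the real Selenium scraping work
        time.sleep(0.1)
        print('browser {} done'.format(browser_num))

    num_workers = 5   # hypothetical config value
    threads = [threading.Thread(target=fake_scrape, args=(i,))
               for i in range(17)]

    # start and join the threads in batches of num_workers, so at most
    # num_workers browser windows are open at any given time
    for start in range(0, len(threads), num_workers):
        batch = threads[start:start + num_workers]
        for t in batch:
            t.start()
        for t in batch:
            t.join()

The slicing in steps of num_workers is what keeps the number of simultaneously open browser windows bounded; everything else is plain threading.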