NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

Proxy allocation to Selenium instances #87

Open TheFifthFreedom opened 9 years ago

TheFifthFreedom commented 9 years ago

I've noticed that in the loop that creates ScrapeWorkerFactory instances in core.py, there's an inner loop over every proxy in the given file (if one chooses to use one), which ends up creating far more browser instances than the num_workers limit in your config would suggest:

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    for worker in range(num_workers):
                        num_worker += 1
                        workers.put(
                            ScrapeWorkerFactory(
                                mode=method,
                                proxy=proxy,
                                search_engine=search_engine,
                                session=session,
                                db_lock=db_lock,
                                cache_lock=cache_lock,
                                scraper_search=scraper_search,
                                captcha_lock=captcha_lock,
                                progress_queue=q,
                                browser_num=num_worker
                            )
                        )

Not only is this not the behavior we want, it might also end up crashing your machine if you have a set of, say, 100 proxies. I believe one solution would be to remove the proxy loop entirely and pick a proxy on each iteration over num_workers (a quick back-of-the-envelope comparison follows the snippet below):

        # Let the games begin
        if method in ('selenium', 'http'):

            # Show the progress of the scraping
            q = queue.Queue()
            progress_thread = ShowProgressQueue(q, len(scrape_jobs))
            progress_thread.start()

            workers = queue.Queue()
            num_worker = 0
            for search_engine in search_engines:

                for worker in range(num_workers):
                    num_worker += 1
                    proxy_to_use = proxies[worker % len(proxies)]
                    workers.put(
                        ScrapeWorkerFactory(
                            mode=method,
                            proxy=proxy_to_use,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )
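To make the difference concrete, here is a minimal standalone sketch with hypothetical numbers (3 search engines, 100 proxies, num_workers = 5; no GoogleScraper imports), showing how many browser instances each version would spawn:

    # Hypothetical numbers for illustration: 3 search engines, 100 proxies,
    # and num_workers = 5 in the config.
    search_engines = ['google', 'bing', 'yandex']
    proxies = ['proxy{}'.format(i) for i in range(100)]
    num_workers = 5

    # Current code: one browser per (search engine, proxy, worker slot).
    browsers_current = len(search_engines) * len(proxies) * num_workers
    print(browsers_current)   # 1500

    # Proposed code: one browser per (search engine, worker slot),
    # with proxies handed out round-robin via worker % len(proxies).
    assignments = [(engine, proxies[worker % len(proxies)])
                   for engine in search_engines
                   for worker in range(num_workers)]
    print(len(assignments))   # 15

With the round-robin pick, the proxy list only determines which proxy each worker gets, not how many workers are created.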

What do you think @NikolaiT ?

fassn commented 7 years ago

It's a bit of a late answer, but I just put my nose into this project.

I also do not really understand how @NikolaiT intended the num_workers setting to be used; as it stands, the threads are not being helpful in Selenium mode. I reworked the loop and the code that follows so that the threads no longer open all the browser windows at the same time, but only as many at once as specified by num_workers (a standalone sketch of this batching pattern is at the end of this comment).

            num_worker = 0
            for search_engine in search_engines:

                for proxy in proxies:

                    # for worker in range(num_workers):

                    num_worker += 1
                    workers.put(
                        ScrapeWorkerFactory(
                            config,
                            cache_manager=cache_manager,
                            mode=method,
                            proxy=proxy,
                            search_engine=search_engine,
                            session=session,
                            db_lock=db_lock,
                            cache_lock=cache_lock,
                            scraper_search=scraper_search,
                            captcha_lock=captcha_lock,
                            progress_queue=q,
                            browser_num=num_worker
                        )
                    )

            # here we look for suitable workers
            # for all jobs created.
            for job in scrape_jobs:
                while True:
                    worker = workers.get()
                    if worker.is_suitabe(job):
                        worker.add_job(job)
                        workers.put(worker)
                        break

            threads = []

            while not workers.empty():
                worker = workers.get()
                thread = worker.get_worker()
                if thread:
                    threads.append(thread)

            # this is the old code 
            # for t in threads:
            #     t.join()
            # changed for the following:

            # start and join the threads in batches of num_workers, so that
            # at most num_workers browser windows are open at the same time.
            num_thread = 0

            while num_thread < len(threads):

                for t in threads[num_thread:num_thread + num_workers]:
                    t.start()

                for t in threads[num_thread:num_thread + num_workers]:
                    t.join()

                num_thread += num_workers

            # after threads are done, stop the progress queue.

It's already working much better in my opinion.
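For reference, here is a minimal, self-contained sketch of the same batching idea, with a hypothetical fake_scrape function standing in for the real Selenium workers produced by ScrapeWorkerFactory (names and timings are made up for illustration):

    import threading
    import time

    def fake_scrape(browser_num):
        # stand-in for the real Selenium scraping work
        time.sleep(0.1)
        print('browser {} done'.format(browser_num))

    num_workers = 5   # hypothetical config value
    threads = [threading.Thread(target=fake_scrape, args=(i,))
               for i in range(17)]

    # start and join the threads in batches of num_workers, so at most
    # num_workers browser windows are open at any given time
    for start in range(0, len(threads), num_workers):
        batch = threads[start:start + num_workers]
        for t in batch:
            t.start()
        for t in batch:
            t.join()

The slicing in steps of num_workers is what keeps the number of simultaneously open browser windows bounded; everything else is plain threading.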