NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

Ads Links not showing #75

Open amirmarmor opened 9 years ago

amirmarmor commented 9 years ago

I'm trying to scrape Google with simple HTTP requests. Results come back, but no paid links (ads) are being collected. Is there a problem with the selectors, or is it something else?

NikolaiT commented 9 years ago

Might be the selectors. Try modifying them in parsing.py in the GoogleParser class and let us know what works for you, so that I can incorporate it back. Thanks :)
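One quick way to iterate on selectors is to test candidates against a saved copy of a results page before touching parsing.py. A minimal stdlib sketch, counting containers by class; the class name `ads-ad` and the sample HTML are illustrative guesses, not the selectors GoogleScraper actually ships with:

```python
from html.parser import HTMLParser


class AdContainerCounter(HTMLParser):
    """Counts elements whose class attribute contains a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; split the class list
        classes = dict(attrs).get("class", "").split()
        if self.target_class in classes:
            self.count += 1


# Stand-in for a saved SERP snippet; replace with a real page dump.
html = '<div class="ads-ad"><h3><a href="/aclk?x">Ad</a></h3></div>'
counter = AdContainerCounter("ads-ad")
counter.feed(html)
print(counter.count)
```

If the count is zero against a real page dump, the container selector is stale and needs updating.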

amirmarmor commented 9 years ago

I played around with the selectors, and it's starting to work for me. Once I have the final, robust selectors I will post them.

And now I have two options:

  1. My selectors that work without selenium
  2. Your selectors that seem to be working with selenium.

My question is: what are the pros and cons? I guess the answer is that selenium will require fewer proxy IPs because it is less detectable. Am I right? Is this difference significant?

NikolaiT commented 9 years ago

You can name your selector variants with whatever string you like. A good description is of course best:

        'ads_main': {
            'us_ip': {
                'container': '#b_results .b_ad',
                'result_container': '.sb_add',
                'link': 'h2 > a::attr(href)',
                'snippet': '.sb_addesc::text',
                'title': 'h2 > a::text',
                'visible_link': 'cite::text'
            },
            'ONLY_HTTP': {
                'container': '#b_results .b_ad',
                'result_container': '.sb_add',
                'link': 'h2 > a::attr(href)',
                'snippet': '.b_caption > p::text',
                'title': 'h2 > a::text',
                'visible_link': 'cite::text'
            }
        }
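Note that the two variants above differ only in the `snippet` selector. A small sketch of how the shared fields could be factored out; the `variant` helper and the merging approach are my own, not part of GoogleScraper:

```python
# Fields shared by both selector variants (copied from the dict above).
BASE = {
    'container': '#b_results .b_ad',
    'result_container': '.sb_add',
    'link': 'h2 > a::attr(href)',
    'title': 'h2 > a::text',
    'visible_link': 'cite::text',
}


def variant(snippet_selector):
    """Build a full selector dict from the shared base plus a snippet rule."""
    return {**BASE, 'snippet': snippet_selector}


ads_main = {
    'us_ip': variant('.sb_addesc::text'),
    'ONLY_HTTP': variant('.b_caption > p::text'),
}

print(ads_main['ONLY_HTTP']['snippet'])
```

Keeping the variants flat, as GoogleScraper does, is also a reasonable choice: it makes each variant readable on its own.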

> that selenium will require fewer proxy IPs because it is less detectable.

Not very significant for Bing, Baidu and the others. Very significant with Google (but they fixed that a while ago). Some months back I could scrape 10000 keywords in 2 hours with Google in selenium mode. Now they block you after 50 queries :)

Still works for Bing though. They don't have rate limits. I can scrape 500 keywords in a second with http-async mode. Do it if you feel frisky :D
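The throughput win in http-async mode comes from keeping many keyword queries in flight at once instead of waiting on each round trip. A minimal sketch of the idea with `asyncio`; the HTTP round trip is simulated with `asyncio.sleep`, whereas GoogleScraper's async mode issues real non-blocking requests:

```python
import asyncio
import time


async def fetch_serp(keyword):
    # Stand-in for one HTTP round trip to the search engine.
    await asyncio.sleep(0.05)
    return f"results for {keyword}"


async def scrape_all(keywords):
    # All queries run concurrently; total time is roughly one round trip,
    # not the sum of all of them.
    return await asyncio.gather(*(fetch_serp(k) for k in keywords))


keywords = [f"kw{i}" for i in range(100)]
start = time.perf_counter()
results = asyncio.run(scrape_all(keywords))
elapsed = time.perf_counter() - start
print(len(results), round(elapsed, 2))
```

With real requests the ceiling is the target's tolerance, not your CPU, which is why Bing's lack of rate limiting matters more than raw client speed.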