amirmarmor opened this issue 9 years ago
Might be the selectors. Try modifying them in parsing.py in the GoogleParser class
and let us know what works for you, so that I can incorporate it back. Thanks :)
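If you want to try out candidate selectors quickly before touching parsing.py, something like the sketch below works (requires `requests` and `parsel`). The CSS selectors in the loop are just placeholders, not necessarily what Google currently uses; inspect a live results page and substitute whatever matches the ad blocks there.

```python
# Quick check: which candidate ad selectors match anything on a Google SERP?
import requests
from parsel import Selector  # understands the same ::text / ::attr() suffixes

html = requests.get(
    'https://www.google.com/search',
    params={'q': 'car insurance'},
    headers={'User-Agent': 'Mozilla/5.0'},  # without a UA Google serves stripped-down markup
    timeout=10,
).text

sel = Selector(text=html)
# Placeholder selectors -- replace with whatever you find in the page source.
for candidate in ('#tads .ads-ad', '#tvcap .uEierd', '.commercial-unit-desktop-top'):
    print(candidate, '->', len(sel.css(candidate)), 'matches')
```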
I played around with the selectors and it is starting to work for me. Once I have the final, robust selectors I will post them.
Now I have two options: scraping in selenium (browser) mode or in plain HTTP mode.
My question is, what are the pros and cons? I guess the answer is that selenium will require fewer proxy IPs because it is less detectable. Am I right? Is this difference significant?
You can name your selector groups with whatever string you like; a good descriptive name is of course best:
```python
'ads_main': {
    'us_ip': {
        'container': '#b_results .b_ad',
        'result_container': '.sb_add',
        'link': 'h2 > a::attr(href)',
        'snippet': '.sb_addesc::text',
        'title': 'h2 > a::text',
        'visible_link': 'cite::text'
    },
    'ONLY_HTTP': {
        'container': '#b_results .b_ad',
        'result_container': '.sb_add',
        'link': 'h2 > a::attr(href)',
        'snippet': '.b_caption > p::text',
        'title': 'h2 > a::text',
        'visible_link': 'cite::text'
    }
}
```
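By the way, the ::attr() / ::text suffixes are the same syntax parsel understands, so you can sanity-check a selector set like this against a saved Bing results page before wiring it into parsing.py. Rough sketch only; the file name and loop structure are just for illustration:

```python
from parsel import Selector

# bing.html: a Bing results page saved from the browser ("Save page as...")
sel = Selector(text=open('bing.html', encoding='utf-8').read())

ads = sel.css('#b_results .b_ad')        # 'container'
for ad in ads.css('.sb_add'):            # 'result_container'
    print({
        'link':         ad.css('h2 > a::attr(href)').get(),
        'title':        ad.css('h2 > a::text').get(),
        'snippet':      ad.css('.b_caption > p::text').get(),
        'visible_link': ad.css('cite::text').get(),
    })
```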
> ... that selenium will require fewer proxy IPs because it is less detectable.
Not very significant with Bing, Baidu and the others, but very significant with Google (though they fixed that a while ago). A few months back I could scrape 10,000 keywords in 2 hours with Google in selenium mode; now they block you after 50 queries :)
It still works for Bing though; they don't have rate limits. I can scrape 500 keywords in a second with http-async mode. Do it if you feel frisky :D
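For a feel of what http-async mode boils down to, here is a rough aiohttp sketch: many requests in flight at once on a single event loop. The URL, params and keywords are just illustrative, not GoogleScraper's own code.

```python
import asyncio
import aiohttp

async def fetch(session, keyword):
    # One Bing SERP request per keyword; in async mode hundreds of these
    # can be in flight at the same time on one connection pool.
    async with session.get('https://www.bing.com/search',
                           params={'q': keyword},
                           headers={'User-Agent': 'Mozilla/5.0'}) as resp:
        return keyword, await resp.text()

async def main(keywords):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, kw) for kw in keywords))

pages = asyncio.run(main(['python asyncio', 'web scraping', 'aiohttp']))
print([(kw, len(html)) for kw, html in pages])
```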
I'm trying to scrape Google with simple HTTP requests, but no paid links (ads) are being collected. Is there a problem with the selectors, or is it something else?