NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

Can you support using Google's IPs to scrape? #52

Closed leadscloud closed 9 years ago

leadscloud commented 9 years ago

I know Google has thousands of IPs, e.g. 216.58.209.154 and 74.125.198.199.

Can you make the program support using Google's own IPs to scrape?

http://216.58.209.154/search

NikolaiT commented 9 years ago

This is quite tricky, because it's not obvious which keyword should be scraped with which Google IP address.

I thought about introducing a special keyword-file format:

Format:

"keyword " search_engine search_engine_url number_of_pages proxy 

Example:

"love is good " google "http://216.58.209.154/search" 5 socks5:localhost:9050
"love is bad " yahoo "yahoo.com" 11 http:11.22.33.44:5433
"love may be good " bing "http://56.32.111.154/search" 5 socks4:localhost:80

This is very good, because you can control the scraping process very easily. Also, it is very simple to create such a keyword file with any scripting language (e.g. Python).

The normal format would also work (one keyword per line), but you can add as many options for every keyword as you wish.
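
For illustration, such a keyword file could be generated with a few lines of Python (the output file name, keywords and proxies below are just made-up examples, not part of GoogleScraper):

entries = [
    ('love is good ', 'google', 'http://216.58.209.154/search', 5, 'socks5:localhost:9050'),
    ('love is bad ', 'yahoo', 'yahoo.com', 11, 'http:11.22.33.44:5433'),
]

# write one line per keyword: "keyword " search_engine search_engine_url number_of_pages proxy
with open('keywords.txt', 'w') as f:
    for keyword, engine, url, pages, proxy in entries:
        f.write('"{}" {} "{}" {} {}\n'.format(keyword, engine, url, pages, proxy))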

What do you think?

leadscloud commented 9 years ago

I think that's a good idea, and the keyword file should be easy to create.

When using Google, the program could pick a random IP from google_ips.txt and then set it as the Google base_url.

NikolaiT commented 9 years ago

Do you have some Google IPs to fill such a file?

leadscloud commented 9 years ago

In China, Google's servers are blocked, but these IPs can be used: https://cb.e-fly.org/usr/uploads/2015/01/1460830014.txt

https://cb.e-fly.org/archives/goagent-iplist.html (this site supplies Google IPs that are reachable from China).

NikolaiT commented 9 years ago

You can now provide a file with many URLs for a search engine. GoogleScraper will pick one randomly. See:

; In some countries the main search engine domain is blocked. Thus, search engines
; are reachable on different IPs. If you set a file with URLs for the search engine,
; then GoogleScraper will pick a random URL for each scraper instance.
; One URL per line. It needs to be a valid URL, not just an IP address!
; Example: google_ip_file: google_ips.txt

google_ip_file: kwfiles/google_ip.txt
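
Internally the random pick amounts to something like this (a rough sketch; the file path and variable names are only illustrative, not the actual GoogleScraper internals):

import random

# one valid search URL per line, e.g. http://216.58.209.154/search
with open('kwfiles/google_ip.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# each scraper instance gets a randomly chosen base url
base_url = random.choice(urls)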

Currently this does not work in http mode because of a bug in requests. I filed a report at https://github.com/kennethreitz/requests/issues/2404

leadscloud commented 9 years ago

There is another method to make requests from a different IP (binding the source interface): http://stackoverflow.com/questions/1150332/source-interface-with-python-and-urllib2

In C#, I know you can set the HTTP proxy to a Google IP; the request URL stays the same, but the IP that is actually contacted is a different one.
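
For illustration, the same idea with Python's requests library (a rough sketch; the IP is just the example from this thread, and whether a given Google frontend IP still answers like this is not guaranteed):

import requests

# Point the HTTP 'proxy' at a Google frontend IP: the request URL stays
# http://www.google.com/search, but the connection goes to the chosen IP.
proxies = {'http': 'http://216.58.209.154:80'}

resp = requests.get('http://www.google.com/search',
                    params={'q': 'test'},
                    proxies=proxies,
                    timeout=10)
print(resp.status_code)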

NikolaiT commented 9 years ago

Now works in http mode as well. Closing the issue.