leadscloud closed this issue 9 years ago
This is quite tricky, because it's not obvious which keyword should be scraped with which Google IP address.
I thought about introducing a special keyword-file format:
Format:
"keyword " search_engine search_engine_url number_of_pages proxy
Example:
"love is good " google "http://216.58.209.154/search" 5 socks5:localhost:9050
"love is bad " yahoo "yahoo.com" 11 http:11.22.33.44:5433
"love may be good " bing "http://56.32.111.154/search" 5 socks4:localhost:80
This is useful because it lets you control the scraping process very easily. It is also simple to generate such a keyword file with any scripting language (Python, for example).
The normal format would also work (one keyword per line), but you can add as many options for every keyword as you wish.
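To make the proposal concrete, here is a rough sketch of how one line of such a file could be parsed. This is not GoogleScraper's actual code: `KeywordJob` and `parse_keyword_line` are hypothetical names, and stripping the space inside the quoted keyword is an assumption on my part.

```python
import shlex
from typing import NamedTuple, Optional


class KeywordJob(NamedTuple):
    """Hypothetical container for one line of the proposed keyword file."""
    keyword: str
    search_engine: Optional[str] = None
    search_engine_url: Optional[str] = None
    number_of_pages: Optional[int] = None
    proxy: Optional[str] = None


def parse_keyword_line(line: str) -> Optional[KeywordJob]:
    """Parse one line; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    if not line.startswith('"'):
        # Normal format: the whole line is the keyword, no options.
        return KeywordJob(keyword=line)
    # shlex honours the double quotes around the keyword and the URL.
    fields = shlex.split(line)
    if len(fields) > 5:
        raise ValueError("too many fields: %r" % line)
    # Pad missing trailing options with None so every option stays optional.
    engine, url, pages, proxy = fields[1:] + [None] * (5 - len(fields))
    return KeywordJob(
        keyword=fields[0].strip(),  # assumption: surrounding spaces are not significant
        search_engine=engine,
        search_engine_url=url,
        number_of_pages=int(pages) if pages is not None else None,
        proxy=proxy,
    )
```

A line without a leading double quote is treated as a plain keyword, so existing one-keyword-per-line files would keep working.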
What do you think?
I think that's a good idea. The keyword file should stay easy to write.
When Google is used, pick a random IP from google_ips.txt and set it as Google's base_url.
Do you have some Google IPs to fill such a file?
In China, Google's servers are blocked, but these IPs can be used: https://cb.e-fly.org/usr/uploads/2015/01/1460830014.txt
This site publishes Google IPs that are reachable from China: https://cb.e-fly.org/archives/goagent-iplist.html
You can now provide a file with many URLs for a search engine. GoogleScraper will pick one at random. See:
; In some countries the main search engine domain is blocked. Thus, search engines
; have different ip on which they are reachable. If you set a file with urls for the search engine,
; then GoogleScraper will pick a random url for any scraper instance.
; One url per line. It needs to be a valid url, not just an ip address!
; Example: google_ip_file: google_ips.txt
google_ip_file: kwfiles/google_ip.txt
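For illustration, the selection described in the config comment could look roughly like this. It is a sketch only, not the code GoogleScraper actually uses: `load_search_urls` and `pick_search_url` are made-up names, and treating "valid URL" as "has an http(s) scheme and a host" is my assumption.

```python
import random
from urllib.parse import urlparse


def load_search_urls(lines):
    """Return the valid http(s) URLs from an iterable of lines.

    Blank lines and bare IP addresses (no scheme) are skipped, since the
    config asks for a valid URL, not just an IP address.
    """
    urls = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


def pick_search_url(urls, rng=random):
    """Pick one base URL at random, e.g. once per scraper instance."""
    if not urls:
        raise ValueError("no valid search engine URLs available")
    return rng.choice(urls)
```

Since an open file object is an iterable of lines, it can be passed to `load_search_urls` directly, for example the file named by `google_ip_file`.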
This currently does not work in http-mode because of a bug in requests. I filed a report at https://github.com/kennethreitz/requests/issues/2404
There is another method for sending requests with a different IP: http://stackoverflow.com/questions/1150332/source-interface-with-python-and-urllib2
In C#, I know you can set the HTTP proxy to a Google IP. The request URL then stays the same, but the request goes to that other IP.
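The proxy trick described for C# might translate to Python's standard library roughly as below. This is only a sketch of the idea, not what GoogleScraper does: `opener_via_ip` is a hypothetical helper, port 80 is an assumption, and whether a given Google IP accepts proxy-style requests has not been verified here.

```python
import urllib.request


def opener_via_ip(ip, port=80):
    """Build an opener that routes plain-HTTP requests through ip:port.

    The request URL (and so the Host header) stays the same, but the TCP
    connection is made to the given IP instead of the resolved hostname.
    """
    proxy = "http://{}:{}".format(ip, port)
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)
```

Calling `opener_via_ip("216.58.209.154").open(url)` would then send the request for an unchanged `url` to that IP; only the `http` scheme is covered in this sketch.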
Now works with http-mode. Closing issue.
I know Google has thousands of IPs, for example:
216.58.209.154
74.125.198.199
Could the program support scraping through Google's IPs directly? For example:
http://216.58.209.154/search