NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.64k stars 743 forks

Support for TOR browser #105

Open frandres opened 9 years ago

frandres commented 9 years ago

Using Tor Browser (https://www.torproject.org/projects/torbrowser.html.en) is a nice way to run queries from different IPs without having to set up proxies. Since it is based on Firefox, it's relatively easy to set up with GoogleScraper. You just have to load Firefox with a profile that uses Tor's proxy configuration, which means changing this line in the _get_Firefox method in the selenium_mode.py file:

            self.webdriver = webdriver.Firefox()

to

            # Route Firefox through the SOCKS proxy that Tor Browser opens on port 9150.
            profile = webdriver.FirefoxProfile()
            profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
            profile.set_preference('network.proxy.socks', '127.0.0.1')
            profile.set_preference('network.proxy.socks_port', 9150)
            self.webdriver = webdriver.Firefox(profile)

The nice thing is that you can request a new IP whenever you want by running:

    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
        controller.signal(Signal.HUP)

Meaning that whenever your script is flagged as a robot, you can request a new IP, load a new instance of Firefox and resume your scraping. Empirically this seems to work; the search engine still catches you sometimes, but if you keep trying you eventually seem to get an IP that is not detected.
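A rough sketch of that retry loop, in case it helps anyone trying this. The function name and the three callables (`fetch`, `looks_blocked`, `renew_identity`) are hypothetical and not part of GoogleScraper: `fetch` would load the results page in a fresh Firefox instance, `looks_blocked` would check for the captcha page, and `renew_identity` would send the NEWNYM signal. Tor rate-limits NEWNYM requests, so the sketch pauses between attempts; the 10 second default is an assumption.

```python
import time
from typing import Callable, Optional


def fetch_with_new_identity(
    fetch: Callable[[], str],
    looks_blocked: Callable[[str], bool],
    renew_identity: Callable[[], None],
    max_attempts: int = 5,
    wait_seconds: float = 10.0,
) -> Optional[str]:
    """Call fetch() until it returns a page that does not look blocked.

    Returns the page, or None if every attempt was blocked.
    """
    for _ in range(max_attempts):
        page = fetch()
        if not looks_blocked(page):
            return page
        # Blocked: ask Tor for a new identity, then give it time to
        # build a new circuit before the next attempt.
        renew_identity()
        time.sleep(wait_seconds)
    return None
```

Passing the three steps in as callables keeps the loop independent of selenium and stem, so it can be tried out with stand-in functions before wiring in the real browser and controller.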

This might be a nice feature for future development. I can send my selenium_mode.py version to whoever wants to try this.

NikolaiT commented 9 years ago

Will look into this! Huge thanks :)

neuegram commented 9 years ago

I'm working on a solution for this. Chances are stem use will be limited. It's much easier to spawn a ton of Tor instances with different ports and connect through those. Stem can then be used to get a new IP if the number of instances is not great enough, as well as to sort relays by connection speed so that we may prefer the fastest ones.
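To make the multi-instance idea concrete, here is a minimal sketch that only builds the command lines and rotates through the SOCKS ports; it is not the implementation being worked on. The helper names, the base port 9060 and the `/tmp/tor` data root are all assumptions. Each instance gets its own SocksPort, ControlPort and DataDirectory, passed as `tor` command-line options.

```python
import itertools
from typing import Iterator, List


def tor_commands(count: int, base_port: int = 9060,
                 data_root: str = "/tmp/tor") -> List[List[str]]:
    """Build one `tor` command line per instance, each on its own ports."""
    commands = []
    for i in range(count):
        socks_port = base_port + 2 * i  # even ports: SOCKS, odd ports: control
        commands.append([
            "tor",
            "--SocksPort", str(socks_port),
            "--ControlPort", str(socks_port + 1),
            "--DataDirectory", f"{data_root}/{i}",
        ])
    return commands


def socks_ports(count: int, base_port: int = 9060) -> Iterator[int]:
    """Cycle endlessly through the SOCKS ports of the instances above."""
    return itertools.cycle([base_port + 2 * i for i in range(count)])
```

Each command list can be handed to `subprocess.Popen`, and each port from `socks_ports` would go into the `network.proxy.socks_port` preference of a new Firefox profile.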

matthewford commented 9 years ago

Both methods work pretty well: run a bunch of Tor instances, rotate IPs periodically if Google bans them temporarily, and front the whole bunch with haproxy -> privoxy.

yh18190 commented 9 years ago

Frandres, it would be great if you could send your selenium_mode.py version with the Tor settings, along with any other files I would need to try it. Thanks in advance.

alon001 commented 7 years ago

Could you send me the new selenium_mode.py for TOR?