Open frandres opened 9 years ago
Will look into this! Huge thanks :)
I'm working on a solution for this. Chances are Stem's role will be limited. It's much easier to spawn a ton of Tor instances on different ports and connect through those. Stem can then be used to get a new IP if there aren't enough instances, as well as to sort relays by connection speed so that we can prefer the fastest ones.
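A minimal sketch of the multi-instance idea, assuming a `tor` binary on the PATH. The helper name `tor_commands`, the base ports, and the data directory layout are hypothetical and chosen only for illustration; `SocksPort`, `ControlPort`, and `DataDirectory` are standard torrc options that can also be given on the command line.

```python
import os
import tempfile


def tor_commands(count, base_socks=9050, base_control=9051, root=None):
    """Build one `tor` command line per instance, each with its own ports and data dir."""
    # Hypothetical layout: one sub-directory per instance under a temp folder.
    root = root or os.path.join(tempfile.gettempdir(), "tor-instances")
    commands = []
    for i in range(count):
        # Step by 2 so SOCKS and control ports never collide across instances.
        socks_port = base_socks + 2 * i
        control_port = base_control + 2 * i
        data_dir = os.path.join(root, "tor%d" % i)
        commands.append([
            "tor",
            "--SocksPort", str(socks_port),
            "--ControlPort", str(control_port),
            "--DataDirectory", data_dir,
        ])
    return commands


if __name__ == "__main__":
    for cmd in tor_commands(3):
        # Each list could be handed to subprocess.Popen to actually start Tor.
        print(" ".join(cmd))
```

Each instance needs its own data directory, otherwise the second process will probably refuse to start because the first one holds the lock.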
Both methods work pretty well: run a bunch of Tor instances, rotate IPs periodically if Google bans them temporarily, and front the whole bunch with HAProxy -> Privoxy.
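The rotation part could look roughly like the sketch below. This is a plain-Python stand-in for the balancing logic, not an HAProxy or Privoxy configuration; the class name `ProxyRotator`, the cooldown value, and the port numbers are all assumptions made for illustration.

```python
import itertools
import time


class ProxyRotator:
    """Round-robin over local SOCKS ports, skipping ones that were recently banned."""

    def __init__(self, ports, cooldown=600.0, clock=time.monotonic):
        self.ports = list(ports)
        self.cooldown = cooldown      # seconds to leave a banned port alone
        self.clock = clock            # injectable so the logic can be tested
        self.banned_until = {}
        self._cycle = itertools.cycle(self.ports)

    def mark_banned(self, port):
        """Record that the search engine blocked the exit behind this port."""
        self.banned_until[port] = self.clock() + self.cooldown

    def next_port(self):
        """Return the next usable port, or None if every port is cooling down."""
        for _ in range(len(self.ports)):
            port = next(self._cycle)
            if self.banned_until.get(port, float("-inf")) <= self.clock():
                return port
        return None
```

A caller would ask for `next_port()` before each query and call `mark_banned(port)` whenever a CAPTCHA or block page comes back.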
Frandres, it would be great if you could send your selenium_mode.py version with the Tor settings, along with any other files I'd need to try it. Thanks in advance.
Could you send me the new selenium_mode.py for Tor?
Using Tor Browser (https://www.torproject.org/projects/torbrowser.html.en) is a nice way to run queries with different IPs without having to set proxies. Since it is based on Firefox, it's relatively easy to set up with GoogleScraper. You just have to load Firefox using a profile with Tor's configuration, which means changing this line: self.webdriver = webdriver.Firefox()
in the _get_Firefox method in the selenium_mode.py file to
The nice thing is that you can request a new IP whenever you want by running:
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
        controller.signal(Signal.HUP)
Meaning that whenever your script is caught as a robot you can request a new IP, load a new instance of Firefox and resume your scraping. Empirically this seems to work; the search engine catches you sometimes, but if you keep trying it eventually seems to get an IP that is not detected.
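The retry idea can be sketched as a small loop. The function name `scrape_with_retries` and both callbacks are hypothetical: `run_query` stands for whatever runs one search and returns None when the engine flags the client as a robot, and `request_new_identity` stands for the NEWNYM signal plus a browser restart.

```python
def scrape_with_retries(run_query, request_new_identity, max_attempts=5):
    """Call run_query until it succeeds, asking for a new identity after each block.

    Returns (results, attempts_used); results is None if every attempt was blocked.
    """
    for attempt in range(1, max_attempts + 1):
        results = run_query()
        if results is not None:
            return results, attempt
        # Blocked: ask Tor for a fresh circuit and restart the browser before retrying.
        request_new_identity()
    return None, max_attempts


if __name__ == "__main__":
    # Simulated run: blocked twice, then a successful query.
    outcomes = iter([None, None, ["result"]])
    print(scrape_with_retries(lambda: next(outcomes), lambda: None))
```

Tor rate-limits NEWNYM requests, so a real loop would probably want a short pause between attempts.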
This might be a nice feature for future development. I can send my selenium_mode.py version to whoever wants to try this.
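For anyone who wants to experiment in the meantime, here is one possible way to point Selenium's Firefox at a running Tor Browser. Note that this is not the profile-based change described above: it sets the proxy preferences directly instead of loading Tor Browser's profile. The SOCKS port 9150 is an assumption (Tor Browser's usual default), and the helper names are made up for illustration.

```python
TOR_SOCKS_HOST = "127.0.0.1"
TOR_SOCKS_PORT = 9150  # assumption: Tor Browser's default SOCKS port


def tor_proxy_preferences(host=TOR_SOCKS_HOST, port=TOR_SOCKS_PORT):
    """Firefox preferences that route traffic (and DNS lookups) through a SOCKS proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_remote_dns": True,
    }


def make_tor_firefox():
    """Start Firefox through Selenium with the preferences above applied."""
    # Imported here so the preference helper stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in tor_proxy_preferences().items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Tor Browser has to be running for the SOCKS port to be available; with a standalone `tor` daemon the port would more likely be 9050.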