DharmeshPandav opened this issue 9 years ago
Please find the debug log below.
Up to the point where all the threads are appended to the main thread list (threads), every thread has the proxy parameter associated with it (please check the image). Yet when the scraper starts a thread (which still holds the proxy parameter), it throws an error that the proxy parameter is missing.
I was not able to step into the start method to check how the threading is handled.
After some brainstorming and debugging, I came to the conclusion that the call to the method proxy_check() in the file scraping.py does not match the method's signature, proxy_check(self, proxy). The call in scraping.py at line 375 needs to change from

```python
if not self.proxy_check():
```

to

```python
if not self.proxy_check(self.proxy):
```
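The mismatch can be reproduced in isolation. This toy class (names are illustrative stand-ins, not GoogleScraper's real code) shows why the zero-argument call fails:

```python
class Scraper:
    """Toy stand-in for the real scraper class (illustrative only)."""

    def __init__(self, proxy):
        self.proxy = proxy

    def proxy_check(self, proxy):
        # The real method contacts an ip-info service; here we just
        # require that a proxy was actually handed in.
        return proxy is not None


s = Scraper('127.0.0.1:9050')

try:
    s.proxy_check()  # mirrors the buggy zero-argument call in scraping.py
except TypeError as err:
    print(err)       # proxy_check() missing 1 required positional argument: 'proxy'

print(s.proxy_check(s.proxy))  # True -- the corrected call works
```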
Weird. No matter what, it always uses my own IP address, and I can see the same in the Google ban page (503 result), which shows my original IP, not the IP address of the Tor exit node.
I am providing two proxies as input, one HTTP (Polipo) and the other Tor. From the command prompt I then stop these proxy services, and when I run GoogleScraper, instead of refusing the connection it still goes ahead and scrapes the pages using my own IP address (even though the use-own-ip flag is set to false), returning the 503 page as evidence.
I couldn't tweak socks.py or http_mode to make this work. Please help.
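One way to fail fast instead of silently falling back to your own IP is to verify that something is actually listening on the proxy's host and port before scraping starts. A minimal sketch (the function name is mine, and the Polipo/Tor ports shown are assumed defaults):

```python
import socket


def proxy_alive(host, port, timeout=3.0):
    """Return True if a TCP listener accepts connections at host:port."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Typical local setups: Polipo on 8123, Tor's SOCKS port on 9050 (assumed defaults).
for host, port in [('127.0.0.1', 8123), ('127.0.0.1', 9050)]:
    print(host, port, proxy_alive(host, port))
```

Running this after stopping the proxy services should print False for both, which is the point where a scraper could refuse to continue.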
I had the same issue. I fixed it by changing the following code in scraping.py:
```python
def before_search(self):
    """Things that need to happen before entering the search loop."""
    # check proxies first before anything
    if Config['SCRAPING'].getboolean('check_proxies', True) and self.proxy:
        # replaced this code: if not self.proxy_check():
        if not self.proxy_check(self.proxy):  # working code
            self.startable = False
```
This is the error that I get when doing that:

```
if Config['SCRAPING'].getboolean('check_proxies', True) and self.proxy:
NameError: name 'Config' is not defined
```
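The NameError suggests that in this version of GoogleScraper there is no module-level Config object in scope, and the settings are instead reached through the instance's own config attribute. A hedged sketch of that idea, using a plain dict stand-in (the class, method bodies, and key names here are illustrative guesses, not the project's real API):

```python
class SearchEngineScrape:
    """Minimal stand-in for the real scraper class (illustrative only)."""

    def __init__(self, config, proxy):
        self.config = config      # plain dict instead of the missing global Config
        self.proxy = proxy
        self.startable = True

    def proxy_check(self, proxy):
        # Placeholder: the real implementation contacts an ip-info URL.
        return proxy is not None

    def before_search(self):
        # Reading the flag from self.config avoids the NameError on Config.
        if self.config.get('check_proxies', True) and self.proxy:
            if not self.proxy_check(self.proxy):
                self.startable = False


worker = SearchEngineScrape({'check_proxies': True}, proxy=None)
worker.before_search()
print(worker.startable)  # True: no proxy configured, so there is nothing to check
```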
Hi! Have you found out how to fix this?
Any movement on this?
Hello, I have been investigating this issue since I stumbled upon the problem: proxies are passed just fine in 'selenium' mode, but not in 'http' mode.
It turns out, if I'm not wrong, that in http mode proxies are simply never passed to requests at all!
This also caused me some problems with the proxy table of the database when calling the proxy_check function: it was updating the hostname with a different value than the one provided in my proxy file. Indeed, it was pulling my own IP and using it to replace the proxy IP that should normally have been used.
This is the code in the proxy_check() function (note that no proxies argument is passed to requests.get):

```python
def proxy_check(self, proxy):
    assert self.proxy and self.requests, 'ScraperWorker needs valid proxy instance and requests library to make ' \
                                         'the proxy check.'

    online = False
    status = 'Proxy check failed: {host}:{port} is not used while requesting'.format(
        host=self.proxy.host, port=self.proxy.port)
    ipinfo = {}

    try:
        text = self.requests.get(self.config.get('proxy_info_url')).text
```
which I changed to the following:

```python
[...]
        text = self.requests.get(self.config.get('proxy_info_url'),
                                 proxies={self.proxy.proto: 'http://' + self.proxy.host + ':' + self.proxy.port}
                                 ).text
```
and the same goes for the request itself in the search() function:

```python
request = self.requests.get(self.base_search_url + urlencode(self.search_params),
                            headers=self.headers,
                            timeout=timeout)
```

which I changed to:

```python
request = self.requests.get(self.base_search_url + urlencode(self.search_params),
                            headers=self.headers,
                            timeout=timeout,
                            proxies={'https': 'https://' + self.proxy.host + ':' + self.proxy.port})
```
Note how I specified 'https' instead of 'http': in my case the proxies were still not passed correctly when keyed with 'http'. I'm a bit of a newbie and can't explain why; if someone has an explanation, I'd be glad to hear it. Anyway, I hope this helps!
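A likely explanation: requests selects a proxy by matching the keys of the proxies dict against the scheme of the URL being fetched. Google search URLs are https://, so an entry keyed 'http' is never consulted and requests connects directly, exposing your own IP. The lookup can be mimicked in a few lines (a simplified sketch of the real selection logic, not requests' actual code):

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Simplified version of how requests chooses a proxy:
    the dict key must match the scheme of the URL being fetched."""
    return proxies.get(urlparse(url).scheme)


proxies = {'http': 'http://127.0.0.1:8123'}

# An https:// search URL finds no matching key -> direct connection, own IP.
print(pick_proxy('https://www.google.com/search?q=test', proxies))  # None

# Keying the entry 'https' makes it apply to https:// URLs.
proxies['https'] = 'http://127.0.0.1:8123'
print(pick_proxy('https://www.google.com/search?q=test', proxies))  # 'http://127.0.0.1:8123'
```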
I have freshly installed GoogleScraper and am using it with a proxy file.
I have changed the following config file parameter:
I am running the GitHub project using run.py, and it gives the following error:
I was hacking through the code and injected some code to test whether the provided proxy file is being read correctly, inside the function parse_proxy_file(fname) in proxies.py. The object returned from there contains the records from the provided file, and the output is the same as the contents of the proxy file, so I guess the file is being read correctly. But after that it throws an error that the proxy object is null.
I am wondering why the proxy object is overwritten with null. Is there a problem, or is there a workaround where I need to adjust my config file, or does this need to be fixed in the next release?
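To find out where the proxy attribute is being overwritten with None, one debugging trick is to turn it into a property that prints a stack trace whenever None is assigned. A sketch with a toy class (graft the same property onto the real scraper class to locate the culprit call site):

```python
import traceback


class Worker:
    """Toy class; the property pair below can be attached to the real scraper."""

    @property
    def proxy(self):
        return getattr(self, '_proxy', None)

    @proxy.setter
    def proxy(self, value):
        if value is None:
            # Show exactly which call site is clearing the proxy.
            traceback.print_stack(limit=4)
        self._proxy = value


w = Worker()
w.proxy = '127.0.0.1:9050'   # normal assignment, silent
w.proxy = None               # prints the offending stack trace to stderr
```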
I am using