NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

TypeError: proxy_check() missing 1 required positional argument: 'proxy' #94

Open DharmeshPandav opened 9 years ago

DharmeshPandav commented 9 years ago

I have freshly installed GoogleScraper and am using it with a proxy file.

I have changed the following config file parameters:

#config.cfg file
keywords: filetype:pdf pwn2own
use_own_ip: False
check_proxies: True
proxy_file:proxyfile.txt

I am running the GitHub project via run.py:

./run.py -s "google"  --num-pages-for-keyword 5  --output-filename output.json -v3

and it gives the following error:

TypeError: proxy_check() missing 1 required positional argument: 'proxy'
Exception in thread [google]HttpScrape:
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/home/jonsnow/Documents/Google Scraper/GoogleScraper/GoogleScraper/http_mode.py", line 302, in run
    super().before_search()
  File "/home/jonsnow/Documents/Google Scraper/GoogleScraper/GoogleScraper/scraping.py", line 375, in before_search
    if not self.proxy_check():
TypeError: proxy_check() missing 1 required positional argument: 'proxy'
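
(For context: this class of TypeError shows up whenever a method declared with a required positional parameter is called without one. A minimal standalone example, unrelated to GoogleScraper's code, reproduces the same message:)

# Minimal reproduction of the same class of error (not GoogleScraper code):
# an instance method that requires a 'proxy' argument, called without one.
class Scraper:
    def proxy_check(self, proxy):
        return proxy is not None

s = Scraper()
s.proxy_check()  # TypeError: proxy_check() missing 1 required positional argument: 'proxy'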

I was hacking through the code and injected some print statements to test whether the provided proxy file is being read correctly. Inside the function def parse_proxy_file(fname): of the proxies.py file, the object returned from there does contain the records from the provided file.

# modified parse_proxy_file to check if file is being parsed correctly or not
def parse_proxy_file(fname):
    """Parses a proxy file

    The format should be like the following:

        socks5 23.212.45.13:1080 username:password
        socks4 23.212.45.13:80 username:password
        http 23.212.45.13:80

        If username and password aren't provided, GoogleScraper assumes
        that the proxy doesn't need auth credentials.

    Args:
        fname: The file name where to look for proxies.

    Returns:
        The parsed proxies.

    Raises:
        ValueError if no file with the path fname could be found.
    """
    print ("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    print (fname)

    proxies = []
    path = os.path.join(os.getcwd(), fname)
    if os.path.exists(path):
        print ("Yes file exists")
        with open(path, 'r') as pf:
            for line in pf.readlines():
                print (line)
                if not (line.strip().startswith('#') or line.strip().startswith('//')):
                    tokens = line.replace('\n', '').split(' ')
                    try:
                        proto = tokens[0]
                        host, port = tokens[1].split(':')
                    except:
                        print("iam fcuked")
                        raise Exception(
                            'Invalid proxy file. Should have the following format: {}'.format(parse_proxy_file.__doc__))
                    if len(tokens) == 3:
                        username, password = tokens[2].split(':')
                        proxies.append(Proxy(proto=proto, host=host, port=port, username=username, password=password))
                    else:
                        print (proto)
                        print (host)
                        print (port)
                        proxies.append(Proxy(proto=proto, host=host, port=port, username='', password=''))

            for proxy in proxies:
                print (proxy.proto)
                print (proxy.host)
                print (proxy.port)
        print ("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        return proxies
    else:
        raise ValueError('No such file/directory')

and the output matches what is in the provided proxy file:


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
proxyfile.txt
Yes file exists
http 127.0.0.1:8123
http
127.0.0.1
8123

socks5 127.0.0.1:9050
socks5
127.0.0.1
9050

http
127.0.0.1
8123
socks5
127.0.0.1
9050
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

So I guess the file is being read correctly, but after that it throws an error that the proxy object is null.

I am wondering why the proxy object gets overwritten to null. Is this a problem in the code, is there some workaround where I need to adjust my config file, or does it need to be fixed in the next release?

I am using

DharmeshPandav commented 9 years ago

Please find the debug log below.

Up to the point where all the threads are appended to the main thread list (threads), every thread has the proxy parameter associated with it (please check the image).

And when the scraper tries to start a thread (which still has its proxy parameter), it throws the error that the proxy parameter is missing.

I was not able to step into the start method to check how the threading is handled.

(screenshot: error_source)
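
(Side note: the traceback begins in threading.py's _bootstrap_inner because Thread.start() only spawns the thread; the new thread's bootstrap code then calls run(), and any exception raised there, here inside before_search(), is reported by that bootstrap frame. A tiny sketch of the same flow, independent of GoogleScraper:)

import threading

class Worker(threading.Thread):
    def run(self):
        # An unhandled exception here is caught and reported by the thread's
        # bootstrap machinery, which is why such tracebacks begin in
        # threading.py's _bootstrap_inner rather than in start().
        raise TypeError('simulated failure inside run()')

t = Worker()
t.start()  # the thread reports "Exception in thread ..." with the traceback
t.join()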

DharmeshPandav commented 9 years ago

After a brainstorming debugging session, I came to the conclusion that:

the call to the method proxy_check() in scraping.py does not match the method's signature, proxy_check(self, proxy).

The call at line 375 of scraping.py needs to change from if not self.proxy_check(): to if not self.proxy_check(self.proxy):

DharmeshPandav commented 9 years ago

Weird. No matter what, it always behaves as if use_own_ip were True, and I can see the same in the Google ban page (the 503 result), which shows my original IP, not the IP of the Tor exit node.

I am providing two proxies as input, one HTTP (Polipo) and one SOCKS (Tor). Then, from the command prompt, I stop these proxy services, and when I re-run GoogleScraper, instead of refusing the connection it still goes ahead and scrapes the pages using my own IP address (while the use_own_ip flag is set to False) and returns the 503 page as evidence.

I couldn't tweak socks.py or http_mode to make this work. Please help.
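
(One way to verify, independently of GoogleScraper, whether traffic really leaves through a proxy is to ask an IP-echo service through the same proxy settings; with the proxy daemon stopped, the request should fail loudly instead of silently falling back to your own IP. A rough sketch, assuming the local Polipo HTTP proxy on 127.0.0.1:8123 from the proxy file above; httpbin.org/ip is just one example of an echo service, and SOCKS proxies would additionally need requests[socks] installed:)

import requests

# Hypothetical check script, not part of GoogleScraper: compare the IP an echo
# service sees with and without the proxy configured.
proxies = {
    'http': 'http://127.0.0.1:8123',
    'https': 'http://127.0.0.1:8123',
}

print('direct :', requests.get('https://httpbin.org/ip', timeout=10).json())
try:
    print('proxied:', requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10).json())
except requests.exceptions.RequestException as err:
    # With the proxy daemon stopped this fails instead of using your own IP.
    print('proxy unreachable:', err)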

neuegram commented 8 years ago

I had the same issue. I fixed it by changing the following line of code in scraping.py:

def before_search(self):
    """Things that need to happen before entering the search loop."""
    # check proxies first before anything
    if Config['SCRAPING'].getboolean('check_proxies', True) and self.proxy:
        # replaced this code: if not self.proxy_check():
        if not self.proxy_check(self.proxy):  # working code
            self.startable = False

derekcoleman commented 8 years ago

This is the error that I get when doing that:

if Config['SCRAPING'].getboolean('check_proxies', True) and self.proxy:
NameError: name 'Config' is not defined
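
(That NameError suggests a version of scraping.py where the module-level Config object no longer exists; the proxy_check() snippet later in this thread reads configuration through self.config instead. If that is the case for your checkout, a hedged guess at the equivalent guard, assuming self.config behaves like a dict here, would look like this:)

def before_search(self):
    """Things that need to happen before entering the search loop."""
    # Sketch for versions without a module-level Config: the check_proxies flag
    # is read from the instance's config instead (an assumption, not verified
    # against any particular GoogleScraper release).
    if self.config.get('check_proxies', True) and self.proxy:
        if not self.proxy_check(self.proxy):
            self.startable = False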

Stas-Buzunko commented 8 years ago

Hi! Have you found out how to fix this?

jmwoloso commented 7 years ago

Any movement on this?

fassn commented 7 years ago

Hello, I have been investigating this issue since I stumbled upon the problem: proxies are passed just fine in 'selenium' mode, but not in 'http' mode.

It turns out, if I'm not mistaken, that proxies are never passed to the requests calls in http mode at all!

Actually, this also caused me some problems with the proxy table of the database when calling the proxy_check function: it was updating the hostname with a different value than the one provided in my proxy file. Indeed, it was pulling my own IP and using it to replace the proxy IP that should normally have been used!

This is the code in the proxy_check() function:

def proxy_check(self, proxy):
    assert self.proxy and self.requests, 'ScraperWorker needs valid proxy instance and requests library to make ' \
                                         'the proxy check.'

    online = False
    status = 'Proxy check failed: {host}:{port} is not used while requesting'.format(host=self.proxy.host, port=self.proxy.port)
    ipinfo = {}
    try:
        text = self.requests.get(self.config.get('proxy_info_url'),
                                 proxies={self.proxy.proto: 'http://' + self.proxy.host + ':' + self.proxy.port}
                                 ).text

which I changed to the following:

[...]
            text = self.requests.get(self.config.get('proxy_info_url'),
                                     proxies={self.proxy.proto: 'http://' + self.proxy.host + ':' + self.proxy.port}
                                     ).text

and the same goes for the request itself in the search() function:

request = self.requests.get(self.base_search_url + urlencode(self.search_params),
                                        headers=self.headers,
                                        timeout=timeout)

which I changed to:

request = self.requests.get(self.base_search_url + urlencode(self.search_params),
                                        headers=self.headers,
                                        timeout=timeout,
                                        proxies={'https': 'https://' + self.proxy.host + ':' + self.proxy.port}
                                        )

Note how I specified 'https' instead of 'http'. It seems the proxies were still not being passed correctly when using 'http' in my case. I'm a bit of a newbie and wouldn't be able to explain why; if someone has an explanation, I'd be glad to hear it. Anyway, I hope this can help!
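
(A likely explanation, not confirmed anywhere in this thread: the keys of the proxies dict passed to requests are matched against the scheme of the URL being fetched. Google search URLs are https, so an entry keyed 'http' is simply never consulted for them and the request goes out over your own connection. A small standalone sketch, using the example local proxy from earlier in the thread:)

import requests
from urllib.parse import urlencode

proxy = 'http://127.0.0.1:8123'  # example local proxy; substitute your own

# requests picks the proxy entry whose key equals the scheme of the target URL.
search_url = 'https://www.google.com/search?' + urlencode({'q': 'test'})

requests.get(search_url, proxies={'http': proxy})                  # https URL: proxy ignored
requests.get(search_url, proxies={'http': proxy, 'https': proxy})  # proxy actually used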