Nv7-GitHub / googlesearch

A Python library for scraping the Google search engine.
https://pypi.org/project/googlesearch-python/
MIT License

Weird user-agent behavior and infinite while loop risk. #49

Closed sif-gondy closed 1 year ago

sif-gondy commented 1 year ago

Thank you for the 1.2.2 update.

I've been working on the project a bit over the past week and have noted some potential problems:

The user_agents.py feature is a welcome addition, but it made the code fail for me (possibly because I am located in Europe). v1.1.0 had a static, recent (and common) user agent (Windows 10), but the list in the new file contains many highly specific or rarely used user agents.

As a result, every request I attempted was redirected by Google to a consent URL for cookie validation.

It appears that Google (at least in my region) flags unusual or rarely used user agents and prompts them to accept cookies.
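One way to notice this situation is to look at the URL the request finally landed on. The sketch below is only an illustration: `is_consent_redirect` is a hypothetical helper, not part of the library, and it assumes the consent page is served from a `consent.` subdomain of Google, which may differ by region or change over time.

```python
from urllib.parse import urlparse

# Assumption: the consent page lives on a host such as consent.google.com.
CONSENT_HOST_PREFIX = "consent."


def is_consent_redirect(final_url: str) -> bool:
    """Return True if the final URL looks like a Google cookie-consent page."""
    # hostname is None for malformed or empty URLs, so fall back to "".
    host = urlparse(final_url).hostname or ""
    return host.startswith(CONSENT_HOST_PREFIX) and "google." in host
```

With requests, the final URL after redirects is available as `resp.url`, so the check could be run on that value before parsing the HTML.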

To accept the cookies programmatically, without writing a dedicated POST function to accept or reject them, you can pass this in your headers. It does the trick and you land on the requested page:

headers = {
    "User-Agent": self.user_agent,
    "Cookie": "CONSENT=YES+cb.20220302-17-p0.en+FX+100; NID=0"
}
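For anyone who wants to try the header outside the library, here is a minimal standalone sketch using only the standard library. `build_request` is a hypothetical helper name, and the cookie value is simply the one quoted above; there is no guarantee that Google keeps accepting it.

```python
import urllib.request

# Cookie value copied from the comment above; it may stop working at any time.
CONSENT_COOKIE = "CONSENT=YES+cb.20220302-17-p0.en+FX+100; NID=0"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the consent cookie."""
    headers = {
        "User-Agent": user_agent,
        "Cookie": CONSENT_COOKIE,
    }
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` to actually perform the request.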

This led to another problem: being redirected to the consent page (or even retrieving the HTML from the session with the cookie headers) produced a somewhat different HTML structure, one that contains no matches for:

result_block = soup.find_all('div', attrs={'class': 'g'})

--> result_block is an empty list

Because start is only incremented further down in the while loop, and only when link, title and description all exist, this results in an infinite loop.

I would suggest consolidating the code and providing an exit for the case where result_block is an empty list.
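A rough sketch of what such an exit could look like is below. It is not the library's actual loop: `paginate` and `fetch_page` are hypothetical names, with `fetch_page(start)` standing in for "request one page and return its parsed results". The point is only that an empty page ends the loop instead of spinning forever.

```python
from typing import Callable, Iterator, List


def paginate(fetch_page: Callable[[int], List[str]], num_results: int) -> Iterator[str]:
    """Yield up to num_results items, stopping early if a page comes back empty."""
    start = 0
    while start < num_results:
        results = fetch_page(start)
        if not results:
            # Nothing parsed (e.g. consent page or changed HTML): bail out
            # rather than retrying the same offset indefinitely.
            break
        for item in results:
            if start >= num_results:
                break
            yield item
            start += 1
```

Since a non-empty page always advances start by at least one, the loop either makes progress or exits.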

Using recent, widely used user agents (latest OS X or Windows 10) seems to do the trick for me in the meantime:

import random

def get_useragent():
    return random.choice(_useragent_list)

_useragent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0'
]

Nv7-GitHub commented 1 year ago

I updated the user agents in 1.2.3, thanks for reporting! I am surprised about the Google redirect; perhaps I could check for that in the future.