MarioVilas / googlesearch

Google search from Python (unofficial).
BSD 3-Clause "New" or "Revised" License

Looks like Google changed things #94

Closed jowolf closed 3 years ago

jowolf commented 3 years ago

Current requests for shopping results come back empty, regardless of what's being searched for - and when the same search is done on Google itself, it returns plenty of stuff just fine.

I observe that Google's current parm is 'tbm=shop', which does not appear to match what you're using (tpe).

[UPDATE 7/10 - I just looked at your code and you are indeed using tbm, tpe is just the param name you use, short for 'type' - so disregard my last sentence above.]
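For anyone following along, here is a minimal sketch of how a shopping query URL carrying `tbm=shop` could be put together. The base URL and the helper name `build_search_url` are assumptions made for illustration; this is not the library's actual code, only the `tpe` to `tbm` mapping described above.

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only.
BASE_URL = "https://www.google.com/search"


def build_search_url(query, tpe=""):
    """Hypothetical helper: return a search URL, sending `tpe` as Google's `tbm`."""
    params = {"q": query}
    if tpe:
        params["tbm"] = tpe  # e.g. "shop" for shopping results
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("wd red drives", tpe="shop"))
```

Printing the URL makes it easy to paste into a browser and compare against what a manual search sends.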

j

MarioVilas commented 3 years ago

Odd! I just tried this on my machine and it worked:

$ google lalala
http://www.lalalalalalalalalalalalalalalalalala.com/
http://lalala.world/
https://www.youtube.com/watch?v=N2Y2vQ-1m7M
https://www.lalalab.com/en/
https://en.wikipedia.org/wiki/Lalala_(song)
http://www.lalala.com.tr/
https://lalalafest.com/
^CTraceback (most recent call last):
  File "/usr/local/bin/google", line 4, in <module>
    __import__('pkg_resources').run_script('google==2.0.3', 'google')
  File "/home/mario/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/mario/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/EGG-INFO/scripts/google", line 137, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/EGG-INFO/scripts/google", line 128, in main
    for url in search(query, **params):
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/googlesearch/__init__.py", line 309, in search
    time.sleep(pause)
KeyboardInterrupt

Can you give me some more details, so I can reproduce the problem?

jowolf commented 3 years ago

Regular search works, shopping search doesn't - here's my ipython transcript:

In [1]: from googlesearch import search_shop, search                                                

In [3]: for i in search ('wd red drives'): print (i)                                                

https://shop.westerndigital.com/products/internal-drives/wd-red-sata-hdd
https://www.westerndigital.com/products/internal-drives/wd-red-hdd
https://blog.westerndigital.com/wd-red-nas-drives/
https://www.amazon.com/Red-3TB-NAS-Hard-Drive/dp/B008JJLW4M
https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
[...plenty of results deleted....]
https://stonehousebrands.com/amgn/zfs-disable-checksum.html
http://uneterreculturelle.org/dfeb/best-nas-for-home-use.html
http://mergeapp.co.za/s4ym24/wd20ezaz-smr.html
http://informaxsempre.it/jjtp/qnap-stuck-on-starting.html

In [4]: for i in search_shop ('wd red drives'): print (i)                                           
(nothing)

In [5]: for i in search ('wd red drives', tpe='shop'): print (i)                                    
(also nothing)

jowolf commented 3 years ago

Looks like Google now builds the entire page with JS and gives classes, etc. random names - yuk. Turning off JS in my browser (Firefox) helps return more readable / scrapable results, but I can't find a command-line GET parm to do that.

MarioVilas commented 3 years ago

I can reproduce the problem now.

Yeah, Google has been building the search page entirely from JavaScript for a while, but there was a parameter in the URL that disabled this behavior. Seems like they are forcing the JavaScript-only version now for shop searches. I can only imagine this is intentional, because people were scraping the results for shopping bots.

I'll see if I can work around it somehow...

MarioVilas commented 3 years ago

Definitely looks like an anti-bot thing. Even if I use the exact same URL parameters as a manual search with JS disabled on Firefox I still get 0 results. (And it's not a parsing error - rendering the HTML response gives you the "try more broad search keywords" message).

I'm guessing they're also detecting something else, like a particular order of arguments, HTTP headers, user-agent, etc...

MarioVilas commented 3 years ago

Turns out it was the user agent. I have a strong feeling they specifically blacklisted the default user agent in my Python library - I feel attacked xDDD

MarioVilas commented 3 years ago

Aaaand once I bypass that there are more defenses. Clearly someone has been playing a cat and mouse game here and it wasn't me.

I'm disabling the shopping search until further notice. I flat out refuse to chase after whatever Google does next to break my library. Folks looking to implement shopping bots will have to come up with their own solutions.

jowolf commented 3 years ago

Your decision, Mario -

FWIW, in looking further, I found it - there's one parm in the query that's different (source) once I turn off JS - and if I do a wget on that full URL, it returns valid, parseable HTML - with nested divs, and the tabular data about 4-5 levels deep, with the divs all having 5-character randomly-generated class names.

You'd have to do something like count the occurrence of each classname to figure out which one is the main parent of each result, and go from there - or look for specific text in the result and work back up.
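A rough sketch of the class-counting idea, using only the standard library. The sample markup and its class names are made up for illustration (they are not taken from a real Google response), and `count_div_classes` is a hypothetical helper name.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count how often each class name appears on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.counts.update(value.split())


def count_div_classes(html):
    """Hypothetical helper: return a Counter of div class names in `html`."""
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up markup standing in for a results page.
SAMPLE = """
<div class="aB3xQ">
  <div class="k9ZpL"><div class="Qw2rT">Item one</div></div>
  <div class="k9ZpL"><div class="Qw2rT">Item two</div></div>
  <div class="k9ZpL"><div class="Qw2rT">Item three</div></div>
</div>
"""

counts = count_div_classes(SAMPLE)
for name, n in counts.most_common():
    print(name, n)
```

Class names that repeat once per result are the candidates for the per-result container. Telling the parent from its children would still need a second pass over the nesting, and the names would presumably change whenever the page is regenerated.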

And yes, I concur that this sort of cat-and-mouse game, aka tech "arms race", is never a productive undertaking (with the possible exception of ad-blocking, but even that has its issues).

Although I don't approve of Google obfuscating its results either...

j

jowolf commented 3 years ago

One more thing - I did have to set the user agent for wget, the default user agent returned 403 Forbidden (as you mentioned earlier).

Just a simple "Mozilla" did the trick, but who knows what would happen if you did it a few (hundred) more times.
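For reference, the same idea in Python rather than wget might look like the sketch below. It only prepares the request with an explicit User-Agent header. The URL is a placeholder (not the full URL mentioned above, which isn't reproduced here), `build_request` is a hypothetical helper name, and there is no guarantee Google keeps accepting this value.

```python
import urllib.request


def build_request(url, user_agent="Mozilla"):
    """Hypothetical helper: prepare a GET request with an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL for illustration only.
req = build_request("https://www.google.com/search?q=wd+red+drives&tbm=shop")

# urllib normalizes header names, so the lookup key is "User-agent".
print(req.get_header("User-agent"))

# Sending it would look like this (commented out so the sketch makes no network calls):
# with urllib.request.urlopen(req) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```

A 403 from `urlopen` would surface as `urllib.error.HTTPError`, which is worth catching if this is run repeatedly.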