Closed jowolf closed 3 years ago
Odd! I just tried this on my machine and it worked:
$ google lalala
http://www.lalalalalalalalalalalalalalalalalala.com/
http://lalala.world/
https://www.youtube.com/watch?v=N2Y2vQ-1m7M
https://www.lalalab.com/en/
https://en.wikipedia.org/wiki/Lalala_(song)
http://www.lalala.com.tr/
https://lalalafest.com/
^CTraceback (most recent call last):
  File "/usr/local/bin/google", line 4, in <module>
    __import__('pkg_resources').run_script('google==2.0.3', 'google')
  File "/home/mario/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/mario/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/EGG-INFO/scripts/google", line 137, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/EGG-INFO/scripts/google", line 128, in main
    for url in search(query, **params):
  File "/usr/local/lib/python2.7/dist-packages/google-2.0.3-py2.7.egg/googlesearch/__init__.py", line 309, in search
    time.sleep(pause)
KeyboardInterrupt
Can you give me some more details, so I can reproduce the problem?
Regular search works, shopping search doesn't - here's my ipython transcript:
In [1]: from googlesearch import search_shop, search
In [3]: for i in search ('wd red drives'): print (i)
https://shop.westerndigital.com/products/internal-drives/wd-red-sata-hdd
https://www.westerndigital.com/products/internal-drives/wd-red-hdd
https://blog.westerndigital.com/wd-red-nas-drives/
https://www.amazon.com/Red-3TB-NAS-Hard-Drive/dp/B008JJLW4M
https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
[...plenty of results deleted....]
https://stonehousebrands.com/amgn/zfs-disable-checksum.html
http://uneterreculturelle.org/dfeb/best-nas-for-home-use.html
http://mergeapp.co.za/s4ym24/wd20ezaz-smr.html
http://informaxsempre.it/jjtp/qnap-stuck-on-starting.html
In [4]: for i in search_shop ('wd red drives'): print (i)
(nothing)
In [5]: for i in search ('wd red drives', tpe='shop'): print (i)
(also nothing)
Looks like Google now builds the entire page with JS, and gives classes, etc. random names - yuk. Turning off JS in my browser (Firefox) returns more readable / scrapable results, but I can't find a command-line GET parameter to do that.
I can reproduce the problem now.
Yeah, Google has been building the search page entirely from JavaScript for a while, but one of the parameters in the URL disabled this behavior. Seems like they are now forcing the JavaScript-only version for shop searches. I can only imagine this is intentional, because people were scraping the results for shopping bots.
I'll see if I can work around it somehow...
Definitely looks like an anti-bot thing. Even if I use the exact same URL parameters as a manual search with JS disabled on Firefox I still get 0 results. (And it's not a parsing error - rendering the HTML response gives you the "try more broad search keywords" message).
I'm guessing they're also detecting something else, like a particular order of arguments, HTTP headers, user-agent, etc...
Turns out it was the user agent. I have a strong feeling they specifically blacklisted the default user agent in my Python library - I feel attacked xDDD
Aaaand once I bypass that there are more defenses. Clearly someone has been playing a cat and mouse game here and it wasn't me.
I'm disabling the shopping search until further notice. I flat out refuse to chase after whatever Google does next to break my library, folks looking to implement shopping bots will have to come up with their own solutions.
Your decision, Mario -
FWIW, in looking further, I found it: once I turn off JS, one parameter in the query is different (source). If I do a wget on that full URL, it returns valid, parseable HTML, with nested divs and the tabular data about 4-5 levels deep, and the divs all have 5-character randomly generated class names.
You'd have to do something like count the occurrence of each classname to figure out which one is the main parent of each result, and go from there - or look for specific text in the result and work back up.
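The class-counting idea could be sketched roughly like this in Python 3, using only the standard library. The sample HTML and its class names are made up for illustration, and `DivClassCounter` / `count_div_classes` are hypothetical names, not part of googlesearch. Real pages will look different, so treat this as a starting point rather than a working scraper:

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count how often each class name appears on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.counts.update(value.split())


def count_div_classes(html):
    """Return a Counter mapping class name -> number of <div>s using it."""
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample standing in for a results page (assumption, not real output).
SAMPLE = """
<div class="aXb3k">
  <div class="Qz9pL"><div class="mN4tR">WD Red 4TB</div><div class="kP2wY">$99</div></div>
  <div class="Qz9pL"><div class="mN4tR">WD Red 8TB</div><div class="kP2wY">$179</div></div>
  <div class="Qz9pL"><div class="mN4tR">WD Red 12TB</div><div class="kP2wY">$279</div></div>
</div>
"""

counts = count_div_classes(SAMPLE)
for name, n in counts.most_common():
    print(n, name)
```

Class names that repeat once per result are candidates for the per-result wrapper or its fields. Telling the wrapper apart from its children would still need the nesting depth or the text-search approach.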
And yes, I concur that this sort of cat-and-mouse game, aka tech "arms race", is never a productive undertaking (with the possible exception of ad-blocking, but even that has its issues).
Although I don't approve of Google obfuscating its results either...
j
One more thing - I did have to set the user agent for wget; the default user agent returned 403 Forbidden (as you mentioned earlier).
Just a simple "Mozilla" did the trick, but who knows what would happen if you did it a few (hundred) more times.
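For anyone trying the same thing from Python 3 rather than wget, a minimal sketch with the standard library's `urllib.request` might look like this. The URL is only an illustrative placeholder and `build_request` is a hypothetical helper, not something from googlesearch. There is no guarantee Google keeps accepting this user agent:

```python
import urllib.request


def build_request(url, user_agent="Mozilla"):
    """Build a GET request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL for illustration only (assumption).
req = build_request("https://www.google.com/search?q=wd+red+drives")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# Actually fetching needs network access, so it is left commented out:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```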
Current requests for shopping results return empty, regardless of what's being searched for - and when done on Google itself, the same search returns plenty of stuff just fine.
I observe that Google's current parameter is 'tbm=shop', which does not appear to match what you're using (tpe).
[UPDATE 7/10 - I just looked at your code and you are indeed using tbm, tpe is just the param name you use, short for 'type' - so disregard my last sentence above.]
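To make the tpe / tbm relationship concrete, here is a rough Python 3 sketch of how a keyword argument named `tpe` could end up as `tbm` in the query string. This is not the library's actual code: `build_search_url` is a hypothetical helper, and the real URL carries more parameters than shown here:

```python
from urllib.parse import urlencode


def build_search_url(query, tpe=""):
    """Map the Python-side `tpe` argument onto Google's `tbm` parameter."""
    params = {"q": query}
    if tpe:
        # `tpe` is only the Python-side name; the URL parameter is `tbm`.
        params["tbm"] = tpe
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("wd red drives", tpe="shop")
print(url)  # https://www.google.com/search?q=wd+red+drives&tbm=shop
```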
j