Closed mattliscia closed 3 years ago
Hello, I have tried the most basic searches with no success:
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))
This returns 0. Having this issue on v1.3.8
Any help would be appreciated, thank you.
Are you running this on a cloud server? Have you tried running it locally or from a different IP? This issue mostly happens when Google identifies your platform as a robot.
I am running it locally. What do you mean by cloud server?
Can you try this and see whether it returns anything?
import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0'}
url = 'https://news.google.com/'
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
page = response.read()
Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.
HurinHu, do you know how to modify it to get the total number of results over an unknown number of pages? And thank you so much for making this available!
Well, that is weird. I tried your code in my terminal, and it returns 10.
Currently, it seems impossible to get the total number of pages. When you search on Google, it only ever shows links to the first 10 pages at the bottom, so there is no way to get the total number.
it's not possible to make a loop? like if the # of results from page 2 = 10, then try page 3?
If there are one hundred pages, you will need to wait a long time to get the result, and there is a risk that Google will recognize you as a robot and block your IP.
trust me, there's usually only a few pages max. the "# of results" thing is only accurate on the last page. umm...but do you know how to write it? i don't know python. or any language really.
and is there a "rest" or "back off" command to prevent triggering their roboID?
Even if it is just 10 pages, it needs to make 10 requests in a short time, which is not a good way to do it and carries a lot of uncertainty. If you want to do it, you can use a while loop that increases the page number until it gets an empty result.
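A minimal sketch of that while-loop idea, with a pause between requests. The names `collect_until_empty` and `fetch_page` are hypothetical and not part of the GoogleNews package: `fetch_page` stands for whatever function returns the list of results for one page number.

```python
import time
from typing import Callable, Dict, List


def collect_until_empty(
    fetch_page: Callable[[int], List[Dict]],  # hypothetical: returns results for one page
    max_pages: int = 10,
    delay_seconds: float = 3.0,
) -> List[Dict]:
    """Fetch pages 1, 2, 3, ... until a page is empty or max_pages is reached."""
    collected: List[Dict] = []
    page = 1
    while page <= max_pages:
        if page > 1:
            # Pause before every request after the first to avoid a rapid burst.
            time.sleep(delay_seconds)
        items = fetch_page(page)
        if not items:
            # An empty page is treated as the end of the results.
            break
        collected.extend(items)
        page += 1
    return collected
```

With the library, `fetch_page` might wrap the `getpage(n)` and `result()` calls that appear elsewhere in this thread, but how those calls accumulate results between pages is not confirmed here, so treat that wiring as an assumption. The `max_pages` cap keeps the number of requests bounded even if pages never come back empty.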
i see. from my experience trying this manually, even the hottest topics only get a few hundred results. o... is the number of results per page already set?
normally ten records per page
ahh yea i see....okay, thank you sir!
Hey,
Hello, are you able to extract more than 100 news links?
You can fetch pages 1 to 10, which gives you about 100 news items. It can't be done automatically, for the reason stated in previous replies.
Actually, I wrote some code that extracts the top 100 links directly without going through pages 1 to 10. I was wondering if it's possible to get, let's say, 1,000 links without getting banned. Would I need to use a Selenium driver?
`
import requests
from bs4 import BeautifulSoup

# BNeawe s3v9rd AP7Wnd for description
# BNeawe vvjwJb AP7Wnd for titles
# class_='BNeawe vvjwJb AP7Wnd' to extract URL headings
text = 'COVID-19'  # requests URL-encodes params itself, so no quote_plus is needed
url = 'https://www.google.co.uk/search'
params = {'q': text, 'lr': 'lang_en', 'hl': 'en', 'num': '100', 'cr': 'countryUK', 'tbm': 'nws'}  # or 'tbm': 'bks'
response = requests.get(url, params=params)
soup = BeautifulSoup(response.text, 'lxml')

links = []
prefix = '/url?q='
for g in soup.find_all("div", {"class": "kCrYT"}):
    for a in g.find_all("a", {"href": True}):
        href = a["href"]
        # str.strip() removes a set of characters, not a prefix, so slice the prefix off instead
        if href.startswith(prefix):
            link = href[len(prefix):].split('&', 1)[0]
            print(link)
            links.append(link)
`
Well, if you make just one request to get it, that is fine. Using Selenium is safer: it won't be banned, as it is making real requests from a browser.
I am having the same problem with no results. Running an example from the README. Running in Google Colab.
from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.setlang('en')
googlenews.setperiod('d')
googlenews.setTimeRange('02/01/2020','02/28/2020')
googlenews.search('appl')
googlenews.getpage(2)
googlenews.result()
and
from GoogleNews import GoogleNews
googlenews = GoogleNews(start='02/01/2020',end='02/28/2020')
googlenews.search('appl')
resp = googlenews.result()
resp
Same here, no results come back.
` googlenews = GoogleNews(period='d', lang='en')
googlenews.search("AAPL") `
Make sure your IP address is not blocked by Google. If you are running on a cloud server, there is a good chance it is blocked. Try opening google.com in Chrome and check whether it asks you to verify that you are not a robot.
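The browser check can also be approximated in code. The sketch below is only a heuristic and rests on assumptions that are not stated in this thread: that a blocked request comes back with HTTP status 429, lands on a URL containing `/sorry/`, or returns a page mentioning "unusual traffic". The function name `looks_blocked` is hypothetical.

```python
def looks_blocked(status_code: int, final_url: str, body: str) -> bool:
    """Heuristic guess at whether a response is a robot check, not real results."""
    # Assumption: rate-limited requests are answered with 429 Too Many Requests.
    if status_code == 429:
        return True
    # Assumption: the robot-check page is served from a /sorry/ path.
    if "/sorry/" in final_url:
        return True
    # Assumption: the robot-check page mentions "unusual traffic".
    return "unusual traffic" in body.lower()
```

With the `urllib.request` snippet earlier in the thread, the inputs would be `response.status`, `response.geturl()` and the decoded page. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx responses, so the status would come from the exception's `code` attribute in that case.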
Running this in VSCode (and tried in Google Colab as well) without any luck. I just get an empty list returned.
hey @HurinHu , can u tell how many requests can be made in an hour without being blocked by google?
@riyakwl28 normally, a delay of a few seconds between requests will be fine. Don't use a while loop without a delay, and don't use multiple threads.
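As a rough illustration of that advice, the helper below runs requests one at a time in a single thread and pauses a few seconds between them. The name `run_with_delay` and the 2 to 5 second range are assumptions chosen for the example, not documented limits.

```python
import random
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_with_delay(
    tasks: Iterable[Callable[[], T]],
    min_delay: float = 2.0,  # assumed lower bound, in seconds
    max_delay: float = 5.0,  # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Run tasks sequentially, sleeping a random interval between them."""
    results: List[T] = []
    for index, task in enumerate(tasks):
        if index > 0:
            # Randomised pause so the requests are not evenly spaced.
            sleep(random.uniform(min_delay, max_delay))
        results.append(task())
    return results
```

Each task is a zero-argument callable, for example a `lambda` that performs one search. The `sleep` parameter exists only so the pause can be replaced in tests.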
Did you find any solution to your issue?