Iceloof / GoogleNews

Script for GoogleNews
https://pypi.org/project/GoogleNews/
MIT License

No Results #25

Closed mattliscia closed 3 years ago

mattliscia commented 4 years ago

Hello, I have tried the most basic searches with no success:

googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))

This returns 0. Having this issue on v1.3.8

Any help would be appreciated, thank you.

HurinHu commented 4 years ago

Hello, I have tried the most basic searches with no success:

googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))

This returns 0. Having this issue on v1.3.8

Any help would be appreciated, thank you.

Are you running this on a cloud server? Have you tried running it locally or with a different IP? This issue mostly happens when Google identifies your platform as a robot.

mattliscia commented 4 years ago

I am running it locally. What do you mean by cloud server?

HurinHu commented 4 years ago

Can you try this and see whether it returns anything or not:

import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0'}
url = 'https://news.google.com/'
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
page = response.read()

mattliscia commented 4 years ago

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

onseau commented 4 years ago

HurinHu, do you know how to modify it to get the total number of results over an unknown number of pages? And thank you so much for making this available!

HurinHu commented 4 years ago

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

Well, that is weird; I have tried your code on my terminal, and it returns 10.

HurinHu commented 4 years ago

HurinHu, do you know how to modify it to get the total number of results over an unknown number of pages? And thank you so much for making this available!

Currently it seems impossible to get the total number of pages: when you search on Google, it only ever shows the first 10 pages at the bottom, so there is no way to read off the total.

onseau commented 4 years ago

It's not possible to make a loop? Like, if the # of results from page 2 = 10, then try page 3?

HurinHu commented 4 years ago

It's not possible to make a loop? Like, if the # of results from page 2 = 10, then try page 3?

If there are one hundred pages, you will need to wait a long time to get the result, and there is a risk that Google might recognize you as a robot and block your IP.

onseau commented 4 years ago

Trust me, there are usually only a few pages max. The "# of results" figure is only accurate on the last page. Umm... do you know how to write it? I don't know Python, or any language really.

And is there a "rest" or "back off" command to prevent triggering their robot detection?

HurinHu commented 4 years ago

Trust me, there are usually only a few pages max. The "# of results" figure is only accurate on the last page. Umm... do you know how to write it? I don't know Python, or any language really.

Even if it is just 10 pages, it needs to make 10 requests in a short time; that is not a good way to do it and carries a lot of uncertainty. If you want to do it, you can use a while loop with an increasing page number until you get an empty result.
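Something like this minimal sketch, assuming (as the examples in this thread suggest) that getpage() fetches one more page of the same search and result() returns the accumulated list:

from GoogleNews import GoogleNews
import time

googlenews = GoogleNews()
googlenews.search('Trump')            # fetches page 1
count = len(googlenews.result())
page = 2
while True:
    time.sleep(5)                     # pause between requests to look less like a robot
    googlenews.getpage(page)          # fetch the next page of the same search
    results = googlenews.result()     # accumulated results from all pages so far
    if len(results) == count:         # nothing new came back: last page reached
        break
    count = len(results)
    page += 1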

onseau commented 4 years ago

I see. From my experience trying this manually, even the hottest topics only get a few hundred results. Oh... is the number of results per page already set?

HurinHu commented 4 years ago

I see. From my experience trying this manually, even the hottest topics only get a few hundred results. Oh... is the number of results per page already set?

Normally, ten records per page.

onseau commented 4 years ago

ahh yea i see....okay, thank you sir!

bloodjason27 commented 4 years ago

Hey,

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

Hello, are you able to extract more than 100 news links?

HurinHu commented 4 years ago

Hey,

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

Hello, are you able to extract more than 100 news links?

You can get pages 1 to 10, which will give you about 100 news items. It can't be done automatically; the reason is stated in the previous replies.

bloodjason27 commented 4 years ago

Hey,

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

Hello, are you able to extract more than 100 news links?

You can get pages 1 to 10, which will give you about 100 news items. It can't be done automatically; the reason is stated in the previous replies.

Actually, I wrote some code that extracts the top 100 links directly, without going through pages 1 to 10. I was wondering if it's possible to get, let's say, 1,000 links without getting banned. Would I need to use a Selenium driver?

# class 'BNeawe s3v9rd AP7Wnd' holds descriptions
# class 'BNeawe vvjwJb AP7Wnd' holds titles
import requests
from bs4 import BeautifulSoup

text = 'COVID-19'
url = 'https://www.google.co.uk/search'
# requests URL-encodes the params itself, so there is no need to quote_plus the query first
response = requests.get(url, params={'q': text, 'lr': 'lang_en', 'hl': 'en', 'num': '100', 'cr': 'countryUK', 'tbm': 'nws'})
soup = BeautifulSoup(response.text, 'lxml')

Slist = []
for g in soup.find_all("div", {"class": "kCrYT"}):
    for a in g.find_all("a", {"href": True}):
        # drop the '/url?q=' prefix and the trailing tracking parameters
        string_t = a["href"].replace('/url?q=', '', 1).split('&', 1)[0]
        print(string_t)
        Slist.append(string_t)

HurinHu commented 4 years ago

Hey,

Yes that code works fine. In fact I can scrape news.google.com with Beautiful Soup, so I'm assuming that they are not identifying me as a robot.

Hello, are you able to extract more than 100 news links?

You can get pages 1 to 10, which will give you about 100 news items. It can't be done automatically; the reason is stated in the previous replies.

Actually, I wrote some code that extracts the top 100 links directly, without going through pages 1 to 10. I was wondering if it's possible to get, let's say, 1,000 links without getting banned. Would I need to use a Selenium driver?

Well, if you just make one request to get it, it is fine. Using Selenium is safer; it won't be banned, as it makes real requests from a browser.
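For reference, a minimal sketch of that Selenium approach (this assumes the selenium package is installed and a chromedriver binary is on your PATH; the parsing is the same BeautifulSoup code as above):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()           # drives a real Chrome session, so requests come from an actual browser
driver.get('https://www.google.co.uk/search?q=COVID-19&tbm=nws&num=100')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()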

jjphung commented 4 years ago

I am having the same problem with no results. Running an example from the README. Running in Google Colab.

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.setlang('en')
googlenews.setperiod('d')
googlenews.setTimeRange('02/01/2020','02/28/2020')
googlenews.search('appl')
googlenews.getpage(2)
googlenews.result()

and

from GoogleNews import GoogleNews
googlenews = GoogleNews(start='02/01/2020',end='02/28/2020')
googlenews.search('appl')
resp = googlenews.result()
resp

nectario commented 4 years ago

Same here, no results come back.

googlenews = GoogleNews(period='d', lang='en')
googlenews.search("AAPL")

HurinHu commented 4 years ago

Same here, no results come back.

googlenews = GoogleNews(period='d', lang='en')
googlenews.search("AAPL")

Make sure your IP address is not blocked by Google. If you are running on a cloud server, there is a big chance of being blocked. Try opening google.com in Chrome and check whether it asks you to verify that you are not a robot.

MrUltimate commented 3 years ago

Running this in VSCode (and tried in Google Colab as well) without any luck. Just get an empty list returned.


riyakwl28 commented 3 years ago

Hey @HurinHu, can you tell how many requests can be made in an hour without being blocked by Google?

HurinHu commented 3 years ago

@riyakwl28 Normally a delay of a few seconds will be fine; don't use a while loop without a delay, or multi-threading.
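For example, something like this (hypothetical page range, reusing the googlenews object from the earlier examples):

import time

for page in range(2, 11):
    time.sleep(5)                     # a few seconds of delay between requests
    googlenews.getpage(page)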

dr-alberto commented 3 years ago

I am having the same problem with no results. Running an example from the README. Running in Google Colab.

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.setlang('en')
googlenews.setperiod('d')
googlenews.setTimeRange('02/01/2020','02/28/2020')
googlenews.search('appl')
googlenews.getpage(2)
googlenews.result()

and

from GoogleNews import GoogleNews
googlenews = GoogleNews(start='02/01/2020',end='02/28/2020')
googlenews.search('appl')
resp = googlenews.result()
resp

Did you find any solution to your issue?