Date of article is not fetched properly

apavlo89 commented 3 years ago

Instead of getting an exact date, I get '1 month ago' in the results document. How can I fix that? Thank you for your help.

from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk
nltk.download('punkt')  # tokenizer data used by article.nlp()

# The config lets us fetch URLs that would otherwise reject the request.
# Without a browser user agent, downloading an article can sometimes fail
# with a 403 client error.
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 5
googlenews=GoogleNews(start='10/19/2020',end='10/19/2020')
googlenews.search('test')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())
for i in range(2,5):
    googlenews.getpage(i)
    result=googlenews.result()
    df=pd.DataFrame(result)
rows=[]  # renamed from `list` to avoid shadowing the built-in
for ind in df.index:
    article = Article(df['link'][ind],config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that cannot be downloaded or parsed
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()  # populates article.summary

    row={}  # renamed from `dict` to avoid shadowing the built-in
    row['Date']=df['date'][ind]
    row['Media']=df['media'][ind]
    row['Title']=article.title
    row['Article']=article.text
    row['Summary']=article.summary
    rows.append(row)
news_df=pd.DataFrame(rows)
news_df.to_excel("articles.xlsx")

HurinHu commented 3 years ago

It seems Google only returns this relative text rather than an exact date, so it probably cannot be fixed at the moment.

apavlo89 commented 3 years ago

It's weird, because it works with older dates. For example, if I try some dates in 2017, it gives me the exact date.

HurinHu commented 3 years ago

Yes, recent news items do not show an exact date; only older ones do. You can check this in a Google search yourself, it uses the same format.
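
If an approximate date is good enough, the relative text can be converted after the fact. The following is only a minimal sketch: the function name `approximate_date` is hypothetical, the phrasings it recognises ('N days ago', 'N months ago', and so on) are an assumption about what Google returns, and months and years are rounded to 30 and 365 days, so the result is an estimate rather than the real publication date.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit. Month and year lengths are assumptions (30 and 365 days),
# so converted dates are estimates only.
_UNIT_SECONDS = {
    'minute': 60,
    'hour': 3600,
    'day': 86400,
    'week': 7 * 86400,
    'month': 30 * 86400,
    'year': 365 * 86400,
}

# Assumed phrasing: a number, a unit (singular or plural), then "ago".
_PATTERN = re.compile(
    r'(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago', re.IGNORECASE
)


def approximate_date(text, now=None):
    """Return an estimated datetime for text like '1 month ago', else None."""
    match = _PATTERN.search(text)
    if match is None:
        return None  # not a relative date, e.g. already an exact date
    now = now or datetime.now()
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return now - timedelta(seconds=amount * _UNIT_SECONDS[unit])
```

Applied to the dataframe from the script above, this could be used as `df['date'].apply(approximate_date)`, keeping the original value wherever the function returns None.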

apavlo89 commented 3 years ago

Hmm. I guess a workaround would be to set the start and end date to the same day and write that date into the dataframe as the timestamp. The script would then loop, moving forward one day at a time and inserting the right date, until it reaches an end date you set. Is anyone code-savvy enough to do this?
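
A rough sketch of that day-by-day idea, for illustration only. The helper names `daily_windows`, `collect_by_day` and `fetch_from_googlenews` are hypothetical; the last one simply wraps the `GoogleNews(start=..., end=...)`, `search()` and `result()` calls used earlier in this thread and has not been checked against other versions of the library. Each search covers a single day, so every result can be stamped with that day instead of the relative text.

```python
from datetime import date, timedelta


def daily_windows(first, last):
    """Yield every day from first to last (inclusive) as 'MM/DD/YYYY'."""
    day = first
    while day <= last:
        yield day.strftime('%m/%d/%Y')
        day += timedelta(days=1)


def collect_by_day(first, last, fetch):
    """Run one search per day and overwrite 'date' with the day searched.

    `fetch` is a hypothetical callable that takes an 'MM/DD/YYYY' string
    and returns a list of result dicts for that single day.
    """
    rows = []
    for day in daily_windows(first, last):
        for item in fetch(day):
            stamped = dict(item)   # copy so the original result is untouched
            stamped['date'] = day  # replace '1 month ago' with the known day
            rows.append(stamped)
    return rows


def fetch_from_googlenews(day, query='test'):
    """Assumed wrapper around the calls used earlier in this thread."""
    from GoogleNews import GoogleNews  # imported lazily; needs network access
    googlenews = GoogleNews(start=day, end=day)
    googlenews.search(query)
    return googlenews.result()
```

Usage would be something like `pd.DataFrame(collect_by_day(date(2020, 10, 1), date(2020, 10, 31), fetch_from_googlenews))`. This issues one search per day, so the number of requests grows with the length of the date range.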

HurinHu commented 3 years ago

That could be a solution, but I don't recommend it: you may be blocked by Google.

apavlo89 commented 3 years ago

I was also thinking about how to avoid getting blocked by Google. Perhaps there's a Python proxy script that connects to a random proxy server, and that code could be put into the for loop below. I'll have a look and see if this is possible.

rows=[]  # renamed from `list` to avoid shadowing the built-in
for ind in df.index:
    article = Article(df['link'][ind],config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that cannot be downloaded or parsed
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()  # populates article.summary

    row={}  # renamed from `dict` to avoid shadowing the built-in
    row['Date']=df['date'][ind]
    row['Media']=df['media'][ind]
    row['Title']=article.title
    row['Article']=article.text
    row['Summary']=article.summary
    rows.append(row)

HurinHu commented 3 years ago

You can add a random delay between consecutive requests.

apavlo89 commented 3 years ago

I see. Would something like this work? What do you use/suggest?

import random  # needed for random.randint() below
import time    # needed for time.sleep() below

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 3
googlenews=GoogleNews(start='01/01/2018',end='12/29/2018')
googlenews.search('test')
result=googlenews.result()
df=pd.DataFrame(result)
print(df.head())
for i in range(2,365):
    googlenews.getpage(i)
    result=googlenews.result()
    df=pd.DataFrame(result)
    time.sleep(random.randint(1,30)) #something like this? Would this be correct?
rows=[]  # renamed from `list` to avoid shadowing the built-in
for ind in df.index:
    article = Article(df['link'][ind],config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that cannot be downloaded or parsed
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()  # populates article.summary

    row={}  # renamed from `dict` to avoid shadowing the built-in
    row['Date']=df['date'][ind]
    row['Media']=df['media'][ind]
    row['Title']=article.title
    row['Article']=article.text
    row['Summary']=article.summary
    rows.append(row)
news_df=pd.DataFrame(rows)
news_df.to_csv("articles.csv")

HurinHu commented 3 years ago

You can do it this way.

Iceloof / GoogleNews

Date of article is not fetched properly #42