Closed apavlo89 closed 3 years ago
It seems Google only returns this text rather than the exact date, so it probably can't be fixed at the moment.
It's weird, because it works with older dates. If I try some dates in 2017, for example, it gives me the exact date.
Yes, recent news items don't show an exact date, only older ones do. You can check in your own Google search; it uses the same format.
Hmm. I guess a workaround would be to set the start and end date to the same day and have the code stamp that date into the dataframe. The script would then loop, advancing one day at a time and inserting the right date, until it reaches an end date you set. Is anyone code-savvy enough to do this?
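A rough sketch of that day-by-day idea, in case it helps. The names `daily_windows`, `collect_by_day` and `fetch_day` are made up for illustration; `fetch_day` stands for whatever function you write that runs a one-day search (for example by creating `GoogleNews(start=day, end=day)`, searching, and returning `result()`) and hands back a list of dicts.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield each day from start to end (inclusive) as an MM/DD/YYYY string."""
    day = start
    while day <= end:
        yield day.strftime("%m/%d/%Y")
        day += timedelta(days=1)


def collect_by_day(start, end, fetch_day):
    """Call fetch_day(day_string) once per day and stamp each result with that day.

    fetch_day is a hypothetical, caller-supplied function that returns a list
    of dicts for a single day.
    """
    rows = []
    for day in daily_windows(start, end):
        for item in fetch_day(day):
            stamped = dict(item)   # copy so the original result is left untouched
            stamped["date"] = day  # replace relative text such as '1 month ago'
            rows.append(stamped)
    return rows
```

The result of `collect_by_day(date(2018, 1, 1), date(2018, 1, 31), fetch_day)` can be passed straight to `pd.DataFrame`. Note that this makes one search per day, so the number of requests grows quickly with the date range.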
That could work, but I don't recommend it: you may get blocked by Google.
I was also thinking about how to avoid getting blocked by Google. Perhaps there's a Python proxy script that connects to a random proxy server, and that code could be put into the for loop below. I'll have a look and see if this is possible.
for ind in df.index:
    row = {}
    article = Article(df['link'][ind], config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that cannot be downloaded or parsed.
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()  # needed before article.summary is available
    row['Date'] = df['date'][ind]
    row['Media'] = df['media'][ind]
    row['Title'] = article.title
    row['Article'] = article.text
    row['Summary'] = article.summary
    rows.append(row)  # rows is the list created before the loop
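For the random-proxy idea, a minimal sketch might look like the following. The addresses in `PROXY_POOL` are placeholders from a documentation-only IP range, and `pick_proxy` is a made-up helper name. The commented lines at the end assume that newspaper's `Config` object accepts a requests-style `proxies` dict, which is worth checking against the version you have installed.

```python
import random

# Placeholder addresses only; replace with proxies you are permitted to use.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(pool, rng=random):
    """Return a requests-style proxies dict built from one randomly chosen proxy."""
    proxy = rng.choice(pool)
    return {"http": proxy, "https": proxy}


# Inside the loop, before creating the Article (assumption: Config has a
# `proxies` attribute in your newspaper version):
#     config.proxies = pick_proxy(PROXY_POOL)
#     article = Article(df['link'][ind], config=config)
```

This only affects the article downloads. Whether the Google News search itself can be routed through a proxy depends on the GoogleNews package and is not covered here.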
You can add a random delay between requests.
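As a small illustration of a random delay, here is one way to wrap it in a helper. The name `polite_pause` and the default bounds are arbitrary choices for this sketch, not values anyone has confirmed are safe against blocking.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=10.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration between min_seconds and max_seconds.

    Returns the chosen delay so it can be logged. The sleep and rng
    parameters exist only to make the helper easy to test.
    """
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_pause()` after each page request or article download spaces the requests out irregularly, which is the point of randomizing rather than using a fixed `time.sleep`.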
I see. Would something like this work? What do you use/suggest?
import random
import time

import pandas as pd
from GoogleNews import GoogleNews
from newspaper import Article, Config

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 3

googlenews = GoogleNews(start='01/01/2018', end='12/29/2018')
googlenews.search('test')
result = googlenews.result()
df = pd.DataFrame(result)
print(df.head())

for i in range(2, 365):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)
    time.sleep(random.randint(1, 30))  # something like this? Would this be correct?

rows = []
for ind in df.index:
    row = {}
    article = Article(df['link'][ind], config=config)
    try:
        article.download()
        article.parse()
    except Exception:
        # Skip articles that cannot be downloaded or parsed.
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
    article.nlp()  # needed before article.summary is available
    row['Date'] = df['date'][ind]
    row['Media'] = df['media'][ind]
    row['Title'] = article.title
    row['Article'] = article.text
    row['Summary'] = article.summary
    rows.append(row)

news_df = pd.DataFrame(rows)
news_df.to_csv("articles.csv")
Yes, you can do it this way.
Instead of getting an exact date, I get '1 month ago' in the results document. How can I fix that? Thank you for your help.
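If an approximate date is good enough, one option is to convert the relative text yourself after the search. The sketch below assumes the strings look like '3 days ago' or '1 month ago'; `approximate_date` is a made-up helper, and months and years are treated as 30 and 365 days, so the result is only an estimate.

```python
import re
from datetime import datetime, timedelta

# Approximate unit lengths; months and years are not fixed-size in reality.
_UNIT_DELTAS = {
    "minute": timedelta(minutes=1),
    "min": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}
_PATTERN = re.compile(
    r"(\d+)\s+(minute|min|hour|day|week|month|year)s?\s+ago", re.IGNORECASE
)


def approximate_date(text, now=None):
    """Convert text like '1 month ago' to an estimated datetime.

    Returns None when the text does not match the assumed pattern,
    for example when it is already an exact date.
    """
    now = now or datetime.now()
    match = _PATTERN.search(text)
    if match is None:
        return None
    count = int(match.group(1))
    unit = match.group(2).lower()
    return now - count * _UNIT_DELTAS[unit]
```

With a dataframe of results, something like `df['date'].apply(approximate_date)` would produce a column of estimates, leaving `None` wherever the original value was not in the relative format.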