ayushbits opened this issue 3 years ago
Well, Google only loads the real image after the page has loaded, so when I fetch the original page, it shows the loading placeholder rather than the real image. I am still looking for a solution.
The image is stored in a JavaScript variable. If we can retrieve the object corresponding to each variable, the image can be fetched. AFAIK, we don't need to load the news article page itself; instead, we can extract the corresponding JavaScript variable from the already-parsed HTML page.
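One hypothetical way to do this, without parsing the JS at all, is to scan the raw page source (including inline `<script>` blocks) for image-looking URLs with a regex. The exact variable name Google uses is undocumented and changes over time, so this is only a sketch of the idea, not a confirmed extraction method:

```python
import re

def extract_image_urls(html: str) -> list[str]:
    """Pull image URLs out of raw page source, including inline <script> JS.

    Rather than evaluating the JavaScript, we simply look for anything that
    resembles an image URL anywhere in the fetched text.
    """
    return re.findall(r'https?://[^"\'\s\\]+?\.(?:jpg|jpeg|png|webp)', html)

# Works on static HTML and on JS blobs alike:
sample = '<script>var img = "https://example.com/pic.jpg";</script>'
print(extract_image_urls(sample))  # ['https://example.com/pic.jpg']
```

Whether the URLs found this way are the real article thumbnails or just placeholders would still need checking against an actual Google News response.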
I know, but when you use a script to fetch the page, the JS is not executed, and the images are loaded dynamically by JS. That is what I found last time; I will check again later to see whether Google has made any changes.
Sure, thanks! Let us know whatever the result turns out to be.
We can get the image by using another module alongside this one. Would that be a convenient way?
Which module? @rbshadow
newspaper3k
Well, currently it returns the default loading image. Google loads the real image through JS, so the JS may need to be executed to get the correct URL; any fetching script without JS execution would not help. I have checked newspaper3k: it uses the requests.get() method, which would not help. I am not sure how you got your result; can you post some sample code?
I have attached the full code that I'm currently using.
```python
from GoogleNews import GoogleNews as GN
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

nltk.download('punkt')


def download_News(data_frame, news_name):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/50.0.2661.102 Safari/537.36 '
    config = Config()
    config.browser_user_agent = user_agent
    li_st = []
    df = data_frame
    for ind in df.index:
        row = {}  # renamed from `dict` to avoid shadowing the builtin
        try:
            if news_name == 'Google_News':
                article = Article(df['link'][ind], config=config)
                article.download()
                article.parse()
                article.nlp()
                row['Date'] = df['date'][ind]
                row['Title'] = article.title
                row['Top_Image'] = article.top_image
                row['Link'] = df['link'][ind]
                li_st.append(row)
        except Exception as e:
            print(e)
    news_df = pd.DataFrame(li_st)
    news_df.to_json(news_name + '_articles.json', orient='index', indent=4)  # JSON output


def google_News(start_date, end_date, search_query):
    # Convert DD-MM-YYYY input to the MM/DD/YYYY format GoogleNews expects.
    start_date = start_date.split('-')
    start_date = start_date[1] + '/' + start_date[0] + '/' + start_date[2]
    end_date = end_date.split('-')
    end_date = end_date[1] + '/' + end_date[0] + '/' + end_date[2]
    googlenews = GN(start=start_date, end=end_date)
    googlenews.search(search_query)
    result = googlenews.result(sort=True)
    df = pd.DataFrame(result)
    return df


def start():
    start_date = input('Enter start date (DD-MM-YYYY): ')
    end_date = input('Enter end date (DD-MM-YYYY): ')
    search_query = input('Enter Search Query: ')
    return start_date, end_date, search_query


if __name__ == '__main__':
    query = start()
    googleNews = google_News(query[0], query[1], query[2])
    download_News(googleNews, news_name='Google_News')
```
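As an aside, the manual split-based date conversion in google_News could be done with the standard datetime module instead, which raises a clear error on malformed input rather than silently producing a wrong date string. A minimal sketch (to_us_date is a hypothetical helper, not part of the original code):

```python
from datetime import datetime

def to_us_date(ddmmyyyy: str) -> str:
    """Convert a DD-MM-YYYY string to MM/DD/YYYY, validating it on the way."""
    return datetime.strptime(ddmmyyyy, "%d-%m-%Y").strftime("%m/%d/%Y")

print(to_us_date("25-10-2020"))  # 10/25/2020
```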
Well, it is a solution, but it gets the images from each news article's own page, fetching the items one by one rather than from Google News directly. It is not an ideal approach, as it may take a long time to process ten or more web requests just to get the images. If many pages are requested, there can be side effects, such as being blocked by the website for fetching URLs too frequently, or having to wait much longer.
If anybody has this kind of need, the method may help, but be aware: set some delay between requests, or you might easily get blocked.
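A minimal way to add such a delay is to wrap each fetch in a helper that sleeps for a randomized interval first. This is only a sketch; polite_get is a hypothetical helper, and the 1-3 second range is a guess, not a tested threshold:

```python
import random
import time

def polite_get(fetch, url, min_delay=1.0, max_delay=3.0):
    """Call fetch(url) after sleeping a random min_delay..max_delay seconds,
    so repeated requests to the same site look less like a scraper burst."""
    time.sleep(random.uniform(min_delay, max_delay))
    return fetch(url)

# Usage with any fetcher; a stub stands in for the real request here:
result = polite_get(lambda u: "fetched " + u, "https://example.com")
print(result)  # fetched https://example.com
```

In the loop above, each `article.download()` call would go through a wrapper like this instead of being fired back-to-back.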
Yes, you are right. That's why I asked earlier. Also, the delay is important, as you mentioned. Thanks @HurinHu for your great tool.
@HurinHu I just added some comments to the pull request of mine that you closed. Let me know if that makes any difference.
Dear author @HurinHu,
Thanks for the package! Fetching images is still a pain point. The images are stored as a JavaScript object in the self.content variable (shown in the attached image). I tried extracting the value of the variable but didn't succeed. Could you give it a try?