Iceloof / GoogleNews

Script for GoogleNews
https://pypi.org/project/GoogleNews/
MIT License
316 stars · 88 forks

Any success with fetching images? #33

Open ayushbits opened 3 years ago

ayushbits commented 3 years ago

Dear Author @HurinHu ,

Thanks for the package! Fetching images is still a pain, though. The images are stored as a JavaScript object in the self.content variable (shown in the image below). I tried extracting the variable's value but didn't succeed. Could you take a look?

[image]

HurinHu commented 3 years ago

Well, Google only loads the real image after the page has loaded, so when I fetch the original page, it shows the loading placeholder rather than the real image. I am still looking for a solution.

ayushbits commented 3 years ago

The image is stored in a JavaScript variable. If we can retrieve the object corresponding to each variable, the image can be fetched. AFAIK, we don't need to load the news article page itself; instead, we can extract the corresponding JavaScript variable directly from the parsed HTML page.
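The extraction ayushbits describes could be sketched roughly as follows. Note this is only an illustration: the inline-script markup below is hypothetical, and Google's actual variable names and structure may differ (and change over time).

```python
import re

# Hypothetical snippet of the kind of inline JS a search-results page
# might embed; the real structure on Google News may differ.
html = """
<script>
function _setimgsrc(id, src){document.getElementById(id).src=src;}
_setimgsrc('vidthumb1','data:image/jpeg;base64,/9j/4AAQ...');
var img_url = 'https://example.com/thumb.jpg';
</script>
"""

def extract_js_image_urls(page_source):
    # Pull any http(s) image URLs assigned inside script blocks.
    # Base64 data: URIs (the placeholders) deliberately do not match.
    return re.findall(r"https?://[^'\"]+\.(?:jpg|jpeg|png|gif)", page_source)

print(extract_js_image_urls(html))  # → ['https://example.com/thumb.jpg']
```

Whether this works in practice depends on the real image URLs actually appearing as literals in the fetched HTML, which is exactly the point of contention in the rest of the thread.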


HurinHu commented 3 years ago

I know, but when you use the script to fetch the page, the JS is not executed, and the images are loaded dynamically by JS. That is what I found last time; I will check again later in case Google has made changes.
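HurinHu's point can be demonstrated with a toy example. The markup below is hypothetical, but it mirrors the described behavior: the raw HTML that a non-JS fetcher sees carries a placeholder src, while the real URL sits inside a script that never runs.

```python
import re

# Hypothetical markup illustrating the problem: the <img> src in the
# static HTML is a placeholder, and only executed JS would replace it.
static_html = """
<img id="vidthumb1" src="/images/cleardot.gif">
<script>
document.getElementById('vidthumb1').src = 'https://example.com/real.jpg';
</script>
"""

def img_src_without_js(html):
    # What a requests.get()-style fetch sees: the placeholder src,
    # because the <script> that swaps in the real URL is never executed.
    return re.search(r'<img[^>]*src="([^"]+)"', html).group(1)

print(img_src_without_js(static_html))  # → /images/cleardot.gif
```

Getting the real URL in this situation would require either executing the JS (e.g. a headless browser) or finding the URL as a literal elsewhere in the page source.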

ayushbits commented 3 years ago

Sure, thanks! Let us know whatever the result turns out to be.

rbshadow commented 3 years ago

We can get the image by using another module inside this one. Would that be a convenient approach?

HurinHu commented 3 years ago

Which module? @rbshadow


rbshadow commented 3 years ago


newspaper3k

HurinHu commented 3 years ago

Well, currently it returns the default loading image. Google loads the real image through JS, so you would need to execute the JS to get the correct URL; any fetching script without JS execution would not help. I have checked newspaper3k, and it uses the requests.get() method, which would not help either. I am not sure how you got a result — can you post some sample code?


rbshadow commented 3 years ago

I've attached the full code that I'm currently using.

Code

from GoogleNews import GoogleNews as GN
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

nltk.download('punkt')

def download_News(data_frame, news_name):
    # Some sites block the default Python user agent, so pretend to be a browser.
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/50.0.2661.102 Safari/537.36'
    config = Config()
    config.browser_user_agent = user_agent
    articles = []
    df = data_frame
    for ind in df.index:
        try:
            if news_name == 'Google_News':
                # Download and parse each article page; newspaper3k extracts
                # the main image (top_image) from the article's own HTML.
                article = Article(df['link'][ind], config=config)
                article.download()
                article.parse()
                article.nlp()
                articles.append({
                    'Date': df['date'][ind],
                    'Title': article.title,
                    'Top_Image': article.top_image,
                    'Link': df['link'][ind],
                })
        except Exception as e:
            print(e)

    news_df = pd.DataFrame(articles)
    news_df.to_json(news_name + '_articles.json', orient='index', indent=4)  # JSON output

def google_News(start_date, end_date, search_query):
    # Convert DD-MM-YYYY input into the MM/DD/YYYY format GoogleNews expects.
    d, m, y = start_date.split('-')
    start_date = m + '/' + d + '/' + y
    d, m, y = end_date.split('-')
    end_date = m + '/' + d + '/' + y

    googlenews = GN(start=start_date, end=end_date)
    googlenews.search(search_query)
    result = googlenews.result(sort=True)
    df = pd.DataFrame(result)
    return df

def start():
    start_date = input('Enter start date (DD-MM-YYYY): ')
    end_date = input('Enter end date (DD-MM-YYYY): ')
    search_query = input('Enter Search Query: ')

    return start_date, end_date, search_query

if __name__ == '__main__':
    start_date, end_date, search_query = start()
    google_news_df = google_News(start_date, end_date, search_query)
    download_News(google_news_df, news_name='Google_News')
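As a side note, the manual date splitting in google_News could be replaced with standard-library datetime parsing, which also validates the input. The helper name to_googlenews_date is hypothetical, not part of either package:

```python
from datetime import datetime

def to_googlenews_date(date_str):
    # Convert 'DD-MM-YYYY' input into the 'MM/DD/YYYY' form used for
    # GoogleNews' start/end parameters; raises ValueError on bad input
    # (e.g. '32-13-2020'), unlike plain string splitting.
    return datetime.strptime(date_str, '%d-%m-%Y').strftime('%m/%d/%Y')

print(to_googlenews_date('25-10-2020'))  # → 10/25/2020
```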

Output

[image]

HurinHu commented 3 years ago

Well, it is a solution, but it gets the images from each news article's own page, fetching the items one by one rather than from Google News directly. It is not ideal, since processing ten or more web requests just to get the images takes a long time. And if many pages are requested, there can be side effects, like being blocked by the website for fetching URLs too frequently, or having to wait even longer.

If anybody has this kind of need, the method may help, but be aware: set some delay between requests, or you might easily get blocked.
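The delay HurinHu recommends can be sketched as a small wrapper around whatever per-URL fetch function you use. The name polite_get and its parameters are hypothetical, not part of GoogleNews or newspaper3k:

```python
import random
import time

def polite_get(fetch, urls, min_delay=2.0, max_delay=5.0):
    # Call fetch(url) for each URL with a randomized pause in between,
    # to reduce the chance of being rate-limited or blocked.
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause needed before the very first request
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Usage sketch: fetch would normally wrap Article.download()/parse();
# here a dummy function stands in for it.
print(polite_get(lambda u: len(u), ['http://a.example', 'http://b.example']))
```

Randomizing the delay (rather than sleeping a fixed interval) makes the request pattern look a little less mechanical to rate limiters.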


rbshadow commented 3 years ago

Yes, you are right — that's why I asked earlier. The delay is also important, as you mentioned. Thanks @HurinHu for your great tool.


jacobhtye commented 2 years ago

@HurinHu I just added some comments to my pull request that you closed. Let me know if that makes any difference.