codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

I get back the same description on all these links, although they are clearly different. #940

Open gaurav-95 opened 2 years ago

gaurav-95 commented 2 years ago

They all return the same description for me, both when running in a local environment on Windows and on an EC2 instance running Linux.

["https://fortune.com/2022/05/01/germany-says-dependence-on-russian-oil-could-end-in-late-summer/", "https://fortune.com/2022/05/01/raytheon-union-reach-labor-deal-at-key-jet-engine-plants/", "https://fortune.com/2022/05/01/bored-ape-metaverse-frenzy-raises-millions-crashes-ethereum/", "https://fortune.com/2022/05/01/buffett-lures-omaha-disciples-with-stock-buys-inflation-warning/", "https://fortune.com/2022/05/01/omicron-sublineages-can-evade-antibodies-from-earlier-infections-south-africa/", "https://fortune.com/2022/05/01/biden-roasts-trump-gop-himself-at-correspondents-dinner/", "https://fortune.com/2022/05/01/amazon-union-face-off-rematch-election-new-york/", "https://fortune.com/2022/05/01/damage-done-dont-know-health-experts-slow-to-criticize-fauci-but-quick-to-correct-his-claim-that-we-are-out-of-the-pandemic/", "https://fortune.com/2022/05/01/evidence-mounts-gop-involvement-in-trump-election-schemes/", "https://fortune.com/2022/05/01/house-speaker-nancy-pelosi-meets-with-ukraines-zelenskiy-kyiv-gregory-meeks-jason-crow/", "https://fortune.com/2022/05/01/china-contagion-threatens-to-derail-the-worlds-emerging-markets/", "https://fortune.com/2022/05/01/jim-ratcliffes-5-billion-chelsea-bid-too-low-times-says/", "https://fortune.com/2022/05/01/powells-fed-set-to-go-big-keep-going-until-inflation-tamed/"]

For all these articles I get the description as:

‘Was damage done? I don’t know’: Health experts are slow to criticize Fauci but quick to correct his claim that we are ‘out of the pandemic’

This is my code:

from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
config.request_timeout = 30
config.fetch_images = False

def get_article(link):
    article = Article(link, config=config)
    article.download()
    article.parse()
    text = article.text

    print("-"*50,"\nTEXT FROM URL: \n", text)
    return text

I ran a for loop over the list of articles above and saw the same output for every URL. Note that they are all fortune.com articles. I'm really confused by this, so any help would be appreciated. This started on 1 May 2022, if that is of any help.
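For anyone trying to reproduce this, here is a minimal sketch of the loop described above. It groups URLs by a hash of the extracted text, so identical output across different links is easy to spot. The helper names `fetch_text` and `group_by_text` are made up for this illustration and are not part of newspaper; only `Article`, `download()`, `parse()`, and `.text` come from the library, and the default config is assumed.

```python
import hashlib
from collections import defaultdict


def fetch_text(url):
    """Download and parse one article, returning its extracted text."""
    # Imported here so the grouping helper below can be used (and tested)
    # without newspaper installed or network access.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def group_by_text(urls, fetch=fetch_text):
    """Group URLs whose extracted text is byte-for-byte identical.

    `fetch` is a hypothetical hook: any callable mapping a URL to a string.
    Returns a list of URL lists, one list per distinct text.
    """
    groups = defaultdict(list)
    for url in urls:
        digest = hashlib.sha256(fetch(url).encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return list(groups.values())


if __name__ == "__main__":
    # Two of the URLs from the list above; extend as needed.
    urls = [
        "https://fortune.com/2022/05/01/germany-says-dependence-on-russian-oil-could-end-in-late-summer/",
        "https://fortune.com/2022/05/01/amazon-union-face-off-rematch-election-new-york/",
    ]
    for group in group_by_text(urls):
        if len(group) > 1:
            print("Identical text returned for:", *group, sep="\n  ")
```

If every URL ends up in a single group, the extraction is returning the same content regardless of the page requested, which matches the behaviour reported here.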