codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.17k stars 2.12k forks

Iterating articles on news source produces duplicates, if subdomain omitted. #580

Open awiebe opened 6 years ago

awiebe commented 6 years ago

While testing news sources, I found that the article below was emitted twice, even though newspaper should be memoizing articles. The problem seems to be that memoization compares the raw URL string and doesn't account for the second source omitting the www subdomain.

https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/
Trump’s Right-Hand Troll
['Mckay Coppins']
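import newspaper

def dump_article(a):
    """Download and parse an article, print its title and authors; return True on success."""
    try:
        a.download()
        a.parse()
        print(a.title)
        print(a.authors)
        # print(a.text)
        return True
    except Exception:  # e.g. a failed download or parse; skip this article
        return False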

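MAX_PULL = 10  # maximum number of articles to dump per source

for source in newspaper.popular_urls():
    print(source)
    pull = 0
    # Config exposes the attribute `language`; an unknown keyword such as `lang` is ignored.
    s = newspaper.build(source, language='en')
    for a in s.articles:
        print(a.url)
        if dump_article(a):
            pull += 1
        if pull >= MAX_PULL:
            break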
minhdanh commented 3 years ago

Having the same problem here in 2021. @awiebe, have you by any chance found a solution?
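
One possible workaround, as a minimal sketch rather than a fix inside newspaper itself: deduplicate on a normalized key instead of the raw URL. The helpers `canonical_key` and `unique_by_url` below are hypothetical names, not part of the newspaper API. The sketch assumes that the www and non-www hosts serve the same article, and that http and https variants are equivalent, which holds for the two URLs reported above but may not hold for every site.

```python
from urllib.parse import urlsplit


def canonical_key(url):
    """Build a comparison key that ignores the scheme, a leading 'www.',
    host letter case, and a trailing slash on the path.

    Hypothetical helper; assumes www/non-www and http/https are equivalent.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    # Keep the query string, since some sites identify articles by it.
    return (host, path, parts.query)


def unique_by_url(urls):
    """Return the URLs in their original order, keeping the first of each
    group that shares a canonical key."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "https://www.theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
    "http://theatlantic.com/politics/archive/2018/05/stephen-miller-trump-adviser/561317/",
]
print(unique_by_url(urls))
```

In the loop from the original report, the same idea could be applied by keeping a `seen` set of `canonical_key(a.url)` values and skipping any article whose key is already present before calling `dump_article(a)`.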