codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Newspaper adding "/feed/" to URLs, resulting in parsing problems #100

Open SpacaB opened 9 years ago

SpacaB commented 9 years ago

Hi! I'm really digging Newspaper so far.

I ran into an issue while trying to parse a bunch of articles from the New York Times. It seems that sometimes, an article will get parsed with a random "/feed/" appended to the end of the URL, causing it to look for an article on a bad URL, ultimately resulting in a title reading "404 not found". I don't seem to notice any pattern with these errors, but I attached a picture of part of the output from the code below:

from newspaper import Source, news_pool

nyt_paper = Source('http://nytimes.com', memoize_articles=False)

print(nyt_paper.size())  # article count before build()
nyt_paper.build()
print(nyt_paper.size())  # article count after build()

# Download all articles, using 4 threads per source
papers = [nyt_paper]
news_pool.set(papers, threads_per_source=4)
news_pool.join()

# Parse the first 100 articles (fewer if the source has fewer)
articles = nyt_paper.articles[:100]
for article in articles:
    article.parse()

# Print out each article's URL and title
title_list = []
for article in articles:
    print(article.url)
    print(article.title)
    print("----------------------------------------------------")
    title_list.append(article.title)

[Screenshot: 404 error]

As you can see, the red circles are where the link appends "/feed/" to the URL resulting in a 404 error. If you enter that link into Chrome and remove the "/feed/", it brings you to the actual article.

It seems to occur mostly on dealbook.nytimes.com and xxxx.blogs.nytimes.com URLs.

Any idea why this is happening?

codelucas commented 9 years ago

Good find @SpacaB!

I've done some investigation and may have clues. Newspaper finds news articles in the build() method by searching and filtering all URLs on the news homepage and all related category and feed pages.

If you look at the page source for "dealbook.nytimes.com/feed", you will see a bunch of URLs, some suffixed with "/feed/", as you mentioned, and some not suffixed.

All of the URLs with the "/feed" suffix are enclosed in a <wfw:commentRss> tag. I'm not familiar with the wfw protocol, so perhaps we should investigate more before deciding what to do.
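
To make the distinction concrete, here is a minimal sketch that separates article links from comment-feed links in an RSS document. The sample feed, the example.com URLs, and the helper name split_feed_urls are made up for illustration and are not part of Newspaper. It also assumes the wfw prefix is bound to the Well-Formed Web CommentAPI namespace (http://wellformedweb.org/CommentAPI/), which is what WordPress-style feeds usually declare; the actual NYT feed may differ.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI that WordPress-style feeds bind to the "wfw" prefix.
WFW_NS = "http://wellformedweb.org/CommentAPI/"

# Hypothetical, trimmed-down feed; the example.com URLs are placeholders.
SAMPLE_FEED = """<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Some article</title>
      <link>http://blog.example.com/2015/01/01/some-article/</link>
      <wfw:commentRss>http://blog.example.com/2015/01/01/some-article/feed/</wfw:commentRss>
    </item>
  </channel>
</rss>"""


def split_feed_urls(feed_xml):
    """Return (article_urls, comment_feed_urls) found in an RSS string."""
    root = ET.fromstring(feed_xml)
    article_urls = []
    comment_feed_urls = []
    for item in root.iter("item"):
        # <link> holds the URL of the article itself.
        link = item.findtext("link")
        if link:
            article_urls.append(link.strip())
        # <wfw:commentRss> holds the URL of the article's comment feed.
        comment_rss = item.findtext("{%s}commentRss" % WFW_NS)
        if comment_rss:
            comment_feed_urls.append(comment_rss.strip())
    return article_urls, comment_feed_urls


if __name__ == "__main__":
    articles, comment_feeds = split_feed_urls(SAMPLE_FEED)
    print("articles:", articles)
    print("comment feeds:", comment_feeds)
```

If that assumption holds, the "/feed/" URLs are comment feeds rather than articles, so one option would be to ignore anything inside that tag while collecting article URLs.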

No matter what we discover, the solution will be clumsy, because who expects URLs that point to incorrect locations? If we need to insert an if ... statement to determine when to remove the /feed/ suffix from a URL, that would seriously suck.
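
For reference, the kind of suffix check described above might look roughly like the sketch below. strip_feed_suffix is a hypothetical helper, not an existing Newspaper function, and the URL in the usage lines is a made-up placeholder. It only handles a trailing "/feed" or "/feed/" path segment.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_feed_suffix(url):
    """Drop a trailing "/feed" or "/feed/" path segment from a URL.

    Hypothetical helper for illustration; not part of Newspaper.
    """
    parts = urlsplit(url)
    path = parts.path
    trimmed = path.rstrip("/")
    if trimmed.endswith("/feed"):
        # Cut off "feed" but keep the slash that preceded it.
        path = trimmed[:-len("feed")]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(strip_feed_suffix("http://blog.example.com/2015/01/01/some-article/feed/"))
    print(strip_feed_suffix("http://blog.example.com/2015/01/01/some-article/"))
```

The obvious downside is that it would also rewrite a genuine feed page such as "dealbook.nytimes.com/feed", or an article whose slug really is "feed", so it is safe to apply only to URLs already known to be article candidates.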

Kharms commented 9 years ago

@SpacaB http://developer.nytimes.com/docs/times_newswire_api/ Just FYI, if Newspaper isn't working you can use the NYT developer APIs to pull content down.