Open SpacaB opened 9 years ago
Good find @SpacaB!
I've done some investigation and may have clues. Newspaper finds news articles in the build() method by searching and filtering all URLs on the news homepage and all related category and feed pages.
If you look at the page source for "dealbook.nytimes.com/feed", you will see a bunch of URLs, some suffixed with "/feed/", as you mentioned, and some not suffixed.
All of the URLs with the "/feed" suffix are enclosed in a <wfw:commentRss> tag.
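To make the observation above concrete, here is a rough sketch of how the two kinds of URLs could be told apart when reading a feed. As far as I can tell, wfw:commentRss comes from the Well-Formed Web Comment API and points at the comment feed for an item, not at the article itself. The sample feed, the example.com URLs, and the split_feed_urls helper are all made up for illustration; none of this is Newspaper's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the wfw: prefix is normally bound to in RSS feeds.
WFW_NS = "http://wellformedweb.org/CommentAPI/"

# Hypothetical, trimmed-down feed; the real NYT feed has many more elements.
SAMPLE_FEED = """<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <channel>
    <item>
      <link>http://example.com/2014/01/01/some-article/</link>
      <wfw:commentRss>http://example.com/2014/01/01/some-article/feed/</wfw:commentRss>
    </item>
  </channel>
</rss>"""


def split_feed_urls(feed_xml):
    """Return (article_urls, comment_feed_urls) found in an RSS document."""
    root = ET.fromstring(feed_xml)
    # Plain <link> elements carry the article URLs.
    article_urls = [el.text.strip() for el in root.iter("link") if el.text]
    # <wfw:commentRss> elements carry the per-article comment feed URLs.
    comment_feed_urls = [
        el.text.strip()
        for el in root.iter("{%s}commentRss" % WFW_NS)
        if el.text
    ]
    return article_urls, comment_feed_urls


articles, comment_feeds = split_feed_urls(SAMPLE_FEED)
```

If this reading of the tag is right, the "/feed/" URLs are not broken article links at all; they are comment feeds that get picked up because every URL on the page is treated as a candidate article.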
I'm not familiar with the wfw protocol, so perhaps we should investigate more before deciding what to do.
No matter what we discover, the solution will be clumsy, because who expects URLs that point to incorrect locations? If we need to insert an if ... statement to determine when to remove the /feed/ suffix from a URL, that would seriously suck.
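For reference, a minimal sketch of what that kind of suffix check might look like. The strip_feed_suffix name and the example.com URLs are hypothetical and nothing here exists in Newspaper; it also assumes that a trailing "/feed" after a longer path is never part of a real article URL, which may not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_feed_suffix(url):
    """Drop a trailing "/feed" or "/feed/" path segment from an article URL.

    A path that is only "/feed" is left alone, since that is most likely
    the site's real feed page rather than a mangled article link.
    """
    parts = urlsplit(url)
    trimmed = parts.path.rstrip("/")
    if trimmed.endswith("/feed") and trimmed != "/feed":
        # Keep the slash that preceded "feed" so the article path stays intact.
        new_path = trimmed[: -len("feed")]
        return urlunsplit(parts._replace(path=new_path))
    return url


cleaned = strip_feed_suffix("http://example.com/2014/01/01/some-article/feed/")
untouched = strip_feed_suffix("http://example.com/feed")
```

The special case for a bare "/feed" path is exactly the kind of clumsiness mentioned above: the rule has to guess which "/feed" URLs are legitimate.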
@SpacaB http://developer.nytimes.com/docs/times_newswire_api/ Just FYI, if Newspaper isn't working you can use the NYT developer APIs to pull content down.
Hi! I'm really digging Newspaper so far.
I ran into an issue while trying to parse a bunch of articles from the New York Times. It seems that sometimes, an article will get parsed with a random "/feed/" appended to the end of the URL, causing it to look for an article on a bad URL, ultimately resulting in a title reading "404 not found". I don't seem to notice any pattern with these errors, but I attached a picture of part of the output from the code below:
As you can see, the red circles are where the link appends "/feed/" to the URL resulting in a 404 error. If you enter that link into Chrome and remove the "/feed/", it brings you to the actual article.
It seems to occur mostly on dealbook.nytimes.com and xxxx.blogs.nytimes.com URLs.
Any idea why this is happening?