codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.1k stars 2.11k forks

Sites it doesn't work on #43

Open bmelton opened 10 years ago

bmelton commented 10 years ago

I've got a running list of URLs that newspaper doesn't work well against. Is there an open issue to catalogue these? In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values.

For example, this link gets basically nothing: http://www.empireonline.com/news/story.asp?NID=40344

codelucas commented 10 years ago

"In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values."

What does that mean? Are you saying that the news link extraction works but the body text extraction fails sometimes? If so, in the new release you will be able to specify custom rules for domains or URLs.
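As a rough illustration of what "custom rules for domains or URLs" could look like (this is a sketch with invented names, not newspaper's actual API): map a domain to an XPath for the article body, and fall back to a generic grab-every-paragraph heuristic otherwise.

```python
# Hypothetical per-domain extraction rules; CUSTOM_RULES and extract_body
# are made-up names for illustration, not part of newspaper.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical recipe table: domain -> XPath selecting body paragraphs.
CUSTOM_RULES = {
    "www.empireonline.com": ".//div[@id='article']/p",
}

def extract_body(url: str, page_html: str) -> str:
    """Use a domain-specific rule when one exists, else grab every <p>."""
    root = ET.fromstring(page_html)  # assumes well-formed markup for the demo
    xpath = CUSTOM_RULES.get(urlparse(url).netloc, ".//p")
    return "\n".join("".join(p.itertext()).strip() for p in root.findall(xpath))

sample = ("<html><body><div id='article'>"
          "<p>Story text.</p><p>More text.</p></div></body></html>")
print(extract_body("http://www.empireonline.com/news/story.asp?NID=40344", sample))
```

The point of the table is that a hand-written rule for a problem domain can override the heuristics without touching the general extraction code.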

iwasrobbed commented 9 years ago

Newspaper has a hard time digesting all of this article, so it cuts off about half of it: https://medium.com/@iwasrobbed/the-possibility-of-the-impossible-57d142cc3dd3

iwasrobbed commented 9 years ago

Also seeing an issue with TechCrunch articles that have a standard <ul><li>asdf</li></ul> unordered list element in them. For whatever reason, it doesn't maintain that HTML code in clean_top_node and just strips it out.
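The symptom can be reproduced in miniature (this is a toy stand-in, not newspaper's actual cleaner code): a cleaner that drops <ul> subtrees wholesale loses the list text entirely instead of keeping it in the body.

```python
# Minimal reproduction of an over-eager cleaner stripping list elements;
# clean_drop_lists is an invented name for illustration.
import xml.etree.ElementTree as ET

def clean_drop_lists(doc_html: str) -> str:
    """Strip every <ul> subtree, the way an over-eager cleaner might."""
    root = ET.fromstring(doc_html)
    for parent in list(root.iter()):  # materialize before mutating the tree
        for ul in parent.findall("ul"):
            parent.remove(ul)
    return "".join(root.itertext()).strip()

article = "<div><p>Intro.</p><ul><li>asdf</li></ul><p>Outro.</p></div>"
print(clean_drop_lists(article))  # the "asdf" bullet is gone entirely
```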

codelucas commented 9 years ago

@iwasrobbed The issue you reported with this URL: http://techcrunch.com/2011/06/13/tech-giant-eats-your-lunch/

Has been fixed in: https://github.com/codelucas/newspaper/pull/106

It will be pushed and deployed into the Python 2 and 3 branches ASAP.

Your issue with your blog post (nice post, btw) is currently out of our scope. (The beginning and ending chunks of your article weren't even in the top_node.) I'm trying to improve full-text extraction by tweaking the outputformatters and cleaners for now; I'll mess with the top_node calculation code later, as it's a lot more sensitive.
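To make the failure mode concrete, here is a toy illustration of why top_node selection is sensitive (assumed heuristic for illustration only, not newspaper's actual algorithm): if you score candidate nodes by how much paragraph text they contain and keep only the winner, intro and outro chunks living outside that node are silently dropped.

```python
# Toy top-node selection: pick the <div> with the most paragraph text.
# pick_top_node is an invented name; newspaper's real scoring differs.
import xml.etree.ElementTree as ET

def pick_top_node(doc_html: str):
    root = ET.fromstring(doc_html)
    def score(node):
        # Total character count of the node's direct <p> children.
        return sum(len("".join(p.itertext())) for p in node.findall("p"))
    return max(root.iter("div"), key=score)

doc = ("<body>"
       "<div id='lede'><p>Short opening.</p></div>"
       "<div id='body'><p>A much longer middle section of the article "
       "with most of the words.</p></div>"
       "</body>")
print(pick_top_node(doc).get("id"))  # 'body' wins; the lede div is left behind
```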

iwasrobbed commented 9 years ago

Nice work, @codelucas!

Side note: I'm working with some others on an open-source scraper as well, but it's based more on the original Instapaper recipes for each domain, where you just specify XPaths instead of using heuristics. We're using Newspaper as a secondary scraper when we don't have a recipe or are missing certain data. If it ever helps, here is more info about the implementation: https://assembly.com/saulify/bounties/30 and https://github.com/asm-products/saulify-web/pull/7/files
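The recipe-first, heuristic-second pipeline described above can be sketched like this (function and recipe names are invented for illustration; see the linked Saulify PR for the real implementation, which uses Newspaper itself as the fallback):

```python
# Recipe-first scraping with a heuristic fallback; all names are invented.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def recipe_extract(url, page_html, recipes):
    """Primary scraper: an exact per-domain XPath, like an Instapaper recipe."""
    xpath = recipes.get(urlparse(url).netloc)
    if xpath is None:
        return None
    root = ET.fromstring(page_html)
    return " ".join("".join(n.itertext()) for n in root.findall(xpath))

def heuristic_extract(page_html):
    """Secondary scraper standing in for Newspaper: grab every paragraph."""
    root = ET.fromstring(page_html)
    return " ".join("".join(p.itertext()) for p in root.findall(".//p"))

def scrape(url, page_html, recipes):
    # Falls back when there is no recipe for the domain.
    return recipe_extract(url, page_html, recipes) or heuristic_extract(page_html)

recipes = {"example.com": ".//div[@class='post']/p"}
page = ("<html><body><div class='post'><p>Body.</p></div>"
        "<p>Nav junk</p></body></html>")
print(scrape("http://example.com/a", page, recipes))  # recipe hit: clean body
print(scrape("http://other.com/b", page, recipes))    # fallback: noisier result
```

The design choice is the usual precision/recall trade: recipes are precise but only cover domains someone has written one for, while the heuristic covers everything at lower quality.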

bendavies commented 9 years ago

@codelucas http://blogs.wsj.com/moneybeat/2015/02/23/apple-is-now-more-than-double-the-size-of-exxon-and-everyone-else/

austincondiff commented 9 years ago

Craigslist entries cause errors.

zairahms commented 8 years ago

Hi, it doesn't seem to work with these: http://www.risingkashmir.com/news-archive/01-August-2016 http://www.risingkashmir.com/

Any suggestions would be helpful. Thanks.