Open bmelton opened 10 years ago
"In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values."
What does that mean? Are you saying that the news link extraction works but the body text extraction fails sometimes? If so, in the new release you will be able to specify custom rules for domains or urls.
Has a hard time digesting all of the article, so it cuts off about half: https://medium.com/@iwasrobbed/the-possibility-of-the-impossible-57d142cc3dd3
Also seeing an issue with TechCrunch articles that have a standard <ul><li>asdf</li></ul> unordered list element in them. For whatever reason, it doesn't maintain that HTML code in clean_top_node and just strips it out.
@iwasrobbed The issue you reported on this URL: http://techcrunch.com/2011/06/13/tech-giant-eats-your-lunch/ has been fixed in: https://github.com/codelucas/newspaper/pull/106
It will be pushed and deployed to the Python 2 and 3 branches ASAP.
Your issue with your blog post (nice post btw) is currently out of our scope. (The beginning and ending chunks of your article weren't even in the top_node.) I'm trying to improve FT extraction by tweaking the outputformatters and cleaners for now, and will mess with the top_node calculation code later, as it's a lot more sensitive.
Nice work, @codelucas!
Side note: I'm working with some others on an open source scraper as well, but more based on the original Instapaper recipes for each domain where you just specify xpaths instead of using heuristics. We're using Newspaper as a secondary scraper when we don't have a recipe or certain data. If it ever helps, here is more info about implementation: https://assembly.com/saulify/bounties/30 and https://github.com/asm-products/saulify-web/pull/7/files
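For illustration, the recipe-first approach with a heuristic fallback might look roughly like the sketch below. This is not the Saulify or Newspaper code: the `RECIPES` table, the `extract` function, and the `fallback` callable are hypothetical names, the standard library's ElementTree (limited XPath, well-formed markup only) stands in for a real HTML parser, and `fallback` is a placeholder for whatever secondary scraper is used.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# Hypothetical per-domain recipes: field name -> XPath expression.
# Only the XPath subset supported by ElementTree is used here.
RECIPES: Dict[str, Dict[str, str]] = {
    "example.com": {
        "title": ".//h1[@class='headline']",
        "body": ".//div[@id='story']",
    },
}


def extract(
    url: str,
    markup: str,
    fallback: Callable[[str, str], Dict[str, str]],
) -> Dict[str, str]:
    """Apply the domain's recipe; defer to `fallback` when there is no
    recipe for the domain or the recipe fails to find a field."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]

    recipe = RECIPES.get(domain)
    if recipe is None:
        # No recipe for this domain: use the secondary (heuristic) scraper.
        return fallback(url, markup)

    # Assumption: the markup is well-formed enough for ElementTree to parse.
    root = ET.fromstring(markup)
    result: Dict[str, str] = {}
    for field, xpath in recipe.items():
        node = root.find(xpath)
        if node is None:
            # The recipe is missing data for this page: fall back as well.
            return fallback(url, markup)
        # Collect all nested text, collapsing surrounding whitespace.
        parts = (t.strip() for t in node.itertext())
        result[field] = " ".join(p for p in parts if p)
    return result
```

In a setup like the one described, `fallback` would wrap the heuristic scraper, so recipes only need to exist for the domains where the heuristics are known to fall short.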
Craigslist entries cause errors.
Hi, it doesn't seem to work with these: http://www.risingkashmir.com/news-archive/01-August-2016 and http://www.risingkashmir.com/
Any suggestions will be helpful. Thanks.
I've got a running list of URLs that newspaper doesn't work phenomenally against. Is there an open issue to catalogue these? In most cases, it's able to grab the list of articles from the home page, but completely unable to decipher each individual article into readable values.
For example, this link gets basically nothing: http://www.empireonline.com/news/story.asp?NID=40344