Closed by AndyTheFactory 9 months ago
Comment by tensor5375 Sun Mar 19 22:04:04 2017
At first glance this seems like a good module, but it cannot handle Unicode documents without problems. It throws many exceptions when parsing non-English documents. Contributors need to test for and fix these flaws if they want to make it great.
It works fairly well on English text, but it fails to parse some sites (e.g. vice.com, times.com).
Issue by tensor5375 Thu Mar 16 07:12:09 2017 Originally opened as https://github.com/codelucas/newspaper/issues/346
I tried this library, following your tutorials, but newspaper raises a 'can't encode character' error when parsing. My code is below.
    crawler_conf = Config()
    crawler_conf.MAX_SUMMARY = 500
    crawler_conf.MAX_SUMMARY_SENT = max_sentences
    crawler_conf.memoize_articles = False
    crawl
This error does not occur with English text, and the library's implementation seems very fragile. Can I avoid this problem?
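
For anyone hitting the same message: the traceback is not shown here, so the actual cause is unknown. One common source of "can't encode character" in Python, independent of newspaper, is writing non-ASCII text to a console or file whose encoding cannot represent it. The sketch below assumes that is what is happening. The helper `encode_safely`, the ASCII-only stream, and the sample title are all made up for illustration and are not part of newspaper.

```python
import io


def encode_safely(text, encoding):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes, so that writing it does not raise."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Hypothetical stand-in for a console or log file that only accepts ASCII.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Made-up sample text, not real newspaper output.
title = "뉴스 headline"

# Writing the text directly fails with UnicodeEncodeError,
# whose message contains "can't encode character(s)".
try:
    ascii_stream.write(title)
    raised = False
except UnicodeEncodeError:
    raised = True

# Escaping the unrepresentable characters first lets the write succeed.
ascii_stream.write(encode_safely(title, "ascii"))
ascii_stream.flush()
written = ascii_stream.buffer.getvalue()
```

If this is the cause, the cleaner fix is usually to make the output stream UTF-8 rather than to escape the text, for example by opening files with `open(path, "w", encoding="utf-8")` or by setting the `PYTHONIOENCODING=utf-8` environment variable before running the script.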