AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

MIT License

480 stars 48 forks

Issue by tensor5375, Thu Mar 16 07:12:09 2017. Originally opened as https://github.com/codelucas/newspaper/issues/346

I've tried this library and followed your tutorials, but newspaper raises a 'can't encode character' error when parsing. My code is below.

    crawler_conf = Config()
    crawler_conf.MAX_SUMMARY = 500
    crawler_conf.MAX_SUMMARY_SENT = max_sentences
    crawler_conf.memoize_articles = False

    # crawl

    for site in sites:
        with open("dbg.txt","w",encoding="utf-8") as f:
            #building newspaper
            cnn_paper = newspaper.build(site, config=crawler_conf)
            cnt = 1
            for art in cnn_paper.articles:                      
                title = ""
                text = ""
                try:
                    if(True):
                        #download
                        art.download()
                        while False == art.is_downloaded:
                            continue
                        art.parse()
                        while False == art.is_parsed:
                            continue
                        title = art.title
                        text = art.text

This error does not occur with English text, so the library's handling of non-ASCII text seems fragile. Can I avoid this problem?
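The issue does not include a traceback, so where the failing write happens is an assumption. A `'cp932' codec can't encode character` error is what Python raises when text is written to a stream whose encoding is cp932 (the default console and file encoding on Japanese Windows) and the text contains a character the codec lacks, such as U+0388. The sketch below uses only the standard library to reproduce the condition and show one way around it. The helper names `can_encode` and `write_lossy` are hypothetical and not part of newspaper.

```python
import io

SAMPLE = "\u0388 news"  # U+0388 is the character named in the error message


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the target codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_lossy(text: str, encoding: str = "cp932") -> bytes:
    """Write text through a stream that escapes unencodable characters.

    The BytesIO buffer stands in for a console or file whose encoding
    is cp932; errors="backslashreplace" keeps the write from raising.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors="backslashreplace")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep `raw` open after the wrapper goes away
    return data


print(can_encode(SAMPLE, "cp932"))   # strict cp932 cannot represent U+0388
print(can_encode(SAMPLE, "utf-8"))   # UTF-8 can represent any code point
print(write_lossy(SAMPLE))           # the character is escaped, not dropped
```

If the error comes from a `print()` call, calling `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7+) or setting the `PYTHONIOENCODING=utf-8` environment variable before running the script avoids the cp932 console encoding. If it comes from a file write, check that every `open()` call passes `encoding="utf-8"`, as the `dbg.txt` call above already does.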

'cp932' codec can't encode character '\u0388' in position 23: illegal multibyte sequence #86