AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
480 stars 48 forks

'cp932' codec can't encode character '\u0388' in position 23: illegal multibyte sequence #86

Closed AndyTheFactory closed 9 months ago

AndyTheFactory commented 1 year ago

Issue by tensor5375 Thu Mar 16 07:12:09 2017 Originally opened as https://github.com/codelucas/newspaper/issues/346


I've tried this library and followed your tutorials, but newspaper raises a 'can't encode character' error when parsing. My code is below.

    import newspaper
    from newspaper import Config

    # `sites` and `max_sentences` are defined elsewhere in the script
    crawler_conf = Config()
    crawler_conf.MAX_SUMMARY = 500
    crawler_conf.MAX_SUMMARY_SENT = max_sentences
    crawler_conf.memoize_articles = False

    # crawl
    for site in sites:
        with open("dbg.txt", "w", encoding="utf-8") as f:
            # build the newspaper source for this site
            cnn_paper = newspaper.build(site, config=crawler_conf)
            cnt = 1
            for art in cnn_paper.articles:
                title = ""
                text = ""
                try:
                    # download() and parse() run synchronously, so no polling is needed
                    art.download()
                    art.parse()
                    title = art.title
                    text = art.text
                except Exception as exc:
                    # assumed handler: the posted snippet is cut off after the try body
                    f.write(f"{type(exc).__name__}: {exc}\n")

This error does not occur on English texts, and the library's implementation seems fragile. Can I avoid this problem?
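For context, the error in the title can be reproduced without newspaper at all. The sketch below is only an illustration, not a confirmed diagnosis of this report: it shows that the legacy Windows code page cp932 has no mapping for U+0388, and it shows two common ways to sidestep that, assuming the failure comes from text being written or printed with the platform default encoding. The names `sample`, `try_encode`, `written`, and `lossy` are made up for the example.

```python
import io

# The character from the issue title: GREEK CAPITAL LETTER EPSILON WITH TONOS.
sample = "Title with \u0388 in it"


def try_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`, else False."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp932_ok = try_encode(sample, "cp932")  # cp932 cannot represent U+0388
utf8_ok = try_encode(sample, "utf-8")   # UTF-8 can encode any code point

# Workaround 1: write through a stream with an explicit UTF-8 encoding.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(sample)
writer.flush()
written = buffer.getvalue().decode("utf-8")

# Workaround 2: keep the legacy code page but substitute unencodable characters.
lossy = sample.encode("cp932", errors="replace").decode("cp932")
```

If the exception is raised by `print()` on a console that uses cp932, calling `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7+) or setting the `PYTHONIOENCODING=utf-8` environment variable before starting the script are the usual equivalents of the first workaround. The traceback, which is not included in the report, would show which write is actually failing.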

AndyTheFactory commented 1 year ago

Comment by tensor5375 Sun Mar 19 22:04:04 2017


At first glance, it seems like a good module, but it cannot handle Unicode documents without problems. It raises many exceptions when parsing non-English documents. Contributors need to test for and fix these flaws if they want to make it great.

It mostly works well on English texts but fails to parse some sites (e.g. vice.com, times.com).