c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

URL UnicodeEncodeError #79

Open wkingnet opened 2 years ago

wkingnet commented 2 years ago

If a URL contains non-ASCII (Unicode) characters, Python raises a UnicodeEncodeError when the crawler requests it.

debug info:

    INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html
    DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)

Solution:

  1. Edit crawler.py and add the following imports at the top:

    import string
    from urllib.parse import quote

  2. Then search for the line current_url = self.urls_to_crawl.pop()

  3. Add a line below it that percent-encodes the URL:

    current_url = self.urls_to_crawl.pop()
    current_url = quote(current_url, safe=string.printable)
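To check what the added line does outside the crawler, here is a minimal standalone sketch. The helper name `encode_url` is hypothetical and not part of python-sitemap; it only wraps the same `quote` call. Because `string.printable` is passed as `safe`, every printable ASCII character (including `:`, `/` and `%`) is left as-is, and only non-ASCII characters are percent-encoded as UTF-8.

```python
import string
from urllib.parse import quote


def encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters; printable ASCII is left untouched.

    Hypothetical helper for illustration, not a function from crawler.py.
    """
    return quote(url, safe=string.printable)


raw = "https://gvo.wiki/html/NPC掉落書籍.html"
encoded = encode_url(raw)
print(encoded)

# The encoded URL is pure ASCII, so this no longer raises UnicodeEncodeError.
encoded.encode("ascii")
```

Since `%` counts as safe here, a URL that is already percent-encoded should pass through unchanged rather than being encoded twice. One possible drawback is that ASCII whitespace is also treated as safe, so a URL containing a literal space would not be fixed by this call.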