c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

UnicodeDecodeError possibly with Scandinavian letters #33

Closed: ghost closed this issue 7 years ago

ghost commented 7 years ago

Command: python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.books.2globalnomads.info --image --output sitemap.xml

Output: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte

Along with multiple errors of this kind: HTTP Error 404: Not Found

ghost commented 7 years ago

Same issue with Scandinavian letters here:

python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.xetnet.fi --image --output sitemap.xml --verbose

c4software commented 7 years ago

For the first website, it's because of the ebooks: the crawler opens the « target » URI, but the content is not really navigable (it is not an HTML page), so decoding it fails.

A fix will be included in the next commit.
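To illustrate the failure mode described above, here is a minimal sketch of one way a crawler could guard against non-navigable content. This is not the actual fix that went into the repository; the function name `decode_body` and its parameters are hypothetical, and it assumes the server sends a meaningful Content-Type header.

```python
from typing import Optional


def decode_body(raw: bytes, content_type: str, charset: str = "utf-8") -> Optional[str]:
    """Return the page text, or None when the response is not HTML.

    Hypothetical helper for illustration; not part of python-sitemap.
    """
    # Binary downloads (ebooks, images, archives) are not navigable,
    # so skip them instead of trying to decode them as text.
    if "text/html" not in content_type.lower():
        return None
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Bad bytes or an unknown charset name: fall back to UTF-8 and
        # replace undecodable bytes rather than crashing the crawl.
        return raw.decode("utf-8", errors="replace")
```

With this guard, a response such as an ebook download is skipped, while an HTML page containing a stray byte like 0x92 is still parsed, with that byte replaced by U+FFFD.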

For « xetnet.fi », I am starting the crawler right now; I will check this case when the error appears.

ghost commented 7 years ago

OK. Those are likely due to Scandinavian special letters such as ö and ä. To handle them, it is crucial that URLs be treated as Unicode.
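As a sketch of what handling such letters in URLs can look like, the snippet below percent-encodes non-ASCII characters as UTF-8 before a request is made. The helper name `to_ascii_url` and the example.fi URL are hypothetical, and this is not necessarily how python-sitemap does it. It only touches the path and query; internationalized host names are left as-is.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration; the host name is not modified.
    """
    parts = urlsplit(url)
    # quote() encodes characters as UTF-8 bytes by default.
    # "%" is kept safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(to_ascii_url("https://www.example.fi/sähköposti?q=ö"))
# prints https://www.example.fi/s%C3%A4hk%C3%B6posti?q=%C3%B6
```

URLs that are already plain ASCII pass through unchanged.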

c4software commented 7 years ago

I don’t have the issue anymore with the « xetnet.fi » website.

I also tested with French letters « é, à, ô, … » and it seems OK.