MickaelWalter / wp-json-scraper

Scrapes WordPress data using the WP-JSON API activated by default since WordPress 4.7
MIT License
97 stars 26 forks

Encoding error on Windows python3 #4

Closed: cdprf closed this issue 4 years ago

cdprf commented 4 years ago

python3 WPJsonScraper.py http://OOOOOO.com --export-posts /posts

[*] Testing connectivity with the server
[+] Connection OK
Number of entries: 991 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 635 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 7 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 2 |██████████████████████████████████████████████████████████████████████| 100.0%

Traceback (most recent call last):
  File "WPJsonScraper.py", line 360, in <module>
    main()
  File "WPJsonScraper.py", line 319, in main
    post_number = Exporter.export_posts(posts_list,
  File "C:\deneme\wp-json-scraper\lib\exporter.py", line 210, in export_posts
    post_file.write(buffer)
  File "C:\Python38\lib\encodings\cp1254.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6146-6152: character maps to <undefined>

MickaelWalter commented 4 years ago

Hi @cdprf,

It looks like Python on Windows uses your system's default encoding, Windows-1254 (the Turkish code page), when writing files. The site you are scraping seems to contain characters that cannot be encoded in Windows-1254.
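A minimal sketch of why the write fails, assuming the scraped posts contain characters outside the Windows-1254 repertoire. The sample string and the `can_encode` helper are hypothetical and only illustrate the behaviour; they are not taken from the scraper or from the site:

```python
import locale

# Encoding that open() falls back to when no encoding argument is given.
# On a Turkish Windows setup this is typically cp1254.
print(locale.getpreferredencoding(False))

# Hypothetical sample: Turkish letters exist in cp1254, the Japanese ones do not.
sample = "Türkçe metin, 日本語"


def can_encode(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("Türkçe metin", "cp1254"))
print(can_encode(sample, "cp1254"))
print(can_encode(sample, "utf-8"))
```

Writing `sample` to a file opened without an explicit encoding on such a system would raise the same kind of `UnicodeEncodeError` as in the traceback above.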

This is not expected. I assumed that the default charset for exported data would be UTF-8 (as the WP REST API output should be in almost every case).

I will commit a small patch that forces the output documents to UTF-8. In the future, it may be worth letting the user choose the output encoding.
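The idea can be sketched as follows. This is an illustration only, not the actual patch: `write_export` and `read_export` are hypothetical helpers, and the `encoding` parameter stands in for the user-selectable output encoding mentioned above:

```python
import os
import tempfile


def write_export(path, text, encoding="utf-8"):
    """Write text to path with an explicit encoding (hypothetical helper).

    Passing encoding to open() avoids falling back to the system
    code page (for example cp1254 on a Turkish Windows setup).
    """
    with open(path, "w", encoding=encoding) as post_file:
        post_file.write(text)


def read_export(path, encoding="utf-8"):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding=encoding) as post_file:
        return post_file.read()


# Usage: round-trip a string that cp1254 cannot represent.
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "posts.txt")
    write_export(export_path, "Türkçe metin, 日本語")
    round_trip = read_export(export_path)
```

With the encoding fixed at the `open()` call, the result no longer depends on the Windows locale of the machine running the scraper.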

Clone the master branch and let me know if this fixes the issue for you, thanks.

MickaelWalter commented 4 years ago

I assume that this problem is now fixed. If another similar encoding problem is found, reference this issue in a new one.