cdprf closed this issue 4 years ago
Hi @cdprf,
It looks like, on Windows, Python falls back to your system's default encoding (Windows-1254, the Turkish code page) when writing files. The site you are scraping seems to contain characters that cannot be encoded in Windows-1254.
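To illustrate the mechanism, here is a minimal sketch (not taken from the project code) of how Python picks its default text encoding and how a character outside Windows-1254 produces the same kind of error. The sample characters are arbitrary choices for illustration.

```python
import locale

# Encoding that open() uses in text mode when none is passed explicitly
# (cp1254 on a Turkish Windows install, often UTF-8 on other systems).
print(locale.getpreferredencoding(False))

# Turkish letters (g-breve, dotless i, s-cedilla) belong to Windows-1254,
# so they encode without trouble.
turkish = "\u011f\u0131\u015f".encode("cp1254")

# A character outside the code page (a CJK character, chosen arbitrarily)
# cannot be encoded and raises UnicodeEncodeError.
try:
    "\u4e2d".encode("cp1254")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```

So Turkish text on its own would not fail here; the error most likely comes from other characters in the scraped posts.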
This is not expected. I assumed that the default charset for exporting data would be UTF-8 (as the WP REST API output should be in almost every case).
I will commit a small patch that forces output documents to be written in UTF-8. In the future, it may be worth letting the user choose the output encoding.
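As a rough sketch of that kind of fix (not the actual patch), the idea is to pass an explicit encoding to `open()` instead of relying on the system default. The `write_export` helper, its `encoding` parameter and the file name are hypothetical and only serve the example.

```python
import os
import tempfile


def write_export(path, text, encoding="utf-8"):
    """Hypothetical helper: write text using an explicit encoding."""
    # Passing encoding= overrides the platform default (e.g. cp1254).
    with open(path, "w", encoding=encoding) as post_file:
        post_file.write(text)


# Sample text mixing a Turkish letter with characters outside Windows-1254.
sample = "Merhaba d\u00fcnya \u4e2d\u6587"
out_path = os.path.join(tempfile.mkdtemp(), "posts.txt")
write_export(out_path, sample)

# Read the file back with the same encoding to check the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Keeping the encoding as a parameter with a UTF-8 default would also leave room for a user-selectable output encoding later.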
Clone the master branch and let me know if this fixes the issue for you, thanks.
I assume that this problem is now fixed. If you run into a similar encoding problem, please open a new issue and reference this one.
python3 WPJsonScraper.py http://OOOOOO.com --export-posts /posts
[*] Testing connectivity with the server
[+] Connection OK
Number of entries: 991 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 635 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 7 |██████████████████████████████████████████████████████████████████████| 100.0%
Number of entries: 2 |██████████████████████████████████████████████████████████████████████| 100.0%
Traceback (most recent call last):
  File "WPJsonScraper.py", line 360, in <module>
    main()
  File "WPJsonScraper.py", line 319, in main
    post_number = Exporter.export_posts(posts_list,
  File "C:\deneme\wp-json-scraper\lib\exporter.py", line 210, in export_posts
    post_file.write(buffer)
  File "C:\Python38\lib\encodings\cp1254.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6146-6152: character maps to <undefined>