Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0

Encoding issues when pushing a kernel #186

Open dk-forestry opened 5 years ago

dk-forestry commented 5 years ago

I'm running into an error when trying to push a kernel via the API (Python 3.6, kaggle API 1.5.3).

This code file works:

print('hello world')

This code file:

print('ẁ')

produces this error:

'charmap' codec can't decode byte 0x81 in position 9: character maps to <undefined>
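The message can be reproduced outside the API with a minimal sketch. It assumes the script was saved as UTF-8 and that the file was read with a Windows locale encoding; cp1252 is used here only as a stand-in, since the original report does not name the locale. The helper try_decode is hypothetical and not part of the kaggle package. In UTF-8, 'ẁ' (U+1E81) is the three bytes E1 BA 81, and the last of them sits at offset 9 of the file, which matches the position in the error.

```python
# Bytes of the failing script as saved in UTF-8: 'ẁ' (U+1E81) becomes E1 BA 81.
source_bytes = "print('ẁ')".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# Assumption: cp1252 stands in for the reporter's Windows locale encoding.
cp1252_result = try_decode(source_bytes, "cp1252")

# Decoding the same bytes as UTF-8 succeeds.
utf8_result = try_decode(source_bytes, "utf-8")

print(cp1252_result)
print(utf8_result)
```

The same bytes decode cleanly as UTF-8, which suggests the file itself is fine and only the encoding used to read it is wrong.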

The issue seems to be similar to #146. It is rather critical for NLP tasks, because 'weird' characters are often hardcoded in the script (e.g. the two most-upvoted public kernels in the current Jigsaw competition can't be submitted via the API because of this issue).

I also posted this question in the Kaggle forum, but haven't received an answer yet. I will cross-post any responses I get (either here or on Kaggle).

isshiki commented 2 years ago

This problem also seems to occur in the Japanese Windows environment.

For example, the default system encoding on Japanese Windows is 'cp932'. In this environment, opening and reading a file with the statement with open(code_file) as f: raises an error like "'cp932' codec can't decode byte 0xef". See https://github.com/Kaggle/kaggle-api/blob/master/kaggle/api/kaggle_api_extended.py#L1868

It may be more appropriate to open the file with with open(code_file, encoding='utf-8') as f: instead. If Python scripts and notebooks are assumed to be UTF-8 encoded, I don't think this change would have any negative impact, and I would like to see it fixed.
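A minimal sketch of the suggested change, assuming the code file is UTF-8 encoded. The helper read_code_file and the temporary script.py are hypothetical names used for illustration; this is not the actual code in kaggle_api_extended.py. Passing encoding='utf-8' makes the read independent of the operating system's locale encoding.

```python
import os
import tempfile


def read_code_file(path: str) -> str:
    """Hypothetical helper: read a script as UTF-8 regardless of the OS locale."""
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Write a small UTF-8 script containing a non-ASCII character.
    code_file = os.path.join(tmp, "script.py")
    with open(code_file, "w", encoding="utf-8") as f:
        f.write("print('ẁ')\n")

    # The explicit encoding avoids falling back to cp932, cp1252, etc.
    script_source = read_code_file(code_file)

print(script_source)
```

Without the encoding argument, open() in text mode uses the locale's preferred encoding, which is why the same file can work on one machine and fail on another.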