brightmart / nlp_chinese_corpus

Large Scale Chinese Corpus for NLP
MIT License
9.41k stars 1.54k forks

JSON conversion #45

Open nissansz opened 1 year ago

nissansz commented 1 year ago

The JSON file stores the text as Unicode escape sequences: "text":"\u30a2\u30d5\u30ea\u30ab \u30a2\u30d5\u30ea\u30ab\uff08\u82f1\u00a0:

    lines1 = f1.read()
    lines1 = lines1.encode('utf-8').decode("unicode_escape")

print(path1+':'+line)

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 118-119: surrogates not allowed

How can I fix this error?
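A possible approach, offered as a sketch rather than a confirmed fix: the `unicode_escape` codec decodes each `\uXXXX` escape on its own, so a character written as a surrogate pair (for example `\ud83d\ude00`) ends up as two lone surrogates, which UTF-8 cannot encode. Parsing with `json.loads` avoids this, because the JSON decoder combines surrogate pairs into a single character. The code below assumes that each line of the file is a standalone JSON object with a `text` field, as the sample above suggests. The names `decode_json_line`, `iter_texts`, `merge_surrogates`, and `sample` are hypothetical, and the emoji in `sample` is made up for illustration.

```python
import json


def decode_json_line(line: str) -> str:
    """Parse one JSON object and return its "text" field as a normal str.

    json.loads decodes \\uXXXX escapes and merges surrogate pairs.
    """
    record = json.loads(line)
    return record["text"]


def iter_texts(path: str):
    """Yield the decoded "text" of every non-empty line (assumes JSON Lines)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield decode_json_line(line)


def merge_surrogates(s: str) -> str:
    """Fallback for a str that already contains lone surrogate pairs:
    round-trip through UTF-16 so each pair becomes one code point."""
    return s.encode("utf-16", "surrogatepass").decode("utf-16")


# Hypothetical sample line; the raw string keeps the backslash escapes intact.
sample = r'{"text": "\u30a2\u30d5\u30ea\u30ab \ud83d\ude00"}'
decoded = decode_json_line(sample)
```

If the text has already been run through `unicode_escape`, `merge_surrogates` may repair the surrogate pairs, but note that `unicode_escape` also interprets bytes as Latin-1, so any non-ASCII characters stored literally in the file would still be garbled. Parsing the JSON directly is likely the safer route.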