LiebeVerletzt opened 1 year ago
The source of `japanese_tokenization_cleaners` should be this (it assumes a module-level `tokenizer` and `_japanese_characters` pattern already defined in the cleaners file):

```python
import re

import pyopenjtalk
from unidecode import unidecode


def japanese_tokenization_cleaners(text):
    '''Pipeline for tokenizing Japanese text.'''
    words = []
    for token in tokenizer.tokenize(text):
        # Prefer the phonetic reading when the tokenizer provides one.
        if token.phonetic != '*':
            words.append(token.phonetic)
        else:
            words.append(token.surface)
    text = ''
    for word in words:
        if re.match(_japanese_characters, word):
            # Skip words starting with a long-vowel mark (ー).
            if word[0] == '\u30fc':
                continue
            if len(text) > 0:
                text += ' '
            # Convert Japanese words to phonemes via OpenJTalk.
            text += pyopenjtalk.g2p(word, kana=False).replace(' ', '')
        else:
            # Romanize everything else.
            text += unidecode(word).replace(' ', '')
    if re.match('[A-Za-z]', text[-1]):
        text += '.'
    return text
```
Yes, that's the issue. The cleaner here uses a different method; I'll add this one when I have time.
I added `japanese_tokenization_cleaners` myself, but the UnicodeDecodeError problem is still there. Is there a way to fix it?
My workaround for now is forcing the system to use UTF-8, but afterwards everything looks odd... (it does run, though)
The cleaner works fine on my side; could you check where the error comes from? It should be unrelated to the system encoding.
You only need to add that cleaner code to `cleaners.py` in the `text` folder, set the `text_cleaners` entry of the config file to `["japanese_tokenization_cleaners"]`, and fill in `symbols` with the following:

[" ", "!", "\"", "&", "*", ",", "-", ".", "?", "A", "B", "C", "I", "N", "U", "[", "]", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "r", "s", "t", "u", "w", "y", "z", "{", "}", "~"]
Windows 10's default encoding is not UTF-8, and I run the notebook via a batch file, so its encoding follows the system. That's why I changed the system encoding to UTF-8.
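Changing the system-wide encoding shouldn't be necessary: Python 3.7+ has a per-process UTF-8 mode (PEP 540) that can be enabled from the batch file instead. A minimal check, with the batch-file lines shown as comments (the script name is a placeholder):

```python
import locale
import sys

# Without UTF-8 mode, open() with no explicit encoding uses the ANSI code
# page on Windows (e.g. cp936/GBK on a Chinese-locale system).
print("default text encoding:", locale.getpreferredencoding(False))

# UTF-8 mode overrides that for the whole process. Enable it in the batch
# file with:
#     set PYTHONUTF8=1
# or launch the script as:
#     python -X utf8 your_script.py
print("UTF-8 mode on:", bool(sys.flags.utf8_mode))
```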
Changing line 64 of utils.py to: `with open(config_path, "r", encoding='utf-8') as f:` solved it. The cause is that some models' `symbols` contain characters that are invalid in GBK; it has nothing to do with which encoding cmd uses. Looking back, last year's me was such an idiot x_x
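The failure mode can be reproduced without any model: a UTF-8 config containing a character outside GBK (here `…`, one of whose UTF-8 bytes is 0xA6, the byte named in the traceback) fails when read with the GBK codec, but reads fine once the encoding is passed explicitly. A sketch:

```python
import os
import tempfile

# A config snippet with characters whose UTF-8 byte sequences are not valid
# GBK in this context.
data = '{"symbols": ["…", "↑", "↓"]}'

fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(data)

# Reading with GBK (what a Chinese-locale Windows defaults to) fails:
reproduced = False
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
except UnicodeDecodeError:
    reproduced = True

# The fix from the comment above: pass encoding='utf-8' explicitly.
with open(path, "r", encoding="utf-8") as f:
    roundtrip = f.read()

os.remove(path)
assert reproduced and roundtrip == data
```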
The error is: `UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1337: illegal multibyte sequence`. After looking into it, the cause should be that the default decoding is not UTF-8. The symbols I expect to import are as follows (the original paste had `pad = ''` but used `_pad`; fixed here):

```python
def changeSymbols(self, type):
    if type == 1:
        self.symbols = list(' !"&*,-.?ABCINU[]abcdefghijklmnoprstuwyz{}~')
        self.type = 1
    elif type == 2:
        _pad = ''
        _punctuation = ',.!?-'
        _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ '
        self.symbols = [_pad] + list(_punctuation) + list(_letters)
        self.type = 2
    elif type == 3:
        _pad = ''
        _punctuation = ',.!?-~…'
        _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧʦ↓↑ '
        self.symbols = [_pad] + list(_punctuation) + list(_letters)
        self.type = 3
    elif type == 4:
        _pad = ''
        _punctuation = ';:,.!?¡¿—…"«»“” '
        _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
        _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
        self.symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)
        self.type = 4
```
Besides that, after importing the `type == 1` set via the normal conversion, i.e. `"symbols": [" ","!","\"","&","*",",","-",".","?","A","B","C","I","N","U","[","]","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","r","s","t","u","w","y","z","{","}","~"]`, the following problem also appeared:
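One way to avoid hand-converting these symbol sets, and to sidestep the GBK issue entirely, is to generate the `"symbols"` entry with `json.dumps` and its default `ensure_ascii=True`, which escapes every non-ASCII character as `\uXXXX` so the config decodes under any codec. A sketch (the dict shape is illustrative, not the project's config schema):

```python
import json

# type == 1 set from the comment above: plain ASCII already.
symbols_ascii = list(' !"&*,-.?ABCINU[]abcdefghijklmnoprstuwyz{}~')

# type == 2 set includes characters (ʃ ʧ ↓ ↑) that GBK cannot represent.
symbols_ipa = [''] + list(',.!?-') + list('AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ ')

# ensure_ascii=True (the default) escapes non-ASCII as \uXXXX, so the output
# is pure ASCII and survives being read back under any encoding, GBK included.
escaped = json.dumps({"symbols": symbols_ipa})
assert escaped.isascii()

# json.loads restores the original characters regardless of file encoding:
assert json.loads(escaped)["symbols"] == symbols_ipa
```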
Improvements I'd like to see: support for non-GBK characters, support for `japanese_tokenization_cleaners`, and guidance on importing other non-standard models.