LiebeVerletzt opened 1 year ago
The source of `japanese_tokenization_cleaners` should be this (it assumes a module-level `tokenizer` and `_japanese_characters` pattern already defined in the cleaners file):

```python
import re

import pyopenjtalk
from unidecode import unidecode


def japanese_tokenization_cleaners(text):
    '''Pipeline for tokenizing Japanese text.'''
    words = []
    for token in tokenizer.tokenize(text):
        # Prefer the phonetic reading when the tokenizer provides one.
        if token.phonetic != '*':
            words.append(token.phonetic)
        else:
            words.append(token.surface)
    text = ''
    for word in words:
        if re.match(_japanese_characters, word):
            # Skip words starting with a long-vowel mark (ー).
            if word[0] == '\u30fc':
                continue
            if len(text) > 0:
                text += ' '
            # Convert Japanese words to phonemes via OpenJTalk.
            text += pyopenjtalk.g2p(word, kana=False).replace(' ', '')
        else:
            # Romanize everything else.
            text += unidecode(word).replace(' ', '')
    if re.match('[A-Za-z]', text[-1]):
        text += '.'
    return text
```
Yes, that's the issue. The cleaner here uses a different method; I'll add this one when I have time.
I added `japanese_tokenization_cleaners` myself, but the UnicodeDecodeError problem is still there. Is there a way to fix it?
My workaround for now is forcing the system to use UTF-8, but afterwards everything looks odd... (it does run, though)
The cleaner works fine on my side; could you check where the error comes from? It should be unrelated to the system encoding.
You only need to add that cleaner code to `cleaners.py` in the `text` folder, set the `text_cleaners` entry of the config file to `["japanese_tokenization_cleaners"]`, and fill in `symbols` with the following:

[" ", "!", "\"", "&", "*", ",", "-", ".", "?", "A", "B", "C", "I", "N", "U", "[", "]", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "r", "s", "t", "u", "w", "y", "z", "{", "}", "~"]
Windows 10's default encoding is not UTF-8, and I run the notebook via a batch file, so its encoding follows the system. That's why I changed the system encoding to UTF-8.
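Changing the system-wide encoding shouldn't be necessary: Python 3.7+ has a per-process UTF-8 mode (PEP 540) that can be enabled from the batch file instead. A minimal check, with the batch-file lines shown as comments (the script name is a placeholder):

```python
import locale
import sys

# Without UTF-8 mode, open() with no explicit encoding uses the ANSI code
# page on Windows (e.g. cp936/GBK on a Chinese-locale system).
print("default text encoding:", locale.getpreferredencoding(False))

# UTF-8 mode overrides that for the whole process. Enable it in the batch
# file with:
#     set PYTHONUTF8=1
# or launch the script as:
#     python -X utf8 your_script.py
print("UTF-8 mode on:", bool(sys.flags.utf8_mode))
```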
Changing line 64 of utils.py to: `with open(config_path, "r", encoding='utf-8') as f:` solved it. The cause is that some models' `symbols` contain characters that are invalid in GBK; it has nothing to do with which encoding cmd uses. Looking back, last year's me was such an idiot x_x
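The failure mode can be reproduced without any model: a UTF-8 config containing a character outside GBK (here `…`, one of whose UTF-8 bytes is 0xA6, the byte named in the traceback) fails when read with the GBK codec, but reads fine once the encoding is passed explicitly. A sketch:

```python
import os
import tempfile

# A config snippet with characters whose UTF-8 byte sequences are not valid
# GBK in this context.
data = '{"symbols": ["…", "↑", "↓"]}'

fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(data)

# Reading with GBK (what a Chinese-locale Windows defaults to) fails:
reproduced = False
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
except UnicodeDecodeError:
    reproduced = True

# The fix from the comment above: pass encoding='utf-8' explicitly.
with open(path, "r", encoding="utf-8") as f:
    roundtrip = f.read()

os.remove(path)
assert reproduced and roundtrip == data
```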
The error is: `UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1337: illegal multibyte sequence`. After looking into it, the cause should be that the default decoding is not UTF-8. The symbols I expect to import are as follows (the original paste had `pad = ''` but used `_pad`; fixed here):

```python
def changeSymbols(self, type):
    if type == 1:
        self.symbols = list(' !"&*,-.?ABCINU[]abcdefghijklmnoprstuwyz{}~')
        self.type = 1
    elif type == 2:
        _pad = ''
        _punctuation = ',.!?-'
        _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ '
        self.symbols = [_pad] + list(_punctuation) + list(_letters)
        self.type = 2
    elif type == 3:
        _pad = ''
        _punctuation = ',.!?-~…'
        _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧʦ↓↑ '
        self.symbols = [_pad] + list(_punctuation) + list(_letters)
        self.type = 3
    elif type == 4:
        _pad = ''
        _punctuation = ';:,.!?¡¿—…"«»“” '
        _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
        _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
        self.symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)
        self.type = 4
```
Besides that, after importing the `type == 1` set via the normal conversion, i.e. `"symbols": [" ","!","\"","&","*",",","-",".","?","A","B","C","I","N","U","[","]","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","r","s","t","u","w","y","z","{","}","~"]`, the following problem also appeared:
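One way to avoid hand-converting these symbol sets, and to sidestep the GBK issue entirely, is to generate the `"symbols"` entry with `json.dumps` and its default `ensure_ascii=True`, which escapes every non-ASCII character as `\uXXXX` so the config decodes under any codec. A sketch (the dict shape is illustrative, not the project's config schema):

```python
import json

# type == 1 set from the comment above: plain ASCII already.
symbols_ascii = list(' !"&*,-.?ABCINU[]abcdefghijklmnoprstuwyz{}~')

# type == 2 set includes characters (ʃ ʧ ↓ ↑) that GBK cannot represent.
symbols_ipa = [''] + list(',.!?-') + list('AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ ')

# ensure_ascii=True (the default) escapes non-ASCII as \uXXXX, so the output
# is pure ASCII and survives being read back under any encoding, GBK included.
escaped = json.dumps({"symbols": symbols_ipa})
assert escaped.isascii()

# json.loads restores the original characters regardless of file encoding:
assert json.loads(escaped)["symbols"] == symbols_ipa
```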
Improvements I'd like to see: support for non-GBK characters, support for `japanese_tokenization_cleaners`, and guidance on importing other non-standard models.