Embedding / Chinese-Word-Vectors

100+ Chinese Word Vectors: over a hundred pre-trained Chinese word vectors
Apache License 2.0
11.76k stars 2.31k forks

Error loading word-vector file #139

Open lxysl opened 3 years ago

lxysl commented 3 years ago

When loading the Sogou News Word + Character + Ngram 300d file, named sgns.sogounews.bigram-char, with the following code, an error occurs:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f:
        values = l.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

The error is:

ValueError: could not convert string to float: '姚'

On inspection, I found that one line of word-vector data in the file is:

扬 姚 -0.890708 1.429886 ......

So should this word be "扬 姚"? Or do "扬" and "姚" both map to the same word vector?


Appendix: my parsing code, which treats "扬" and "姚" as sharing the same word vector:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f:
        values = l.split()
        word = values[0]
        try:
            embeddings_index[word] = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # Line starts with two tokens; assign the same vector to both.
            word2 = values[1]
            embeddings_index[word] = np.asarray(values[2:], dtype='float32')
            embeddings_index[word2] = np.asarray(values[2:], dtype='float32')

shenshen-hungry commented 3 years ago

The gap in "扬 姚" is a full-width space (U+3000). Splitting with

values = l.split(' ')

works.
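The difference can be sketched as follows: with no argument, `str.split()` splits on any Unicode whitespace, including the full-width ideographic space (U+3000), whereas `split(' ')` splits only on the ASCII space, so the full-width space stays inside the token. A minimal sketch (the vector values are placeholders, not taken from the actual file):

```python
# A line whose word contains a full-width space (U+3000), as in
# sgns.sogounews.bigram-char; the numbers are placeholder values.
line = "扬\u3000姚 -0.890708 1.429886"

# split() breaks on ANY whitespace, so the full-width space splits
# the word and "姚" lands where a float is expected.
print(line.split())     # ['扬', '姚', '-0.890708', '1.429886']

# split(' ') breaks only on the ASCII space, keeping the word intact.
print(line.split(' '))  # ['扬\u3000姚', '-0.890708', '1.429886']
```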

Yufanggg commented 3 years ago

Are you using Windows or Linux? When I load with the same code, I get an error:

embeddings_index[word] = np.asarray(values[1:], dtype='float32')
TypeError: list indices must be integers or slices, not str
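One likely cause, regardless of OS: this TypeError is raised when embeddings_index was initialized as a list rather than a dict, since a list cannot be assigned to by a string key. A minimal sketch reproducing the error, assuming that is the case:

```python
import numpy as np

values = ["扬", "-0.890708", "1.429886"]
word = values[0]

# If embeddings_index was accidentally created as a list, assigning by a
# string key raises exactly this TypeError.
embeddings_index = []
try:
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')
except TypeError as e:
    print(e)  # list indices must be integers or slices, not str

# Initializing it as a dict avoids the error.
embeddings_index = {}
embeddings_index[word] = np.asarray(values[1:], dtype='float32')
```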