Embedding / Chinese-Word-Vectors

100+ Chinese Word Vectors: over a hundred pre-trained Chinese word vectors
Apache License 2.0
11.76k stars 2.31k forks

Error loading word-vector file #139

Open lxysl opened 3 years ago

lxysl commented 3 years ago

When loading the Sogou News Word + Character + Ngram 300d file, named sgns.sogounews.bigram-char, with the following code, an error occurs:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f:
        values = l.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

The error is:

ValueError: could not convert string to float: '姚'

On inspection, I found that one line of word-vector data in the file is:

扬 姚 -0.890708 1.429886 ......

So should this word be "扬 姚"? Or do "扬" and "姚" both map to the same word vector?


Appendix: my parsing code, which treats "扬" and "姚" as sharing the same word vector:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f:
        values = l.split()
        word = values[0]
        try:
            embeddings_index[word] = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # Line starts with two tokens; assign the same vector to both.
            word2 = values[1]
            embeddings_index[word] = np.asarray(values[2:], dtype='float32')
            embeddings_index[word2] = np.asarray(values[2:], dtype='float32')

shenshen-hungry commented 3 years ago

The gap in "扬 姚" is a full-width space (U+3000). Splitting with

values = l.split(' ')

works.
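The difference can be sketched as follows: with no argument, `str.split()` splits on any Unicode whitespace, including the full-width ideographic space (U+3000), whereas `split(' ')` splits only on the ASCII space, so the full-width space stays inside the token. A minimal sketch (the vector values are placeholders, not taken from the actual file):

```python
# A line whose word contains a full-width space (U+3000), as in
# sgns.sogounews.bigram-char; the numbers are placeholder values.
line = "扬\u3000姚 -0.890708 1.429886"

# split() breaks on ANY whitespace, so the full-width space splits
# the word and "姚" lands where a float is expected.
print(line.split())     # ['扬', '姚', '-0.890708', '1.429886']

# split(' ') breaks only on the ASCII space, keeping the word intact.
print(line.split(' '))  # ['扬\u3000姚', '-0.890708', '1.429886']
```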

Yufanggg commented 3 years ago

Are you using Windows or Linux? When I load with the same code, I get an error:

embeddings_index[word] = np.asarray(values[1:], dtype='float32')
TypeError: list indices must be integers or slices, not str
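One likely cause, regardless of OS: this TypeError is raised when embeddings_index was initialized as a list rather than a dict, since a list cannot be assigned to by a string key. A minimal sketch reproducing the error, assuming that is the case:

```python
import numpy as np

values = ["扬", "-0.890708", "1.429886"]
word = values[0]

# If embeddings_index was accidentally created as a list, assigning by a
# string key raises exactly this TypeError.
embeddings_index = []
try:
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')
except TypeError as e:
    print(e)  # list indices must be integers or slices, not str

# Initializing it as a dict avoids the error.
embeddings_index = {}
embeddings_index[word] = np.asarray(values[1:], dtype='float32')
```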