Embedding / Chinese-Word-Vectors

100+ Chinese Word Vectors (over a hundred sets of pre-trained Chinese word vectors)
Apache License 2.0

Encoding #109

Closed. Yufei-Z closed this issue 4 years ago

Yufei-Z commented 4 years ago

Hello, I downloaded the financial news vectors. Why does reading the file fail with `'utf-8' codec can't decode bytes in position 3561-3562: invalid continuation byte`? My loading code is attached below.

```python
import numpy as np

word_embedding = True

if word_embedding:
    print('Embedding...')
    EMBEDDING_FILE = 'D:/sgns.financial.bigram-char'
    embed_size = 300

def get_coefs(word, *arr):
    # Turn one split line into a (word, vector) pair.
    return word, np.asarray(arr, dtype='float32')

# The UnicodeDecodeError is raised while iterating over this file.
embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' '))
                        for o in open(EMBEDDING_FILE, encoding='utf-8'))

# `tokenizer` and `vocab` are defined elsewhere in my script.
word_index = tokenizer.word_index
embedding_matrix = np.zeros((len(vocab) + 1, embed_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
```

shenshen-hungry commented 4 years ago

3
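
For readers who hit the same `UnicodeDecodeError`, here is a minimal sketch of one common workaround. It is not taken from this thread. It assumes the file is almost entirely valid UTF-8 with a few malformed byte sequences, and that it follows the word2vec text layout (an optional `<vocab_size> <dim>` header line, then one `word v1 v2 ...` entry per line). The helper name `load_embeddings` is hypothetical. Passing `errors='ignore'` to `open` makes Python drop undecodable bytes instead of raising, so a damaged word may come out slightly altered.

```python
import numpy as np


def load_embeddings(path, encoding="utf-8", errors="ignore"):
    """Read a word2vec-style text file into a dict of word -> float32 vector.

    Hypothetical helper: undecodable bytes are silently dropped
    (errors="ignore") rather than raising UnicodeDecodeError.
    """
    embeddings = {}
    with open(path, encoding=encoding, errors=errors) as f:
        for line_no, line in enumerate(f):
            parts = line.rstrip().split(" ")
            # Assumption: a first line with exactly two fields is the
            # "<vocab_size> <dim>" header, not a word entry.
            if line_no == 0 and len(parts) == 2:
                continue
            # Skip blank or truncated lines.
            if len(parts) < 2 or not parts[0]:
                continue
            word, values = parts[0], parts[1:]
            try:
                embeddings[word] = np.asarray(values, dtype="float32")
            except ValueError:
                # A field was not numeric; skip the line.
                continue
    return embeddings


# Example call (path taken from the question above):
# embeddings_index = load_embeddings('D:/sgns.financial.bigram-char')
```

If silently dropping bytes is not acceptable, `errors='replace'` substitutes U+FFFD instead, which makes the affected entries easy to find afterwards.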