beamandrew / medical-data


data type of embedding file for Clinical Concept Embeddings Learned from Massive Sources of Medical Data #26

Closed grv1207 closed 6 years ago

grv1207 commented 6 years ago

Hi, I downloaded the pre-trained embedding file. The file type says it's a CSV, but it is actually binary. I tried reading it into a Python dictionary but got an error. I also tried loading the embeddings with gensim's KeyedVectors, which fails as well: word_vectors = KeyedVectors.load_word2vec_format('__MACOSX/emb.csv', binary=True)

(I changed the name of the file to emb.csv.)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 37: invalid start byte

Could you tell me what tool is needed to open this file?

beamandrew commented 6 years ago

The file is a CSV but it is compressed as a .zip to save space. You will need to unzip it before you can load it.
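For anyone hitting the same error, here is a minimal sketch of the unzip-then-load step using only the Python standard library. It rests on assumptions not confirmed in this thread: the function name `load_embeddings_from_zip` and the archive name `emb.zip` are hypothetical, and the row layout (`concept_id, v1, v2, ...`) is a guess that should be checked against the real file. Archives created on macOS often carry a `__MACOSX/` folder of binary metadata entries, so the sketch skips those and reads only the actual CSV.

```python
import csv
import io
import zipfile


def load_embeddings_from_zip(zip_path):
    """Read the first real .csv member of a zip archive into {concept: vector}.

    Assumption: each data row looks like `concept_id, v1, v2, ...`.
    Verify this against the actual file before relying on it.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip macOS metadata entries: the __MACOSX/ folder and "._" files.
        members = [
            name for name in archive.namelist()
            if name.lower().endswith(".csv")
            and not name.startswith("__MACOSX/")
            and not name.rsplit("/", 1)[-1].startswith("._")
        ]
        if not members:
            raise ValueError("no CSV file found in archive")

        with archive.open(members[0]) as raw:
            # Zip members are opened as bytes; wrap them to get text rows.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            embeddings = {}
            for row in csv.reader(text):
                if not row:
                    continue  # blank line
                try:
                    embeddings[row[0]] = [float(x) for x in row[1:]]
                except ValueError:
                    continue  # header or other non-numeric row
            return embeddings
```

Calling `load_embeddings_from_zip("emb.zip")` (hypothetical file name) returns a plain dictionary mapping each concept identifier to a list of floats. Since the file is text CSV rather than the word2vec binary format, `binary=True` in `load_word2vec_format` is not expected to work on it.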