Problem loading keyed vectors

fanavarro commented 3 years ago

Hi, I've been playing a little bit with this amazing library by obtaining the embeddings as described in #2. The standalone application generates a txt and a bin file with the keyed vectors in textual and binary formats, respectively. In particular, I'm calculating the embeddings from the gene ontology, included in the repository.

Nonetheless, I'm running into problems when I try to load the previously generated vectors. On the one hand, I tested the following Python instruction:

KeyedVectors.load_word2vec_format(datapath('output.bin'), binary=True)

But I get an exception:

Traceback (most recent call last):
  File "venv/lib/python3.8/site-packages/gensim/models/utils_any2vec.py", line 382, in _load_word2vec_format
    word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
  File "venv/lib/python3.8/site-packages/gensim/utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

On the other hand, I also tried to load the txt file with:

KeyedVectors.load_word2vec_format(datapath('output.txt'), binary=False)

This raises the following exception:

Traceback (most recent call last):
  File "main.py", line 14, in <module>
    model_b = load_model('output.txt')
  File "main.py", line 7, in load_model
    return KeyedVectors.load_word2vec_format(datapath(model_path), binary=False)
  File "venv/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1496, in load_word2vec_format
    return _load_word2vec_format(
  File "venv/lib/python3.8/site-packages/gensim/models/utils_any2vec.py", line 394, in _load_word2vec_format
    raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
ValueError: invalid vector on line 2207 (is this really the text format?)

After debugging the app, I found that the offending line is the following one:

OBSOLETE. (Was not defined before being made obsolete). -0.35172817 0.9547997 -0.7017195 -0.022278534 -0.21855797 1.2328295 0.026366502 1.0293199 -0.42764938 -0.8031358 -0.7505182 -0.01582495 -1.4183652 0.68057406 0.22078635 0.75405 0.32506666 -1.7469246 0.62090874 0.33088538 0.32958925 -0.21696554 -0.99827904 -1.1616639 -1.3286982 0.89662665 -1.1478066 0.39570102 -0.28800654 0.6889498 1.2787603 1.2980725 -0.19311273 0.61996716 2.1367197 0.5362677 0.38471636 1.7419933 -0.2525881 -1.0632398 -0.23395675 0.9228735 1.0655191 -1.2626935 1.8425548 -0.2289917 -1.3743287 -1.0106764 1.1029646 0.26697654 -0.05864819 -0.5478173 -0.6971337 -1.7715415 0.2442582 -1.2734476 0.25903603 0.6714998 0.0923138 -0.70214653 -0.024936976 -1.3333995 -1.1616304 0.052265227 0.6952294 0.6618334 -0.9966148 1.3055371 2.9172845 1.5078834 2.4491236 -0.41737756 -0.8264428 1.9000809 -0.18261702 0.25123483 0.7783439 0.16481185 0.3635699 -0.29046142 0.54508567 1.2136813 -1.8205711 -1.4147732 0.719116 0.08283793 0.5585965 0.10322688 1.9780725 -1.2655574 0.51070905 -0.9030711 -0.94760007 1.2188694 1.1546952 -0.95993125 1.3770614 0.1960414 -1.4413091 0.20371768
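For anyone wanting to locate such rows without stepping through a debugger, here is a minimal sketch. It assumes the usual word2vec text layout (a header line with the vocabulary size and the vector dimension, then one space-separated row per token). The function name `find_malformed_lines` is hypothetical, and the row numbers it reports are counted from the first vector row, so they may be offset from the number shown in the gensim error message.

```python
def find_malformed_lines(path, encoding="utf-8"):
    """Return (row_no, n_fields) for every row that does not have dim + 1 fields.

    Assumes the first line is the header "<vocab_size> <dim>".
    row_no starts at 1 for the first row after the header.
    """
    bad = []
    with open(path, encoding=encoding) as fh:
        header = fh.readline().split()
        dim = int(header[1])
        for row_no, line in enumerate(fh, start=1):
            # A well-formed row is one token followed by exactly `dim` numbers.
            n_fields = len(line.rstrip().split(" "))
            if n_fields != dim + 1:
                bad.append((row_no, n_fields))
    return bad
```

Called on a file such as `output.txt`, a non-empty result points at rows whose token contains spaces (or that are otherwise truncated).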

I think that the load function expects a single word followed by the vector but, in this case, the token consists of several words. I am using gensim version 3.8.0 both when running owl2vec* and when loading the generated vectors. Do you have any clue about why this line is included in the embedding files? Should I do some kind of ontology preprocessing, i.e., removing special characters, in order to avoid this?
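One possible workaround, sketched here under assumptions and not something proposed by the library authors, is to rewrite the text file so that every multi-word token becomes a single token. The sketch assumes the header line gives the vector dimension and that the last `dim` fields of each row are the vector. The function name `join_multiword_tokens` and the underscore separator are hypothetical choices.

```python
def join_multiword_tokens(src_path, dst_path, sep="_", encoding="utf-8"):
    """Copy a word2vec text file, joining tokens that contain spaces with `sep`.

    Assumes the first line is the header "<vocab_size> <dim>" and that the
    last `dim` fields of every row are the vector components.
    Returns the number of rows that were rewritten.
    """
    fixed = 0
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        header = src.readline()
        dim = int(header.split()[1])
        dst.write(header)  # row count is unchanged, so the header still holds
        for line in src:
            parts = line.split()
            if len(parts) <= dim + 1:
                # Already a single token (or too short to repair): copy as-is.
                dst.write(line)
                continue
            token = sep.join(parts[:-dim])
            dst.write(token + " " + " ".join(parts[-dim:]) + "\n")
            fixed += 1
    return fixed
```

Note that joining words this way could, in principle, produce a token that already exists elsewhere in the file, so the result is worth checking before relying on it.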

Kind regards and thanks for your work.

ernestojimenezruiz commented 3 years ago

Hi Francisco

With the addition I made in the other issue, OWL2Vec can generate three files: .txt, .bin and .embeddings. To load the keyed vectors, the best option is to use the .embeddings file.

Ernesto

fanavarro commented 3 years ago

Hi Ernesto, thanks again for your help. I was able to load the keyed vectors with: KeyedVectors.load(datapath('output.embeddings'), mmap='r')

Greetings.

KRR-Oxford / OWL2Vec-Star

Problem loading keyed vectors #4