commonsense / conceptnet-numberbatch

Lemmatization for SNLI #48

Closed chledowski closed 6 years ago

chledowski commented 6 years ago

Hi,

I would like to use your embeddings on the SNLI dataset. However, due to lemmatization, almost half of the words have no embeddings. Therefore, I'd like to lemmatize the SNLI dataset.

Which lemmatization algorithm would be best for getting a vocabulary similar to that of ConceptNet Numberbatch?

rspeer commented 6 years ago

Thanks for your interest! I'm definitely interested in seeing the data applied to more tasks.

There hasn't been lemmatization in Numberbatch in over a year. Are you following a very old link?

chledowski commented 6 years ago

Hi!

I have downloaded the embeddings from http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5 - so it's the newest version of the embeddings in .h5 format.

I thought that there was lemmatization because, when I compare the ConceptNet embeddings to GloVe Common Crawl (840B) on the dictionary from the SNLI dataset, I get the following results:

CC840: Found 38103 words in the dictionary. Missing 4289 words.
CNet: Found 23970 words in the dictionary. Missing 18422 words.

Do you know what could cause this problem? Apart from a bug in my code :)
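A count alone does not say why a word is missing, so listing the missing words themselves can help narrow this down. The following is a minimal sketch of such a check; the function name `coverage` and the toy word lists are hypothetical and are not taken from SNLI or from the Numberbatch files.

```python
def coverage(words, vocab):
    """Split `words` into those present in `vocab` and those absent from it."""
    found = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return found, missing


# Toy stand-ins for the dataset dictionary and the embedding vocabulary
# (assumptions for illustration only).
dataset_words = ["A", "man", "Woman", "dog", "womann"]
embedding_vocab = {"a", "man", "woman", "dog"}

found, missing = coverage(dataset_words, embedding_vocab)
print(f"Found {len(found)} words in the dictionary. Missing {len(missing)} words.")
# Looking at actual missing entries often reveals a pattern
# (casing, punctuation, misspellings) that the counts hide.
print("Sample of missing words:", missing[:20])
```

Running the same check against both embedding vocabularies and comparing the two missing lists would show which words are absent from only one of them.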

chledowski commented 6 years ago

I have read that mini.h5 is a smaller version, so I thought that might be the problem. Therefore, I've also used wget to download http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/16.09/numberbatch.h5

Unfortunately, I get identical results.

So I am wondering: do all the .h5 files contain a smaller number of words?

If so, do you have an .h5 version of ConceptNet Numberbatch that contains all the words? :)

chledowski commented 6 years ago

I figured out that I had left some words in upper case. Now only 10000 out of 40000 words are not found. These are misspelled words (e.g. "womann" instead of "woman").

These can probably be dealt with using the Levenshtein distance.
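A minimal sketch of that idea, assuming a plain Python set of vocabulary words: lowercase the word, try an exact lookup, and otherwise fall back to the nearest vocabulary entry within a small edit distance. The names `levenshtein` and `closest_word`, the `max_distance` threshold and the toy vocabulary are all hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_word(word, vocab, max_distance=1):
    """Return the vocab entry nearest to `word`, or None if none is close enough."""
    word = word.lower()
    if word in vocab:
        return word
    best, best_dist = None, max_distance + 1
    for candidate in sorted(vocab):  # sorted so ties are broken deterministically
        # A length gap larger than the threshold already rules the candidate out.
        if abs(len(candidate) - len(word)) > max_distance:
            continue
        dist = levenshtein(word, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


# Toy vocabulary (an assumption, not the real embedding vocabulary).
vocab = {"woman", "man", "dog"}
print(closest_word("womann", vocab))  # woman
print(closest_word("Woman", vocab))   # woman
print(closest_word("xyzzy", vocab))   # None
```

Scanning the whole vocabulary for every missing word is slow for a large embedding table, and a low threshold matters: short words have many neighbours at distance 1, so a misspelling may be mapped to the wrong word.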

Closing, thanks