commonsense / conceptnet-numberbatch

Quick questions. Thanks. #34

Closed hohoCode closed 8 years ago

hohoCode commented 8 years ago

Thanks for the great open-source work and the embeddings!

I just have a few quick questions:

1) The GloVe embeddings (the 840B version) have a vocabulary of 2,196,017 words, while yours has only 1,453,347. Since I'm under the impression that your approach combines many resources (word2vec, GloVe, PPDB, ConceptNet), could you please clarify why your vocabulary is so much smaller (about 66% of GloVe's)?

2) Is this because you combine words into phrases? I found many phrase entries in your vocabulary, like "supreme_court", "washington_dc", "san_francisco", and "natural_gas", and GloVe does not have these.

3) By the way, would it be possible to release your embeddings as plain text files (zipped, just like GloVe's format) instead of NumPy matrices?

Thanks again!

rspeer commented 8 years ago

There are various differences between the vocabularies, but one thing going on here is that ConceptNet Numberbatch uses lemmatization: it merges different forms of a word into their root form. GloVe trains separate vectors for "decide", "decides", and "decided", for example, as well as for various capitalizations of those words. These forms have no inherent connection to each other in GloVe, so it relies on having lots of data to make them come out similar. This works in many cases, but it leads to bad vectors for rarer word forms.

Numberbatch only has the one vector for "decide", and wants you to normalize your text to match before looking it up.
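
For example, a lookup helper might look something like this (a minimal sketch; NLTK's WordNetLemmatizer stands in for Numberbatch's own normalization, and the file names are illustrative):

```python
import numpy as np
from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Illustrative file names: a release is an embedding matrix plus a word list.
matrix = np.load('numberbatch-matrix.npy')
with open('numberbatch-vocab.txt', encoding='utf-8') as f:
    index = {line.strip(): i for i, line in enumerate(f)}

def vector_for(term):
    """Lowercase, lemmatize each token, and join phrases with underscores."""
    key = '_'.join(lemmatizer.lemmatize(tok) for tok in term.lower().split())
    return matrix[index[key]] if key in index else None

vector_for('courts')         # looked up under its root form, 'court'
vector_for('San Francisco')  # looked up as the phrase 'san_francisco'
```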

On top of that, vocabulary items that appear far down the list in only one resource are excluded. (Look at the end of GloVe's vocabulary and you'll see terms such as "working.So".)

All of that reduces the vocabulary size, but the vocabulary is then expanded again by including phrases, which come from ConceptNet and from word2vec Google News.

I find GloVe's text format tremendously inefficient. The first thing you have to do with it is convert a huge number of decimal strings to floating point, and I make sure to save the result of that conversion so I never have to do it again.
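
As an illustration of that one-time conversion (a sketch; the file names and the 300-dimension count match the 840B release, and splitting from the right handles the few GloVe tokens that themselves contain spaces):

```python
import numpy as np

# One-time conversion: parse GloVe's text format and cache it in binary form,
# so the decimal-string-to-float step never has to run again.
words, rows = [], []
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(' ')
        words.append(' '.join(parts[:-300]))  # split from the right
        rows.append(np.asarray(parts[-300:], dtype='<f4'))  # float32 halves memory

np.save('glove.840B.300d.npy', np.vstack(rows))   # fast binary reload later
with open('glove.840B.300d.vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words))
```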

word2vec Google News comes only in an idiosyncratic binary floating-point format. Using NumPy format seems like an improvement over that. I also plan to offer HDF5, which is even more standard.
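
For a sense of what that could look like (a sketch using h5py; the file and dataset names are illustrative, not a promise about the actual release layout):

```python
import h5py
import numpy as np

# Sketch: bundle the embedding matrix and its vocabulary into one HDF5 file.
matrix = np.load('numberbatch-matrix.npy')
with open('numberbatch-vocab.txt', encoding='utf-8') as f:
    vocab = [line.strip() for line in f]

with h5py.File('numberbatch.h5', 'w') as h5:
    h5.create_dataset('matrix', data=matrix)
    h5.create_dataset('vocab', data=vocab,
                      dtype=h5py.special_dtype(vlen=str))

with h5py.File('numberbatch.h5', 'r') as h5:     # reading it back
    matrix = h5['matrix'][:]
    vocab = list(h5['vocab'][:])
```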

If you need to read the NumPy format without NumPy, here's what it is: a 6-byte magic string (\x93NUMPY), a 2-byte format version, a little-endian header length (2 bytes in format version 1.0), a header that is a Python dict literal padded out to that length, and then the raw array data.

For reference, the header for the 600d matrix says: {'descr': '<f8', 'fortran_order': False, 'shape': (1453348, 600), }. 'descr': '<f8' means little-endian double-precision, and 'fortran_order': False means it's row-major.
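
Putting that together, a reader needs nothing beyond the standard library (a minimal sketch, assuming format version 1.0 and an illustrative file name):

```python
import ast
import struct

# Minimal .npy reader (format version 1.0) using only the standard library.
with open('numberbatch-matrix.npy', 'rb') as f:
    assert f.read(6) == b'\x93NUMPY'                 # magic string
    major, minor = f.read(2)                         # format version bytes
    header_len = struct.unpack('<H', f.read(2))[0]   # v1.0: 2-byte length
    header = ast.literal_eval(f.read(header_len).decode('ascii'))
    n_rows, n_cols = header['shape']
    # '<f8' is little-endian float64, 8 bytes each; rows are stored
    # contiguously because fortran_order is False.
    first_row = struct.unpack('<%dd' % n_cols, f.read(n_cols * 8))
```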

Hope this helps.

hohoCode commented 8 years ago

Thanks a lot for answering all my questions!!

Great open-source work! :+1: