JuliaText / Embeddings.jl

Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GloVe)
MIT License

Sentence and paragraph (etc) distances? #36

Open robertfeldt opened 3 years ago

robertfeldt commented 3 years ago

Thanks for this package; very useful.

Would it make sense to include simple multi-word distance metrics like MOWE (mean/median of word embeddings) etc. in this package, or is that already available in other JuliaText packages? I didn't find it, but it seems like quite a common use case for people who download Embeddings.jl. An alternative might be to make these part of StringDistances.jl instead.

oxinabox commented 3 years ago

I agree that is a common use. I mean, my PhD thesis was on the fact that such simple linear combinations of word embeddings often outperform more sophisticated methods.

But I am not sure it is worth including in the package. The package is intentionally the bare minimum, just handling data loading. It doesn't even handle looking up the index for a word. The user is left to do that by writing something like

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))
get_embedding(word) = embtable.embeddings[:, get_word_index[word]]

This allows them to do something fancier if they have, for example, loaded their words into a PooledArray etc.

Similarly, things like sums of embeddings are also one-liners.

using Statistics  # provides `mean`
sowe(words) = sum(get_embedding, words)
mowe(words) = mean(get_embedding, words)

And if they want to do something fancier, e.g. to handle out-of-vocabulary words, then they are free to do so.
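
As a minimal sketch of the out-of-vocabulary case, assuming a toy table in place of a real `embtable`, one could skip unknown words and then compare two texts by cosine distance. The vocabulary, the vectors, and the names `mowe_oov` and `cosine_distance` are made up for illustration and are not part of Embeddings.jl:

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

# Toy stand-ins for embtable.vocab and embtable.embeddings (hypothetical values).
vocab = ["cat", "dog", "car"]
embeddings = [1.0 0.9 0.0;
              0.0 0.1 1.0]   # one column per word

word_index = Dict(word => ii for (ii, word) in enumerate(vocab))

# Mean of word embeddings, skipping out-of-vocabulary words.
# Returns `nothing` when none of the words is in the vocabulary.
function mowe_oov(words)
    known = [embeddings[:, word_index[w]] for w in words if haskey(word_index, w)]
    return isempty(known) ? nothing : mean(known)
end

# Cosine distance between two embedding vectors.
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# "unicorn" is not in the toy vocabulary, so it is ignored.
d_close = cosine_distance(mowe_oov(["cat", "unicorn"]), mowe_oov(["dog"]))
d_far   = cosine_distance(mowe_oov(["cat"]), mowe_oov(["car"]))
```

Skipping unknown words is only one option; mapping them to a shared "unknown" vector would be another, and which one works better likely depends on the data.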

robertfeldt commented 3 years ago

Yes, I saw your thesis (but haven't read it all).

Sure, it's simple enough to keep it out. I figured not everyone who needs sentence/paragraph distances would know about sowe/mowe so having it in a package might make it easier but maybe many do. Anyway, no problem.

BTW, would you recommend straight mowe/sowe on all the words (well, potentially excluding stop words etc.) of a paragraph, or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of any papers investigating this empirically?

oxinabox commented 3 years ago

> BTW, would you recommend straight mowe/sowe on all the words (well, potentially excluding stop words etc.) of a paragraph, or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of any papers investigating this empirically?

Straight mowe/sowe is so simple to implement that it should be the first thing you try (possibly after plain BoW). I am not sure that any kind of sentence-wise processing would give much gain; it might, but it might not. It also seems like it would be annoying, since you need to deal with different sentences in different orders and with different numbers of sentences. At that point, though, maybe one can just go straight to a fancier model.
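
For illustration only, here is one way the sentence-wise idea could be sketched so that sentence order and sentence count do not matter: match each sentence vector to its closest counterpart in the other paragraph, and average those distances in both directions. This aggregation is an assumption for the sake of the example, not something recommended in this thread, and the names `directed` and `paragraph_distance` as well as the vectors are hypothetical:

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

# Cosine distance between two sentence vectors (e.g. per-sentence mowe).
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# For each sentence vector in `xs`, take the distance to its closest
# sentence vector in `ys`, then average over `xs`.
directed(xs, ys) = mean(minimum(cosine_distance(x, y) for y in ys) for x in xs)

# Symmetrised so that swapping the two paragraphs gives the same result.
paragraph_distance(xs, ys) = (directed(xs, ys) + directed(ys, xs)) / 2

# Hypothetical per-sentence vectors for two paragraphs with the same
# content but a different sentence order and a different sentence count.
para_a = [[1.0, 0.0], [0.0, 1.0]]
para_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

d_same = paragraph_distance(para_a, para_b)
d_diff = paragraph_distance(para_a, [[-1.0, 0.0]])
```

Whether this beats a single mowe over the whole paragraph is an empirical question that the sketch does not answer.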