epfml / sent2vec

General purpose unsupervised sentence representations

Does nnSent take word n-grams into account? #11

Closed matthias-samwald closed 7 years ago

matthias-samwald commented 7 years ago

I trained the model with wordNgrams set to 2, then tried the same input sentence with permuted word order. I would expect the results to be at least slightly different, since the word n-grams are different, but they are precisely the same. Are the word n-grams not taken into account here?

Query sentence? 
the capital of austria is vienna .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

Query sentence? 
vienna is the capital of austria .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

Query sentence? 
capital austria vienna is of the .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

(I will reply regarding the sorting issue soon)
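
To make the expectation above concrete, here is a minimal sketch in plain Python (it is not sent2vec code, and `word_ngrams` is a hypothetical helper) showing that two of the query sentences share the same bag of unigrams but not the same bigrams:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the capital of austria is vienna .".split()
b = "vienna is the capital of austria .".split()

# Unigram multisets ignore word order, so they match for a permutation.
same_unigrams = Counter(a) == Counter(b)

# Bigrams depend on which words are adjacent, so they change with the order.
same_bigrams = Counter(word_ngrams(a, 2)) == Counter(word_ngrams(b, 2))

print(same_unigrams, same_bigrams)
```

If only unigram features were used, a permuted query would be indistinguishable from the original, which matches the identical result lists above.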

martinjaggi commented 7 years ago

Yes, it does take word bi-grams into account, though the bigrams are sparser and often get rather small weights. Did you also observe the same with one of our pre-trained bi-gram models?

guptaprkhr commented 7 years ago

Hi Matthias, thanks for pointing this out. :) It seems I hadn't added the code for adding n-grams in nnSent and analogiesSent. Can you try it now? I will try to resolve the other issue ASAP. It is most probably due to the commit https://github.com/epfml/sent2vec/pull/8 .
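
As a rough illustration of why the missing n-gram step made the results order-independent, the toy sketch below averages feature vectors with and without bigram features. Everything in it is an assumption made for illustration: `toy_vector`, `sentence_vector`, `close`, and `DIM` are hypothetical names, the vectors are pseudo-random stand-ins for learned embeddings, and none of this is the actual sent2vec implementation.

```python
import random

DIM = 8  # toy embedding size, chosen arbitrarily

def toy_vector(feature):
    """Deterministic pseudo-random vector standing in for a learned embedding."""
    rng = random.Random(feature)  # string seeds are reproducible across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def sentence_vector(tokens, use_bigrams):
    """Average the vectors of all unigram (and optionally bigram) features."""
    features = list(tokens)
    if use_bigrams:
        features += [f"{x} {y}" for x, y in zip(tokens, tokens[1:])]
    vectors = [toy_vector(f) for f in features]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def close(u, v, tol=1e-9):
    """Compare two vectors up to floating-point rounding."""
    return all(abs(x - y) <= tol for x, y in zip(u, v))

a = "the capital of austria is vienna .".split()
b = "vienna is the capital of austria .".split()

# Unigrams only: averaging is order-invariant, so both sentences coincide.
print(close(sentence_vector(a, False), sentence_vector(b, False)))

# With bigram features added, the two word orders give different vectors.
print(close(sentence_vector(a, True), sentence_vector(b, True)))
```

In this sketch the unigram-only average corresponds to the behavior reported in the question, and the variant with bigrams corresponds to the behavior one would expect after the fix.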

matthias-samwald commented 7 years ago

Your recent commits have solved the issue: different word orders now lead to different results, as expected. Thanks!