HIT-SCIR / ELMoForManyLangs

Pre-trained ELMo Representations for Many Languages
MIT License

Different output vectors for same sentences #30

Open ntanhhus opened 5 years ago

ntanhhus commented 5 years ago

Hi, I am using ELMo for Japanese. Here is my code:

from elmoformanylangs import Embedder
e = Embedder('/Users/tanh/Desktop/alt/JapaneseElmo')

if __name__ == '__main__':
    sents = [
        ['今'],
        ['今'],
        ['潮水', '退']
    ]
    print(e.sents2elmo(sents))
    print(e.sents2elmo(sents))

And here is the console output:

2018-11-14 10:33:26,441 INFO: 1 batches, avg len: 3.3
[array([[-0.23187001, -0.09699917, 0.46900252, ..., -0.33114347, 0.18502058, -0.27423012]], dtype=float32),
 array([[-0.23187001, -0.09699917, 0.46900252, ..., -0.33114347, 0.18502058, -0.27423012]], dtype=float32),
 array([[-0.11759937, -0.04552874, 0.22546595, ..., 0.21812831, -0.33964303, -0.33022305],
        [-0.26380852, -0.27671477, -0.33576807, ..., 0.15142155, -0.04612424, -0.74970037]], dtype=float32)]

2018-11-14 10:33:26,734 INFO: 1 batches, avg len: 3.3
[array([[-0.25601366, -0.10413959, 0.45184097, ..., -0.34171066, 0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.25601366, -0.10413959, 0.45184097, ..., -0.34171066, 0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.12085894, -0.05347676, 0.18303208, ..., 0.22256255, -0.37257898, -0.39672664],
        [-0.21205096, -0.31738985, -0.34304047, ..., 0.24654591, -0.07900852, -0.710617 ]], dtype=float32)]

As you can see, the output is different when I run sents2elmo twice on the same input. Is this normal or a bug? If it is normal, how can I prevent it from happening?

blouargant commented 5 years ago

Hello, I have exactly the same behavior with the French model and I was going to open an issue with the very same questions :)

After something like 10 loops over the same sentence, the word vectors start to stabilize. It looks like the model continues to train even after the .eval() function has been called.

Note that the output of the word encoder (output_layer=0) always gives the same results. Only the outputs of the LSTMs change.
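
One way to put a number on this drift is to compare two runs element by element. The helper below is a minimal sketch and not part of elmoformanylangs: `max_abs_diff` is a hypothetical name, and the only assumption is that each run is a list holding one 2-D array (or nested list) per sentence, with matching shapes. The sample values are the first three numbers of the first sentence in the two runs from the original report.

```python
def max_abs_diff(run_a, run_b):
    """Largest absolute element-wise difference between two runs.

    Each run is a list with one 2-D array (tokens x dims) per sentence.
    Plain nested lists work too, because only iteration is used.
    """
    worst = 0.0
    for sent_a, sent_b in zip(run_a, run_b):
        for tok_a, tok_b in zip(sent_a, sent_b):
            for x, y in zip(tok_a, tok_b):
                worst = max(worst, abs(float(x) - float(y)))
    return worst


# Hypothetical sample data: truncated vectors copied from the report.
first_run = [[[-0.23187001, -0.09699917, 0.46900252]]]
second_run = [[[-0.25601366, -0.10413959, 0.45184097]]]
print(max_abs_diff(first_run, second_run))
```

With a loaded model, the same helper could be applied to two consecutive `e.sents2elmo(sents)` results; a value of 0.0 would mean the two runs agree exactly.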

tkon3 commented 5 years ago

Hello, I see this behavior as well. I guess it's related to the LSTM internal states, as described in the AllenNLP note (Notes on statefulness and non-determinism): https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

Do we need to specify special tokens at the beginning/end of each sentence?
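
To make the statefulness hypothesis concrete, here is a toy illustration of how a recurrent encoder that carries its hidden state from one call to the next can return different outputs for identical input. This is only a sketch: `ToyStatefulRNN` and its weights are invented for the example, it is not code from ELMoForManyLangs or AllenNLP, and the thread does not confirm that this is the actual cause.

```python
import math


class ToyStatefulRNN:
    """Single-unit tanh RNN that keeps its hidden state between calls."""

    def __init__(self, w_in=0.5, w_rec=0.8):
        self.w_in = w_in
        self.w_rec = w_rec
        self.hidden = 0.0  # carried over from one call to the next

    def reset(self):
        """Clear the carried state."""
        self.hidden = 0.0

    def encode(self, inputs):
        """Run the RNN over the inputs, starting from the carried state."""
        outputs = []
        for x in inputs:
            self.hidden = math.tanh(self.w_in * x + self.w_rec * self.hidden)
            outputs.append(self.hidden)
        return outputs


rnn = ToyStatefulRNN()
sentence = [1.0, -0.5, 0.25]
first = rnn.encode(sentence)
second = rnn.encode(sentence)  # starts from the state left by the first call
print(first == second)         # False: same input, different output

rnn.reset()
third = rnn.encode(sentence)   # starts from a zero state again
print(first == third)          # True: repeatable once the state is cleared
```

In this toy model the first step of each call depends on whatever state the previous call left behind, which is why only the recurrent outputs drift while a stateless input layer would stay fixed.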

jianminli55 commented 5 years ago

I ran into the same problem. Did you solve it?

ghpu commented 5 years ago

I got stable embeddings if I add <bos> and <eos> tokens around each sentence.

PawelFaron commented 5 years ago

The same issue here. No solution I guess?

PawelFaron commented 5 years ago

> I got stable embeddings if I add <bos> and <eos> tokens around each sentence.

They are added either way in the read_list function, and it doesn't help. If it helps in your case, could you please share the code?

ghpu commented 5 years ago

@PawelFaron My mistake, I get stable embeddings if I recreate an Embedder each time :-(

import numpy
from elmoformanylangs import Embedder

test = {
    "fr": [["Les", "chaussettes", "de", "l'", "archiduchesse", "sont", "elles", "sèches", "?"],
           ["Test", "de", "phrase", "en", "batch"],
           ["La", "stabilité", "n'", "est", "pas", "toujours", "présente"]]
}
"""
,"en":[["This","is","outrageous","!"],["Is","n't","it","lovely","?"]]
,"es":[["¿","Qué","hora","es","?"]]
,"ar":[["ﺎَﻠﺴَّﻟَﺎﻣُ","ﻊَﻠَﻴْﻜُﻣْ"]]
,"pl":[["na","zdrowie"]]
}
"""

def add_bos(sentence_list):
    # In-place modification: wrap each sentence in <bos>/<eos> if missing.
    for s in sentence_list:
        if s[0] != "<bos>":
            s.insert(0, "<bos>")
        if s[-1] != "<eos>":
            s.append("<eos>")

for k, v in test.items():
    add_bos(v)

for iterations in range(3):
    for k, v in test.items():
        # A new Embedder is created on every pass.
        e = Embedder("data/ELMo/" + k + "/", batch_size=4)
        result = e.sents2elmo(v)
        for sid, sentence in enumerate(result):
            numpy.savetxt("data/" + str(k) + "_" + str(iterations) + "_" + str(sid) + "_0.txt", sentence, fmt="%.4f")

Without explicitly adding <bos> and <eos> , I didn't even have stability between each iteration :-(
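
The "recreate an Embedder each time" workaround can be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `embed_with_fresh_model` is a hypothetical name, and `FakeEmbedder` is a stand-in whose output depends on how often it has been called, so the example runs without a downloaded model. Only the `sents2elmo` method name is taken from the thread.

```python
def embed_with_fresh_model(make_embedder, sentences):
    """Build a new embedder for this call only, then embed the sentences."""
    embedder = make_embedder()
    return embedder.sents2elmo(sentences)


class FakeEmbedder:
    """Stand-in with carried state, so the sketch runs without a model."""

    def __init__(self):
        self.calls = 0

    def sents2elmo(self, sentences):
        self.calls += 1
        # Output depends on how often this instance has been called.
        return [[float(len(tok) + self.calls) for tok in sent]
                for sent in sentences]


sents = [["Test", "de", "phrase"]]

shared = FakeEmbedder()
print(shared.sents2elmo(sents) == shared.sents2elmo(sents))   # False

fresh_a = embed_with_fresh_model(FakeEmbedder, sents)
fresh_b = embed_with_fresh_model(FakeEmbedder, sents)
print(fresh_a == fresh_b)                                     # True
```

With the real library the factory could be something like `lambda: Embedder("data/ELMo/fr/")`. Reloading the model on every call is likely to be slow, so this is a way to confirm the diagnosis more than a fix.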

petious commented 3 years ago

I have come across the same problem on my end:

>> test1 = e.sents2elmo([['voiture', 'deux', 'places']])
array([[ 0.16128092, -0.14193378,  0.0679277 , ...,  0.0372387 ,
        -0.01055785, -0.1358149 ],
       [ 0.11865471,  0.03754618,  0.14029296, ...,  0.05880207,
        -0.30177963,  0.04773434],
       [-0.0712164 , -0.19249576, -0.1217883 , ...,  0.23560412,
        -0.17588478,  0.64805335]], dtype=float32)
>> test2 = e.sents2elmo([['voiture', 'deux', 'places']])
array([[ 0.10224233, -0.09587628,  0.11073047, ..., -0.02238917,
        -0.07524507, -0.28675506],
       [ 0.06666292,  0.03427267,  0.17593448, ..., -0.01624204,
        -0.33423385, -0.00192846],
       [-0.08335221, -0.20575272, -0.12499571, ...,  0.10726541,
        -0.3955935 ,  0.584318  ]], dtype=float32)

Any ideas on what's causing it/how to fix it?

jianminli55 commented 3 years ago

No, I use BERT instead.