HIT-SCIR / ELMoForManyLangs

Pre-trained ELMo Representations for Many Languages

How to get the embedding for each word in the sentence? #40

Open ghost opened 5 years ago

ghost commented 5 years ago

Hi,

I am struggling to get the embedding for individual words. I used this command:

python -m elmoformanylangs test --input_format conll --input input.conllu --model ar.model --output_prefix ./output/ --output_format hdf5 --output_layer -1

And it dumps an hdf5-encoded file onto the disk, as expected. However, as far as I understand, the file encodes a dict where the key is the tab-separated sentence and the value is its representation.

But when I print the key:


import h5py

f = h5py.File(filename, 'r')  # filename is the hdf5 file written by the command above

for key in list(f.keys()):
    print(key)

I can see that f.keys() contains only one string key covering all the sentences in the input file. 1) Why is that, and how do I get individual sentence representations? 2) How do I get individual word representations?

This is an example of my input with 2 sentences:

1   ik  ik  PRON    VNW|pers|pron|nomin|vol|1|ev    Case=Nom|Person=1|PronType=Prs  2   nsubj   2:nsubj _
2   zie zien    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin 0   root    0:root  _
3   hem hem PRON    VNW|pers|pron|obl|vol|3|ev|masc Case=Acc|Person=3|PronType=Prs  2   obj 2:obj|4:nsubj:xsubj _
4   fietsen fietsen VERB    WW|inf|vrij|zonder  VerbForm=Inf    2   xcomp   2:xcomp _
1   Jan Jan PROPN   N|eigen|ev|basis|zijd|stan  Gender=Com|Number=Sing  2   nsubj   2:nsubj _
2   komt    komen   VERB    WW|pv|tgw|met-t Number=Sing|Tense=Pres|VerbForm=Fin 0   root    0:root  _
3   vandaag vandaag ADV BW  _   2   advmod  2:advmod    _
4   en  en  CCONJ   VG|neven    _   5   cc  5.1:cc  _
5   Piet    Piet    PROPN   N|eigen|ev|basis|zijd|stan  Gender=Com|Number=Sing  2   conj    5.1:nsubj   _ 
Oneplus commented 5 years ago

Hi, the value of f[key] is a numpy array of shape (seq_len, dim) (if you use the recent patch and output all the layers, it will be (n_layer, seq_len, dim)). You can get the embedding for each word with numpy.split along the seq_len dimension.
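For example, a minimal sketch of the idea, assuming the output was written with a single layer so each value has shape (seq_len, dim); the file name here is a placeholder, and the key is treated as the tab-separated sentence as discussed above:

import h5py
import numpy as np

# open the hdf5 output (path is a placeholder)
with h5py.File('output.hdf5', 'r') as f:
    for key in f.keys():
        sent = f[key][:]                 # numpy array of shape (seq_len, dim)
        words = key.split('\t')          # the key is the tab-separated sentence
        # split along the seq_len axis -> one (1, dim) array per word
        vectors = np.split(sent, sent.shape[0], axis=0)
        for word, vec in zip(words, vectors):
            print(word, vec.squeeze().shape)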

ghost commented 5 years ago

So the sentence embedding is not averaged; I understand now. However, in f[key], the key should be the sentence itself, right?

Another problem I mentioned in my issue concerns the input format. I suspect I am doing something wrong, because when I print the length of f.keys(), it returns 1 even though my input contains more than one sentence. So this loop is executed only once and treats all my sentences as a single one.

for key in list(f.keys()):
    print(key)

Am I doing something wrong?

Oneplus commented 5 years ago

So the sentence embedding is not averaged; I understand now. However, in f[key], the key should be the sentence itself, right?

Yes, the key should be the sentence itself.

Am I doing something wrong?

Please check that your input file follows the conll format (https://github.com/HIT-SCIR/ELMoForManyLangs#use-elmoformanylangs-in-command-line) and that you specify the input format as conll.
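As a quick sanity check (not part of the library, just a standalone sketch assuming the standard CoNLL convention that sentences are separated by blank lines; paths are placeholders), you can compare the number of sentence blocks in the input file with the number of keys in the hdf5 output. If the blank lines between sentences are missing, everything is read as one sentence:

import h5py

# count blank-line-separated sentence blocks in the conll input (path is a placeholder)
with open('input.conllu', encoding='utf-8') as fin:
    blocks = [b for b in fin.read().strip().split('\n\n') if b.strip()]
print('sentences in input:', len(blocks))

# compare with the number of sentence keys in the hdf5 output (path is a placeholder)
with h5py.File('output.hdf5', 'r') as f:
    print('sentences in hdf5:', len(f.keys()))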