allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0
1.62k stars, 452 forks

Will `dump_bilm_embeddings` change orders of inputs and outputs? #194

Closed: yslin1995 closed this issue 5 years ago

yslin1995 commented 5 years ago

Hi, when I run the following code, the order of the sentences changes after calling dump_bilm_embeddings().

For example, the lengths of the sentences in 111.txt are [21, 11, 17, 16, 20, 14, 21, 23, 17, 17, 18], but after dumping the sentence embeddings, the shapes of the embeddings come out in the order [21, 11, 18, 17, 16, 20, 14, 21, 23, 17, 17]. So the last sentence, with length 18, has moved up to third position.

However, I didn't find any shuffling or swapping code in dump_bilm_embeddings(), so why does this happen?

main.py

import h5py
from bilm.model import dump_bilm_embeddings

with open('./111.txt', 'r', encoding='utf-8') as f:
    print([len(line.split()) for line in f.readlines()])
    # [21, 11, 17, 16, 20, 14, 21, 23, 17, 17, 18]

dump_bilm_embeddings(vocab_file='./vocab-2016-09-10.txt',
                     dataset_file='./111.txt',
                     options_file='./elmo_2x4096_512_2048cnn_2xhighway_options.json',
                     weight_file='./elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5',
                     outfile='./ooooo.hdf5'
                     )
with h5py.File('./ooooo.hdf5', 'r') as fin:
    for i in fin:
        print(fin[i][...].shape)  # [21, 11, 18, 17, 16, 20, 14, 21, 23, 17, 17]

111.txt:

A journey to the coldest inhabited place on the planet , Oymyakon , Russia - The Pole of Cold <URL> <URL>
<RT> <@> : MILESTONE : Only 200km left to go !
The Land That Never Melts - A journey through Canada with professional and personal meaning <URL> <URL>
Amateur Dramatics Under The Midnight Sun - A crossing of Northern Scandinavia on foot <URL> <URL>
Looking back to our interview with <@> as he prepared to walk the full length of the Nile … <URL>
Four lifelong friends embark on a three-year journey to cycle the world <URL> <URL>
From an 8 hour board exam to the middle of a near-whiteout in the Pika Glacier within a week <URL> <URL>
<RT> <@> : Had been reading the latest <@> for a week before I realised I was inside the front cover ... <URL>
Sky Walking in the High Sierras - Seeking out remote locations others consider too ambitious <URL> <URL>
To discover , to challenge and to inspire - Why Expeditions Are Important by <@> <URL> <URL>
Tears of the Turtle – pushing further into the deepest known cave in the continental US <URL> <URL>

yslin1995 commented 5 years ago

@matt-peters @PhilipMay

matt-peters commented 5 years ago

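The output file is a key-value store, not an array. The keys are the sentence numbers cast to str. Iterating over the file, as with a dict, yields those keys, and they are most likely returned in name (string) order rather than numeric order, so '10' comes before '2'. That matches the output you see from these lines: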

with h5py.File('./ooooo.hdf5', 'r') as fin:
    for i in fin:
        ...

Try

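with h5py.File('./ooooo.hdf5', 'r') as fin:
    # Assumes the file holds exactly one dataset per input sentence.
    num_sentences = len(fin)
    for i in range(num_sentences):
        # Look up each sentence by its index so the input order is preserved.
        print(fin[str(i)][...].shape)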
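
The reordering in the report can be reproduced without the model. This is a minimal sketch, assuming that h5py yields the keys in name (string) order by default; it uses only the sentence lengths quoted above. The helper `sentence_keys_in_order` is a hypothetical name introduced for illustration and is not part of bilm-tf or h5py.

```python
# Lengths of the 11 sentences in 111.txt, in file order (from the report above).
lengths = [21, 11, 17, 16, 20, 14, 21, 23, 17, 17, 18]

# The keys for 11 sentences are the strings '0' .. '10'.
keys = [str(i) for i in range(len(lengths))]

# Assumption: iterating the HDF5 file yields keys in name (string) order.
name_order = sorted(keys)
print(name_order)
# ['0', '1', '10', '2', '3', '4', '5', '6', '7', '8', '9']

# Reading the lengths in that key order reproduces the order seen in the report.
observed = [lengths[int(k)] for k in name_order]
print(observed)
# [21, 11, 18, 17, 16, 20, 14, 21, 23, 17, 17]


def sentence_keys_in_order(keys):
    """Sort digit-string keys by integer value (hypothetical helper)."""
    return sorted(keys, key=int)


# Sorting the keys numerically restores the original sentence order.
restored = [lengths[int(k)] for k in sentence_keys_in_order(name_order)]
print(restored == lengths)
# True
```

With an open file, the same helper could be used as `for key in sentence_keys_in_order(fin.keys()): print(fin[key].shape)`, provided every key in the file is an integer string.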
yslin1995 commented 5 years ago

Thanks so much, problem solved!