allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

bilm/model/dump_token_embeddings is not useful for token-format input? #167

Open dongrixinyu opened 5 years ago

dongrixinyu commented 5 years ago

I find that bilm/model/dump_token_embeddings is not useful for token-format input, because the char_cnn is not accessible to token-format input while dumping token embeddings:

```python
def dump_token_embeddings(vocab_file, options_file, weight_file, outfile):
    '''
    Given an input vocabulary file, dump all the token embeddings
    to the outfile.  The result can be used as the embedding_weight_file
    when constructing a BidirectionalLanguageModel.
    '''
    with open(options_file, 'r') as fin:
        options = json.load(fin)
    max_word_length = options['char_cnn']['max_characters_per_token']

    vocab = UnicodeCharsVocabulary(vocab_file, max_word_length)
    batcher = Batcher(vocab_file, max_word_length)
```
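
For context on why the `char_cnn` options are required here: `UnicodeCharsVocabulary` encodes every token as a fixed-length sequence of character ids, so even dumping token embeddings goes through the character path. Below is a simplified sketch of that per-token encoding (the sentinel ids 258/259/260 follow `bilm/data.py`, but the helper name and shape of the code are mine, not the library's exact implementation):

```python
# Simplified sketch of the per-token character encoding performed by
# UnicodeCharsVocabulary (see bilm/data.py). Ids 258/259/260 mark
# begin-of-word, end-of-word, and padding; real bytes use ids 0-255.
BOW_CHAR, EOW_CHAR, PAD_CHAR = 258, 259, 260

def word_to_char_ids(word, max_word_length=50):
    """Encode one token as UTF-8 byte ids framed by BOW/EOW, padded."""
    code = [PAD_CHAR] * max_word_length
    encoded = word.encode('utf-8', 'ignore')[:max_word_length - 2]
    code[0] = BOW_CHAR
    for k, byte in enumerate(encoded, start=1):
        code[k] = byte
    code[len(encoded) + 1] = EOW_CHAR
    return code

ids = word_to_char_ids('the')  # 't'=116, 'h'=104, 'e'=101
```

This is why pure token-format input (just token ids, no characters) cannot drive `dump_token_embeddings`: the char-CNN needs these character sequences to compute each token's embedding.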
gailysun commented 5 years ago

I have the same problem. Did you solve it?

nefujiangping commented 4 years ago

It seems that bilm/model/dump_token_embeddings computes a word embedding for each word in the vocab, but it still relies on the characters of each token (it treats a token as a character sequence, not as an independent token) because it uses UnicodeCharsVocabulary.
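
The upside is that once the character-derived embeddings have been dumped, using them on pure token-level input reduces to a row lookup by token id. A minimal numpy sketch with made-up values (in practice the matrix would come from the dumped weight file, one row per vocabulary entry):

```python
import numpy as np

# Toy stand-in for a dumped token embedding matrix: one row per vocab
# entry. Real values are produced by the char-CNN at dump time.
vocab = ['<S>', '</S>', '<UNK>', 'the', 'cat']
embeddings = np.random.RandomState(0).randn(len(vocab), 4).astype('float32')
token_to_id = {w: i for i, w in enumerate(vocab)}

def lookup(tokens):
    """Look up each token's row; unknown tokens fall back to <UNK>."""
    ids = [token_to_id.get(t, token_to_id['<UNK>']) for t in tokens]
    return embeddings[ids]

rep = lookup(['the', 'cat', 'sat'])  # shape (3, 4); 'sat' maps to <UNK>
```

Note this lookup is static per token: it gives the same vector for a word regardless of context, unlike running the full biLM over the sentence.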

If you want to dump dynamic sentence representations for a dataset at the pure token level,