dongrixinyu opened 5 years ago
I have the same problem. Did you solve it?
It seems that `bilm/model/dump_token_embeddings` is meant to produce a word embedding for each word in the vocab, but it still relies on the characters of each token (it treats a token as a character sequence, not as an independent unit), because it uses `UnicodeCharsVocabulary`.
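For illustration, here is a minimal sketch of what that means in practice, using `UnicodeCharsVocabulary` from `bilm/data.py` (the vocab path is hypothetical, and `max_word_length=50` just mirrors the repo's default `max_characters_per_token`):

```python
from bilm.data import UnicodeCharsVocabulary

# Hypothetical vocab file; max_word_length should match
# options['char_cnn']['max_characters_per_token'].
vocab = UnicodeCharsVocabulary('vocab.txt', max_word_length=50)

# Each token is expanded into a fixed-length, padded sequence of
# character ids, so its embedding comes from the char CNN rather
# than from a per-token lookup table.
char_ids = vocab.word_to_char_ids('hello')
print(char_ids.shape)  # (50,)
```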
If you want to dump dynamic sentence representations for a dataset at the pure token level, change this line in `bin/train_elmo.py`:

```python
vocab = load_vocab(args.vocab_file, 50)
```

to

```python
vocab = load_vocab(args.vocab_file, None)
```
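For context, `load_vocab` in `bilm/training.py` switches vocabulary types based on that second argument (paraphrased from the repo; check your checkout):

```python
from bilm.data import Vocabulary, UnicodeCharsVocabulary

def load_vocab(vocab_file, max_word_length=None):
    # With a max_word_length, tokens are encoded as character sequences
    # for the char CNN; with None, each token is an opaque id.
    if max_word_length:
        return UnicodeCharsVocabulary(vocab_file, max_word_length,
                                      validate_file=True)
    return Vocabulary(vocab_file, validate_file=True)
```

Note that passing `None` only changes the vocabulary. If I read the training code correctly, you also need to drop the `'char_cnn'` section from the options dict in `bin/train_elmo.py`, since the model enables character inputs whenever `'char_cnn'` is present in the options.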
I find that `bilm/model/dump_token_embeddings` is not useful for token-format input, because the char_cnn is not applicable to token-format input while dumping token embeddings:
```python
def dump_token_embeddings(vocab_file, options_file, weight_file, outfile):
    '''
    Given an input vocabulary file, dump all the token embeddings to the
    outfile. The result can be used as the embedding_weight_file when
    constructing a BidirectionalLanguageModel.
    '''
    with open(options_file, 'r') as fin:
        options = json.load(fin)
    max_word_length = options['char_cnn']['max_characters_per_token']
```
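Since the first thing this function does is read `options['char_cnn']['max_characters_per_token']`, it will fail with a `KeyError` for a model trained without a char_cnn. As a workaround for a pure token-level model, you could read the token-embedding matrix straight out of the dumped weight file with h5py. This is only a sketch: the paths are hypothetical, and the dataset name `'embedding'` is an assumption about the weight-file layout, so inspect the keys first.

```python
import h5py

weight_file = 'weights.hdf5'       # hypothetical: output of bin/dump_weights.py
outfile = 'token_embeddings.hdf5'  # hypothetical

with h5py.File(weight_file, 'r') as fin:
    print(list(fin.keys()))        # confirm the layout of your weight file
    # Assumption: the token-embedding matrix is stored under 'embedding',
    # with shape (n_tokens, embedding_dim).
    embed = fin['embedding'][...]

with h5py.File(outfile, 'w') as fout:
    fout.create_dataset('embedding', embed.shape, dtype='float32', data=embed)
```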