allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Perplexity per sentence implementation? #213

Open BigBorg opened 5 years ago

BigBorg commented 5 years ago

I need to compute perplexity per sentence for millions of lines. Splitting them into files that each contain a single sentence would be time-consuming. Is it possible to achieve this by modifying the data loader, for example by feeding the model input shaped (num_sentences, num_tokens, max_characters_per_token)? The problem is how to pad sentences that don't have enough tokens. If this would work, would such padding affect the LSTM state carried into the next batch? If not, are there any other suggestions?
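Not an answer from the maintainers, but the padding concern can be handled with a mask: compute per-token losses for the padded batch, zero out the positions that belong to padding, and normalize each sentence by its true length. A minimal sketch in NumPy (the loss values here are dummies; how you extract per-token negative log-likelihoods from bilm-tf itself is not shown):

```python
import numpy as np

def sentence_perplexities(token_nll, mask):
    """Per-sentence perplexity from a padded batch.

    token_nll: (num_sentences, max_tokens) negative log-likelihood
               (natural log) of each token under the model.
    mask:      (num_sentences, max_tokens) 1 for real tokens,
               0 for padding positions.
    """
    token_nll = np.asarray(token_nll, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    # Zero out losses at padded positions so they don't contribute.
    total_nll = (token_nll * mask).sum(axis=1)
    lengths = mask.sum(axis=1)
    # ppl = exp(mean per-token NLL over real tokens only).
    return np.exp(total_nll / lengths)

# Two sentences, the second padded from length 2 up to 4.
nll = [[1.0, 2.0, 1.0, 2.0],
       [1.0, 1.0, 9.0, 9.0]]   # last two entries are padding garbage
mask = [[1, 1, 1, 1],
        [1, 1, 0, 0]]
ppl = sentence_perplexities(nll, mask)
# ppl[0] == exp(1.5), ppl[1] == exp(1.0); padding has no effect.
```

Whether padding contaminates the recurrent state carried into the next batch is a separate question about the training/eval loop; the mask only guarantees that padded positions don't affect the reported loss.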

demeiyan commented 4 years ago

(screenshot of a modified evaluation loop)

Adding a `batch_losses` list and appending the losses to it lets you get per-sentence perplexity for one batch.
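The screenshot is not legible in text form, but the suggested pattern amounts to this: instead of averaging the loss over the whole corpus, collect it batch by batch and exponentiate each entry. A hedged sketch, where `run_eval_step` is a hypothetical stand-in for whatever op returns the mean token NLL of a batch in your setup:

```python
import math

def eval_with_batch_losses(batches, run_eval_step):
    """Collect one loss per batch instead of a single corpus average.

    batches:       an iterable of model inputs.
    run_eval_step: callable returning the mean per-token negative
                   log-likelihood (natural log) for one batch.
    """
    batch_losses = []
    for batch in batches:
        loss = run_eval_step(batch)
        batch_losses.append(loss)
    # Perplexity per batch; with batch size 1 this is per-sentence ppl.
    return [math.exp(loss) for loss in batch_losses]

# Toy usage: pretend each "batch" is a list of per-token NLLs and the
# eval step just averages them.
ppls = eval_with_batch_losses([[1.0, 1.0], [2.0]],
                              lambda b: sum(b) / len(b))
# ppls == [e, e**2]
```

With a batch size of one sentence per batch, this directly yields the per-sentence perplexities the original question asks for, without splitting the corpus into one file per sentence.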