I need to get per-sentence perplexity (ppl) for millions of lines. Splitting them into files each containing a single sentence would be time-consuming. Is it possible to achieve this by modifying the dataloader? For example, by giving the model input of shape (num_sentences, num_tokens, max_characters_per_token). The problem is how to pad sentences that don't have enough tokens. If this would work, would such padding affect the state carried over to the next batch? If not, any other suggestions?
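To make the padding question concrete, here is a minimal sketch of what I have in mind (all names here are hypothetical, and the per-token NLLs are assumed to come from the model): pad each sentence to the batch's max token length, keep a boolean mask of real positions, and use the mask so pad tokens never contribute to a sentence's perplexity.

```python
import numpy as np

PAD_ID = 0  # hypothetical padding token id

def pad_batch(sentences, pad_id=PAD_ID):
    """Pad variable-length token-id lists into a (num_sentences, max_len)
    array, plus a boolean mask marking the real (non-pad) positions."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sentences), max_len), dtype=bool)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

def sentence_ppl(token_nll, mask):
    """Per-sentence perplexity from per-token negative log-likelihoods,
    averaging only over the unmasked (real) positions."""
    nll_sum = (token_nll * mask).sum(axis=1)
    lengths = mask.sum(axis=1)
    return np.exp(nll_sum / lengths)
```

The open question is still whether the pad positions (even when masked out of the loss) would corrupt the recurrent state that gets carried into the next batch.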