❓ The question
This is a cross-post from https://github.com/allenai/OLMo-Eval/issues/31 for visibility.
I ran `olmo_eval` with `allenai/OLMo-1B` on the Paloma dataset and noticed two issues:
1. The evaluation metrics are worse than I anticipated, especially `ppl_token`, which seems too high. I wonder whether that is an averaged or a summed measurement. I added a screenshot below, and you can see my full results in this Google Sheet.
2. This may be related to 1. I noticed that when it initialized `allenai/OLMo-1B`, there was a warning that many of the weights (if not all) were not initialized correctly. From the log below, it seems to be trying to initialize with the class `OlmoForCausalLM`.

The configuration and environment I used to reproduce these results can be found in the issue https://github.com/allenai/OLMo-Eval/issues/31.
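To make the averaged-vs-summed distinction in question 1 concrete, here is a minimal sketch. The per-token negative log-likelihood values are invented for illustration; in the real run they would come from the model's cross-entropy loss over the eval set.

```python
import math

# Hypothetical per-token negative log-likelihoods (nats) from a causal LM
# over one short document; placeholder values, not real OLMo-1B outputs.
token_nlls = [2.1, 3.4, 1.8, 2.9, 2.5]

# Token-averaged perplexity: exp of the mean NLL per token.
# This is the conventional "perplexity per token" and stays in a
# plausible range regardless of document length.
ppl_avg = math.exp(sum(token_nlls) / len(token_nlls))

# Exponentiating the *summed* NLL instead grows with sequence length
# and can produce implausibly large numbers like the ones I'm seeing.
ppl_sum = math.exp(sum(token_nlls))

print(f"averaged: {ppl_avg:.2f}, summed: {ppl_sum:.2f}")
```

If `ppl_token` is computed the second way (or averaged over documents rather than tokens), that could explain why the reported values look too high.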