HazyResearch / based

Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
Apache License 2.0

Evaluating a self-trained model from scratch using the LM Evaluation Harness #16

Open lisc110h opened 3 months ago

lisc110h commented 3 months ago

I hope this message finds you well.

I have recently started researching DNN architectures and am currently studying the BASED codebase. BASED is an excellent piece of work and has been extremely helpful.

I would like to train BASED from scratch and evaluate it following the procedure outlined for the LM Evaluation Harness. However, the evaluation method described in the Evaluate section is intended for HuggingFace models, and I am having trouble evaluating the model I trained myself. If possible, could you explain how to perform this evaluation, or, if that is not feasible, offer some guidance on how to proceed?

simran-arora commented 3 months ago

Hi, thank you for your question! I just added files in this commit that you could adapt: https://github.com/HazyResearch/based-evaluation-harness/commit/c6a3427b3ac6f039922e60004a4dfe60d370f9e0

Essentially, you can take the "Run path" model tag from the "Overview" tab of WandB for your training run and plug it into the .sh file I provided. Uncomment the local_lm import in the lm_eval/models/__init__.py file; you may need to adjust the import paths for your setup, but otherwise this should pull the model relatively easily. Let us know if you have questions!
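
For illustration only, here is a rough sketch (not the code from that commit) of how a WandB "Run path" resolves to a run's config and checkpoint; the local_lm class does its own version of this, and the checkpoint filename below is just a placeholder:

```python
# Hypothetical sketch: resolve a WandB "Run path" (entity/project/run_id) to the
# training config and a checkpoint file. The local_lm class added in the commit
# handles this internally; the filename "last.ckpt" is an assumption.
import wandb

run_path = "your-entity/your-project/run-id"  # copied from the WandB "Overview" tab

api = wandb.Api()
run = api.run(run_path)

print(run.config)  # hyperparameters logged during training

# Download a checkpoint file attached to the run (name and layout will vary).
ckpt_file = run.file("last.ckpt").download(replace=True)
print(ckpt_file.name)
```

In practice you should not need to do this by hand; the .sh script plus the uncommented import should take care of it.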

lisc110h commented 2 days ago

After various adjustments and experiments based on the source code you provided, I was finally able to evaluate the accuracy of the model I trained locally. Thank you very much for your help (I also managed to resolve several bugs along the way).

Now I am attempting to reproduce the results from Table 1 (or B.5) of your latest paper, "Just Read Twice: Closing the Recall Gap for Recurrent Language Models". I have been running lm_eval without any special options, specifying only the task and model options, as I did for the Reasoning evaluation. May I assume that this replicates the evaluation conditions described in the paper?

However, I am having difficulty matching the accuracy reported for BASED in the paper. If specific settings are required for each dataset, I would appreciate it if you could let me know so I can review the source code accordingly.

simran-arora commented 2 hours ago

Hi -- the JRT approach has its own specialized lm-eval-harness in that repo; are you using it?

In JRT, I needed to make sure we handle the limited context length appropriately and also do left-hand-side padding, so the harness has some custom aspects: https://github.com/HazyResearch/prefix-linear-attention/tree/main/lm-eval-harness

^ here is the code, and it should reproduce everything in the JRT paper.
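
For intuition, here is a minimal, self-contained sketch of what left-hand-side padding with a capped context length looks like using a Hugging Face tokenizer; this is illustrative only and not the JRT harness code (the tokenizer and max length below are stand-ins):

```python
# Minimal illustration (not the JRT harness): left-side padding and left-side
# truncation so the most recent tokens sit at the end of each sequence, which is
# what a recurrent model processes last. Tokenizer and max length are stand-ins.
from transformers import AutoTokenizer

MAX_LEN = 2048  # assumed context budget; the real harness sets its own limit

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token   # GPT-2 has no pad token by default
tok.padding_side = "left"       # pad on the left
tok.truncation_side = "left"    # drop the oldest tokens when over budget

batch = tok(
    ["a short prompt", "a much longer prompt that might exceed the window ..."],
    padding="longest",
    truncation=True,
    max_length=MAX_LEN,
    return_tensors="pt",
)
# batch["input_ids"] is left-padded; batch["attention_mask"] marks the real tokens.
```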