Batch eval and lm-eval update

Add OpenAI batch API feature for cheaper LM judge implementation, along with a few tool scripts to manage batch jobs.
Upgrade lm-eval compatibility. Recent models, like Llama 3.2, are supported. --apply_chat_template option is available with generic lm-eval support.
Push template for Llama and Qwen model families. Note that --chat_template should be general if --apply_chat_template is enabled to prevent double templates.

allenai / SciRIFF