allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

adding Archangel models (dpo, kto, sft+dpo, sft+kto) #84

Closed by kawine 3 months ago

kawine commented 3 months ago

The Archangel suite contains DPO, SFT+DPO, KTO, and SFT+KTO models, which can also be used as reward models: https://huggingface.co/collections/ContextualAI/archangel-65bd45029fa020161b052430

For each method, there are seven models available: pythia-{1.4, 2.8, 6.9, 12.0}B and llama-{7, 13, 30}B, all of which have been aligned under nearly identical settings on {Anthropic HH, Open Assistant, SHP 1.0} data.

The implied reward for both DPO- and KTO-aligned models is $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, where $\pi_{\text{ref}}$ is the reference model.
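
For concreteness, here is a minimal sketch (not code from the Archangel release) of how that implied reward could be scored with Hugging Face `transformers`. The model IDs, the reference model, and the `BETA` value are placeholders; substitute the actual Archangel policy, its reference model from the table below, and the beta used during training.

```python
# Sketch: implied reward = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# Model names and beta are placeholders, not the actual Archangel settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_NAME = "ContextualAI/archangel_dpo_pythia1-4b"  # placeholder: an Archangel policy
REF_NAME = "EleutherAI/pythia-1.4b"                    # placeholder: its reference model
BETA = 0.1                                             # placeholder: beta used in training

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME).eval()
ref = AutoModelForCausalLM.from_pretrained(REF_NAME).eval()

@torch.no_grad()
def sum_logprob(model, prompt: str, completion: str) -> float:
    """Sum of log p(y_t | x, y_<t) over the completion tokens only.

    Note: tokenizing prompt and prompt+completion separately is a
    simplification; boundary tokens can differ for some tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]       # position i predicts token i+1
    targets = full_ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)
    ).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return logprobs[:, n_prompt - 1:].sum().item()   # keep only completion positions

def implied_reward(prompt: str, completion: str) -> float:
    """Higher means the aligned model prefers this completion."""
    return BETA * (
        sum_logprob(policy, prompt, completion) - sum_logprob(ref, prompt, completion)
    )

print(implied_reward("What is the capital of France?\n", "Paris."))
```

To rank two completions of the same prompt as a reward model would, compare their `implied_reward` values; the common `BETA` factor does not change the ordering, but it matters if the scores are compared across models.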

The reference model for each set of models in Archangel is as follows: