Hi, great work! I am having some issues with training Llama-2-13b-chat on the Anthropic HH dataset.
I followed the README to train SFT on HH and then DPO.
The only things I changed were setting `policy_dtype: bfloat16` to use Flash Attention V2, and adjusting the tokenization so that it matches the Llama-2 instruction-following format. Here are examples of the tokens:
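For context, here is a minimal sketch of the single-turn Llama-2 chat format I am targeting. The `[INST]`/`<<SYS>>` delimiters follow Meta's published Llama-2 chat template; the helper function name is my own and hypothetical:

```python
from typing import Optional

def format_llama2_prompt(user_msg: str, system_msg: Optional[str] = None) -> str:
    """Wrap one user turn in Llama-2 chat delimiters ([INST] ... [/INST]).

    An optional system message is embedded inside the first user turn
    between <<SYS>> tags, per Meta's template.
    """
    if system_msg:
        user_msg = f"<<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg}"
    return f"<s>[INST] {user_msg} [/INST]"

prompt = format_llama2_prompt("How do I bake bread?")
```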
However, I found that the reward accuracies are no better than 50% (see below), and comparison performance is worse than before (as evaluated by GPT-4).
Our system cannot access public Wandb, so I don't have a wandb link or better metrics to share for diagnosis.