Hey,
Great observations, and nice work on disentangling format following from reasoning! Could you share details on the evaluation dataset you used and how to reproduce the results in the paper? I fine-tuned Llama 3 on the dataset and got worse performance on 30 questions curated from the HotpotQA dataset. If you could shed some light on this, it would be much appreciated!

Thanks,
Jason