dvlab-research / Step-DPO

Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"

Does step-dpo work? #19

Open hxdtest opened 1 month ago

hxdtest commented 1 month ago

The paper uses the following problem to demonstrate Step-DPO's effectiveness: "The square root of t is greater than 2 and less than 3.5. How many integer values of t satisfy this condition?" However, when I change the prompt to "The square root of t is greater than 2.3 and less than 3.5. How many integer values of t satisfy this condition?", the answer looks like this:

[three attached images]
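
As a reference point for both versions of the prompt, the expected counts can be checked by brute force. This is only a minimal sketch: the helper name `count_integer_t` is hypothetical and not part of the Step-DPO repository. It squares the bounds exactly (using `Fraction` to avoid floating-point rounding), so the condition becomes lower² < t < upper².

```python
from fractions import Fraction


def count_integer_t(lower: str, upper: str) -> int:
    """Count integers t with lower < sqrt(t) < upper (strict, non-negative bounds).

    Hypothetical helper for illustration only. Bounds are passed as decimal
    strings so that Fraction can represent them exactly.
    """
    lo_sq = Fraction(lower) ** 2  # sqrt(t) > lower  <=>  t > lower^2
    hi_sq = Fraction(upper) ** 2  # sqrt(t) < upper  <=>  t < upper^2
    # Candidate integers cannot exceed floor(upper^2).
    return sum(1 for t in range(0, int(hi_sq) + 1) if lo_sq < t < hi_sq)


original = count_integer_t("2", "3.5")    # 4 < t < 12.25    -> t = 5..12
modified = count_integer_t("2.3", "3.5")  # 5.29 < t < 12.25 -> t = 6..12
print(original, modified)  # prints "8 7"
```

So the original problem has 8 valid integers, while the modified prompt has 7, because t = 5 no longer satisfies the lower bound (√5 ≈ 2.236 < 2.3).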