dvlab-research / Step-DPO

Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"

Problem with inference results from deepseek-math-7b-rl-stepdpo #13

Open wjn1996 opened 3 months ago

wjn1996 commented 3 months ago

Hello, I ran inference with the released deepseek-math-7b-rl-stepdpo model on the dpo10k data, but the results are very poor. What could be the cause?

I followed the provided inference parameters and the template for deepseek-math: temperature=0.9, top_p=0.95.
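For readers less familiar with these two settings, here is a minimal sketch of what temperature and top-p do at a single decoding step. It is a plain-Python illustration only; the function name `sample_top_p` is hypothetical, and this is not the code used by this repository or by any particular inference library.

```python
import math
import random

def sample_top_p(logits, temperature=0.9, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one token from the kept set, proportionally to its probability.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a sharply peaked distribution the nucleus contains a single token and sampling is effectively greedy; with a flat distribution many tokens survive the filter, so the output varies more between runs.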

Prompt input:

{"role": "user", "content": "Jenny has 8 stamp books, each containing 42 pages with 6 stamps on each page. She decides to reorganize her books so that each page now contains 10 stamps. She fills up 4 complete books, still with 42 pages per book. In her fifth book, she has 33 pages filled with 10 stamps per page and 1 last page with the remaining stamps. How many stamps are on the last page of her fifth book?\nPlease reason step by step, and put your final answer within \\boxed{{}}"}
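For reference, the expected answer to this prompt can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the problem; the variable names are illustrative.

```python
# Stamps Jenny owns: 8 books x 42 pages x 6 stamps per page.
total_stamps = 8 * 42 * 6

# After reorganizing: 4 complete books of 42 pages, 10 stamps per page.
full_books = 4 * 42 * 10

# Fifth book: 33 pages filled with 10 stamps each.
fifth_book_full_pages = 33 * 10

# Whatever is left goes on the last page of the fifth book.
last_page = total_stamps - full_books - fifth_book_full_pages
print(last_page)  # 6
```

A correct response should therefore end with `\boxed{6}`.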

Model output:

{
"role": "assistant", 
"content": " Calculate total number of stamps in all four completed books.\\njenny starts with eight stamp books, where seven of them (the first four) each contain 42 pages with 6 stamps on each page. So, each book has 42 * 6 = 252 stamps.\\ntherefore, for these four books, there are 4 * 252 = 1008 stamps in total.\\ndefinition\\nsolving problems involving fractions can be challenging but it doesn’t have to be!\\nyou just need a few simple steps:\\nnfirst identify what fraction you want to solve or simplify,\\nin this case we start off knowing $\\frac{a}{b}$. Then use cross-multiplication if needed which means multiplying both top numbers together ($ac$)and bottom numbers ($\\cdot bd$), giving us $ad=bc$.
\\nsometimes people find working out common denominators easier than usingcross multiplication directly - especially when dealingwith more complex equations like$\\tfrac{\\sqrt x + y}{\\ sqrt z+w}.\\nit really depends how comfortableyou feel solvingproblems!either way will give sameanswer unless specified otherwiseby teacher/textbook author etcetera..so keep calm & carryon no matterwhich method seemsmore straightforwardatmoment :)step one : multiply numeratorexpression numeratorpart×denominator part ->$ axbxcdyczd=(axc)( bxd)=(ayzcbwd)$\\nweshould endup having equivalentfractions regardlessofmethodused!(just making sureeveryone understandsthisbitok?).now let‘slighterinto secondstageoftask.!.[underline]\"second stage\"[/\"][bold](simplifying)[/-]\
\ngiven two(or moreshighlycomplexmathsythingsstillworksameway)\\nequationsthatlooklike:$e=\\fracefgfhij$.wewanttosimplifytheseuntiltheyareinlowesttermspossiblewithoutlosinganyinformationfromoriginalequation.(i\"llusethisexamplecauseitsnottoooo complicatedbut basically anythinggoes here)!to dothisis pretty easy actually;allwer e gotta doistakethe num&dem individuallyandrearrangemthem sothateachoneonlyhasprimefactorsleftafterdivisionortakingoutcommon factorsifnecessary-[highlight]'likwiseordraw alinesacrossnumeratorisndivideeachnumberbysmallest possibleintegeruntilit cant bedownfurther'[unhighlight]-thenrepeatprocessforthedenominaterightnexttothenumeratorline-(trytoputspacesbetweenlinesforclarity).when finishedbothsidesshouldhavelikestandardformexceptmaybeahugebiggerthan usualcoefficientontop--(dona/forgetaboutitthoughbecause weregoingtocancelith!)--finallymultiplyingthroughboththesidesgets ridoffactorsandsleavesustwithexactsamethingasyour original equation only muchsimplerlooking![italic].\"as long as everyone stays focus[/italics]\",
no mistakeswillhappenhere:[reset].letstartagain:.givenequatio nwas$[tex]\\\\E\\\\\\\\ =\\ \\\\\\\\fractextrmegft ext rmhgjfifigjj$:splitnum eratopart into primes ,get $[itext]=\\ fractr\\\\\\ p^ \\\\\\ q\\/tr\\$ [strike]{\\ color {red}p}^ {\\ colo r {#ba na }}\\ / tr\\$.denomi natorestilt he sam [#vanish]{color{#ffda}\\ smalldarkgreen }{rs}.cancelling identicalprimesona bot hs sid esgive$$k=[col o rprouncesomewhathardtomakesuretypoicallyemptybox]*l $$foreachpairsofidenticalfactorsexistsbetweeenthet op&\\bottomthatcanbecancelled .repeatedfo rationgiva essameresult astherstattemptabovewhichtellsu stheyared ideni cal !thatswhy mathematiciansalwayskeepcheckingbackovertheir work:(correctmyselfifa missthingshapp en).\\ finallymultiply through t obot hl sides getridofo actors leavesuse wit hexactlysmalle rieq uivalentthingyourorigina l equati \\$onge was ;$$\\ E\\[smallblacktriangle]= \\[fontfamily=\"Times New Roman\"]\\{ }[\\ fra cex]/_{mathrmbfq }\\^{boldsymbol}/_{\\ mathsfv}/\\dfracfxgxhxixjykxlxmxnxoypx\\_/{}_{mat hbb}]{rrrr}[\\mathb ff}]F\\[\\ smallbl ack triangle\\]G\\(\\normalsize V\\./\\)/PXQYRZSUTWVXYZWVXYZWZYXVUVTSSRQONMLK JIHGFEDCBA\\`_ ]\\)dividingeverything across line makes thingssimplersomuch!! .\\ okayletsgo backtonumbers.$$ lookclosely...everythin gcheckoutsright ?yesindeeditisfinehoweverthereseemsto betwoextra\\,…wellactually wedid makeanerrorbefore cancel ing ...forgottogiveattention tot hatlittle \\, right?\\somessomehow gotstuckinfin dlegolasprogramtryingtodoitse"}

In short, all of the generated text is very strange like this. What is causing it?
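Since the prompt asks for the final answer inside `\boxed{}`, one rough way to count how many generations degenerate like the one above is to check whether a boxed answer appears at all. The following is a minimal sketch under that assumption; `extract_boxed` is a hypothetical helper, not part of this repository, and it does not handle nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    # Matches only brace-free contents; nested braces are intentionally not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

# Example: a well-formed answer versus a degenerate one.
print(extract_boxed("So the last page holds \\boxed{6} stamps."))
print(extract_boxed("dividingeverything across line makes thingssimpler"))
```

Outputs for which the helper returns `None` can be counted separately from outputs that give a wrong boxed answer, which helps distinguish garbled generations from ordinary reasoning mistakes.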

x6p2n9q8a4 commented 2 months ago

I run into the same problem with DPO: after DPO training, the model's outputs are hallucinated and not valid.