XinyuanWangCS / PromptAgent

This is the official repo for "PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization". PromptAgent is a novel automatic prompt optimization method that autonomously crafts prompts equivalent in quality to those handcrafted by experts, i.e., expert-level prompts.
https://arxiv.org/abs/2310.16427
Apache License 2.0

Did not wait for the output #7

Closed · jzyxn closed this 3 months ago

jzyxn commented 3 months ago

[screenshot: Quicker_20240512_151305]

I used TinyLlama-1.1B-Chat-v1.0 for testing (the temporal_sequences dataset only has 16 items after deletion), but after running for 30 minutes I saw no output and the first iteration still hadn't completed.

Part of the process is as follows: [screenshot: Quicker_20240512_152130]

XinyuanWangCS commented 3 months ago

If I understand correctly, this is because PromptAgent uses error examples for self-reflection. If there are no errors, the while loop doesn't stop. Later, I will add a similar prompt to learn from correct examples. This may solve your issue, but it will differ from the original method in the paper. I'd also like to add that PromptAgent (our prompt optimization method) is designed for somewhat harder tasks, i.e., ones where the base model makes errors. If the base model can already solve the task with a simple prompt, there is no need to spend money optimizing the prompt for it.
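
For context, here is a minimal sketch of the failure mode described above. The function and variable names are illustrative only, not the actual PromptAgent code: the reflection step keeps sampling batches until it finds wrong predictions, so a base model that answers everything correctly never exits the loop.

```python
import random

def collect_error_examples(dataset, prompt, predict_fn, batch_size=5):
    """Keep sampling batches until at least one wrong prediction is found.

    `predict_fn(prompt, question)` stands in for the base-model call; all
    names here are illustrative, not the repo's actual code.
    """
    errors = []
    while not errors:  # never terminates if the model answers every item correctly
        batch = random.sample(dataset, min(batch_size, len(dataset)))
        for example in batch:
            prediction = predict_fn(prompt, example["question"])
            if prediction != example["answer"]:
                errors.append((example, prediction))
    return errors  # later fed to the optimizer model for self-reflection
```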

jzyxn commented 3 months ago

Okay, I probably get it.

CSJDeveloper commented 3 months ago

I do not think this algorithm can work, because the code is totally different from the one reported in the ICLR paper. The main reason is that once there is a problem, the tree structure should be very simple and the search should converge even faster.

If the task is simpler, convergence should be faster, rather than blocking there (as you reported) without doing anything.

Therefore, I have checked the code, and I personally do not agree with the author.

So, try other things; we should wait until the authors make more updates so that this project at least runs smoothly.

It is a bit sad, but that's the way it is in the research world nowadays: stories outweigh implementation.

XinyuanWangCS commented 3 months ago

Thanks for pointing out the issue. I agree that for a simpler task the convergence speed is faster, and it actually is: paths are early-stopped if prompt 0 is already very good and the new prompts are no better, so the search tree is shallower. But there are two things to clarify or discuss.

First, what this all-correct case "blocks" is the self-reflection step, which tries to extract information from data points. It is not a problem with the MCTS algorithm.

Second, we did consider the case where an all-correct batch happens, and we resample batches. But I apologize that we didn't consider what happens if the whole training set is all correct. All correct on the training set means a really good prompt (or the task is already solved by the model). It is then a question whether we should do the searching at all if state 0 has already reached the maximum reward: if the robot has already been placed at the goal point, should it take any actions? I can certainly break the loop as you said, making the algorithm converge faster, but then the prompt will not be updated (no new prompts). Users may expect to see some new prompts, so I am considering adding more to the information-extraction part. This will be different from the paper, but it will make the code more robust for users from an application perspective.

Sorry for letting you run into these issues. But I think our code does follow the method in the paper, not "totally different". I have graduated, so updates to the code are a bit slow, but please bring up any issue you meet and I will fix it later. I hope this can make you feel better about the research world.
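
As a rough sketch of the guard discussed above (illustrative names and defaults, not the repo's actual code): resample a bounded number of batches, and if no errors can be found, return nothing so the search can early-stop instead of blocking in the reflection loop.

```python
import random

def collect_error_examples_bounded(dataset, prompt, predict_fn,
                                   batch_size=5, max_resamples=10):
    """Resample a limited number of batches looking for wrong predictions.

    Returns an empty list when no errors are found, so the caller (e.g. the
    tree-expansion step) can early-stop this path instead of blocking.
    All names are illustrative assumptions, not the actual PromptAgent code.
    """
    for _ in range(max_resamples):
        batch = random.sample(dataset, min(batch_size, len(dataset)))
        errors = []
        for example in batch:
            prediction = predict_fn(prompt, example["question"])
            if prediction != example["answer"]:
                errors.append((example, prediction))
        if errors:
            return errors
    # No errors after several resamples: the prompt already solves the sampled
    # data (maximum reward), so skip self-reflection and let the search stop early.
    return []
```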

jzyxn commented 3 months ago

I thank you both for your academic rigour and look forward to seeing more robust code.

CSJDeveloper commented 3 months ago

Thanks a lot for the further reply. More discussion is always better for gaining insights. Allowing LLMs to self-check and then go back and revise their reasoning is a core direction in this domain, at least from my perspective. I track many important works, such as Boosting of Thoughts, "Large Language Models Cannot Self-Correct Reasoning Yet", and your PromptAgent. For some of them, I am still waiting for the code to be released. Anyway, please let me know when the code is fully ready! Thanks! I hope this direction at least makes sense.

XinyuanWangCS commented 3 months ago

I have updated the code. The program will no longer get stuck because of the all-correct issue. You can change the arguments in the example_config.yaml file to use your own base model, such as TinyLlama.

I gave TinyLlama a quick try, and I don't think this small model is able to understand the long prompt. Sometimes it generates nothing, sometimes repeated sentences, and sometimes it generates an answer but not in the correct format (X), so the answer still can't be extracted from the response. I use some general arguments (max_new_tokens, repetition_penalty) for the Hugging Face text-generation model; you can check and modify them, or add new arguments, in the hf_textgeneration_model.py file in the language_model folder.

I would recommend using an instruction-tuned Llama-7B or Mistral-7B as the base model, since, as we said, expert-level prompts are for advanced models. Also, the optimizer model should be a strong model like GPT-3.5.
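
For reference, here is a hedged sketch of the kind of generation settings and answer extraction being discussed, using the standard Hugging Face transformers API. The argument values and the "(X)" option-letter format are assumptions based on this thread, not the repo's defaults:

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or an instruction-tuned 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate(prompt: str) -> str:
    # Generation arguments analogous to those mentioned above; values are examples.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        repetition_penalty=1.2,
        do_sample=False,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def extract_answer(response: str) -> str | None:
    # Tasks like temporal_sequences expect an option letter such as "(C)";
    # if the model ignores that format, the answer cannot be extracted.
    match = re.search(r"\(([A-F])\)", response)
    return match.group(1) if match else None
```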

jzyxn commented 3 months ago

Thanks again for the code changes and for the discussion with the other commenter.

CSJDeveloper commented 3 months ago

Thanks! In a few days, I will try to reproduce Table 5 and Figure 5 of the paper. From my perspective, self-improvement with LLMs is a wild idea, since LLMs frequently hallucinate (especially small-scale ones like Llama-7B), and complex mechanisms are needed to avoid this before it can work. This may be a good try. I will discuss it with you then.

Thanks again.