Thank you very much for sharing your work; it has greatly helped my current study and research. I noticed that in your latest version of PPO training, you randomly prepend [bad] or [good] to the prompt before embedding it, and the reward is negated whenever the [bad] tag is present. Could you explain the reason for this? Why not simply use the probability of LABEL_1 as the reward? I don't understand what effect randomly selecting a portion of the samples and negating their reward has here.
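To make sure I understand the mechanism you use, here is a minimal sketch of the logic as I read it (the function names and the `raw_score` input are my own assumptions, not your actual code):

```python
import random

# Hypothetical sketch of the pattern I am asking about, not the repo's code:
# each prompt is randomly prefixed with a control tag, and the reward sign
# is flipped for [bad]-prefixed samples.
def build_prompt(question: str) -> str:
    tag = random.choice(["[good]", "[bad]"])  # random control tag
    return tag + question

def compute_reward(prompt: str, raw_score: float) -> float:
    # raw_score would be, e.g., the reward model's probability of LABEL_1.
    return -raw_score if prompt.startswith("[bad]") else raw_score
```

Is this roughly what the training loop does, and if so, what does the sign flip buy compared with always rewarding LABEL_1 directly?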
I am looking forward to your answer. Thanks a lot.