Timothy023 / RLMEC

The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"

Reward value error while generating training data for rlmec #2

Closed liminghao0914 closed 10 months ago

liminghao0914 commented 10 months ago

Hi Timothy,

Thanks for sharing your impressive work. Aside from some minor bugs in the code, I'm facing a major obstacle when generating training data for RLMEC after rewriting the generated samples with a well-trained GRM.

I followed the steps in your README and found that the rewards in the generated training data are all NaN. An example item from rlmec_qa.jsonl looks like this:

{
    "input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nthe ability to store what can be used for keeping warm\nOptions:\nlipids\nfats\nknowledge\nwater\nenergy\nfur\ndata\nheat\n\n### Response: Let's think step by step.\n",
    "output": "\n1. **Heat**: This is the most important factor in keeping warm. It's the energy required to warm up an object or person.\n2. **Fur**: It's an insulating layer that can keep your body warm. Many animals, like bears, have fur for this purpose.\n3. **Knowledge**: While it's not directly related to keeping warm, having knowledge about survival skills could be useful in emergency situations where you need to find ways to stay warm.\n4. **Data**: Data is not directly related to keeping warm.\n5. **Energy**: Energy is required for various activities, including keeping warm. Without energy, you cannot generate heat.\n6. **Water**: Water is essential for survival, but it's not directly related to keeping warm.\n7. **Lipids (Fats)**: Fats are a type of energy storage in the body. They can be converted into energy to generate heat when needed.\n8. **Fur**: As mentioned earlier, fur is an insulating layer that helps keep your body warm.\n\nSo, the best options for keeping warm are heat, fur, and lipids (fats).",
    "regular": "1. **Heat**: This is the most important factor in keeping warm. It's the energy required to warm up an object or person.\n2. **Fur**: It's an insulating layer that can keep your body warm. Many animals, like bears, have fur for this purpose.\n3. **Knowledge**: While it's not directly related to keeping warm, having knowledge about survival skills could be useful in emergency situations where you need to find ways to stay warm.\n4. **Lipids (Fats)**: Fats are a type of energy storage in the body. They can be converted into energy to generate heat when needed.\n5. **Energy**: Energy is required for various activities, including keeping warm. Without energy, you cannot generate heat.\n6. **Water**: Water is essential for survival, but it's not directly related to keeping warm.\n7. **Data**: Data is not directly related to keeping warm.\n8. **Insulated material**: Insulation is a material with the ability to store heat. The answer is heat",
    "reward": [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN
    ],
    "ref_prob": [...],
    "weight_regular": [...]
}
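
For context, this is the quick check I used to confirm that every reward is NaN (a throwaway script, assuming the rlmec_qa.jsonl layout above; note that Python's json module parses the bare NaN literals as float('nan') by default):

```python
import json
import math

# Scan the generated RLMEC training data and report NaN rewards per sample.
with open("rlmec_qa.jsonl") as f:
    for i, line in enumerate(f):
        item = json.loads(line)  # json.loads accepts bare NaN -> float('nan')
        rewards = item["reward"]
        nan_count = sum(1 for r in rewards if isinstance(r, float) and math.isnan(r))
        if nan_count:
            print(f"sample {i}: {nan_count}/{len(rewards)} rewards are NaN")
```

Every sample in my run reports a full NaN count.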

This seems abnormal. To reproduce the pipeline, I used vicuna-7b-1.5 as the base model and gpt-4 as the teacher model, and reduced the teacher-model data to 512 samples for debugging (also to save tokens xD). I didn't change the parameters in the shell scripts, except that instead of using torchrun with data parallelism and bf16, I trained the GRM with model parallelism in float16 on 4 V100s (32GB).

Could the problem come from the GRM being trained on such a small teacher-model dataset? I would greatly appreciate it if you could share your datasets at your convenience.

Thanks again for open-sourcing this work. Looking forward to your prompt reply.

Timothy023 commented 10 months ago

Thanks for your interest in our work.

Regarding the NaN problem, I guess it is caused by the float16 training process of the GRM. A potential solution is to adjust the hyper-parameters, for example by setting a smaller learning rate.
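
As an illustration of why float16 tends to produce NaNs here (a minimal sketch, not the repository's actual reward code): the log-softmax used to compute per-token probabilities can under- or overflow in half precision, and a single non-finite value then propagates into every downstream reward. Upcasting the logits to float32 for that step usually keeps the values finite:

```python
import torch
import torch.nn.functional as F

def token_log_probs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities, computed in float32 for stability.

    logits: (batch, seq_len, vocab) model outputs, possibly float16
    labels: (batch, seq_len) target token ids
    """
    # Upcast before log_softmax: in float16 the softmax normalizer can
    # overflow/underflow and turn the whole distribution into NaN.
    log_probs = F.log_softmax(logits.float(), dim=-1)
    return log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
```

If the GRM training itself diverges in float16, dynamic loss scaling (e.g., torch.cuda.amp.GradScaler) is the standard mitigation on V100s, which do not support bf16.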

For the training datasets, we will consider whether to release them.

Hope this answer helps.

liminghao0914 commented 10 months ago

TY, I'll try it.