Timothy023 / RLMEC

The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"

Reward value error while generating training data for rlmec #2

Closed liminghao0914 closed 10 months ago

liminghao0914 commented 10 months ago

Hi Timothy,

Thanks for sharing your impressive work. Aside from some minor bugs in the code, I'm facing a major obstacle when generating training data for RLMEC after rewriting the generated samples with a well-trained GRM.

I followed the steps in your README and found that the rewards in the generated training data are all NaN. An example item from rlmec_qa.jsonl looks like this:

{
    "input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nthe ability to store what can be used for keeping warm\nOptions:\nlipids\nfats\nknowledge\nwater\nenergy\nfur\ndata\nheat\n\n### Response: Let's think step by step.\n",
    "output": "\n1. **Heat**: This is the most important factor in keeping warm. It's the energy required to warm up an object or person.\n2. **Fur**: It's an insulating layer that can keep your body warm. Many animals, like bears, have fur for this purpose.\n3. **Knowledge**: While it's not directly related to keeping warm, having knowledge about survival skills could be useful in emergency situations where you need to find ways to stay warm.\n4. **Data**: Data is not directly related to keeping warm.\n5. **Energy**: Energy is required for various activities, including keeping warm. Without energy, you cannot generate heat.\n6. **Water**: Water is essential for survival, but it's not directly related to keeping warm.\n7. **Lipids (Fats)**: Fats are a type of energy storage in the body. They can be converted into energy to generate heat when needed.\n8. **Fur**: As mentioned earlier, fur is an insulating layer that helps keep your body warm.\n\nSo, the best options for keeping warm are heat, fur, and lipids (fats).",
    "regular": "1. **Heat**: This is the most important factor in keeping warm. It's the energy required to warm up an object or person.\n2. **Fur**: It's an insulating layer that can keep your body warm. Many animals, like bears, have fur for this purpose.\n3. **Knowledge**: While it's not directly related to keeping warm, having knowledge about survival skills could be useful in emergency situations where you need to find ways to stay warm.\n4. **Lipids (Fats)**: Fats are a type of energy storage in the body. They can be converted into energy to generate heat when needed.\n5. **Energy**: Energy is required for various activities, including keeping warm. Without energy, you cannot generate heat.\n6. **Water**: Water is essential for survival, but it's not directly related to keeping warm.\n7. **Data**: Data is not directly related to keeping warm.\n8. **Insulated material**: Insulation is a material with the ability to store heat. The answer is heat",
    "reward": [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN
    ],
    "ref_prob": [...],
    "weight_regular": [...]
}
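
For context, this is the quick check I used to confirm that every reward is NaN (a throwaway script, assuming the rlmec_qa.jsonl layout above; note that Python's json module parses the bare NaN literals as float('nan') by default):

```python
import json
import math

# Scan the generated RLMEC training data and report NaN rewards per sample.
with open("rlmec_qa.jsonl") as f:
    for i, line in enumerate(f):
        item = json.loads(line)  # json.loads accepts bare NaN -> float('nan')
        rewards = item["reward"]
        nan_count = sum(1 for r in rewards if isinstance(r, float) and math.isnan(r))
        if nan_count:
            print(f"sample {i}: {nan_count}/{len(rewards)} rewards are NaN")
```

Every sample in my run reports a full NaN count.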

This seems abnormal. To reproduce the pipeline, I used vicuna-7b-1.5 as the base model and gpt-4 as the teacher model, and reduced the teacher-model data to 512 samples for debugging (also to save tokens xD). I didn't change the parameters in the shell scripts, except that instead of using torchrun with data parallelism and bf16, I trained the GRM with model parallelism in float16 on 4 V100s (32GB).

Could the problem come from the GRM being trained on such a small teacher-model dataset? I would greatly appreciate it if you could share your datasets at your convenience.

Thanks again for open-sourcing this work. Looking forward to your prompt reply.

Timothy023 commented 10 months ago

Thanks for your interest in our work.

Regarding the NaN problem, I guess it is caused by the float16 training process of the GRM. A potential solution is to adjust the hyper-parameters, for example by setting a smaller learning rate.
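
As an illustration of why float16 tends to produce NaNs here (a minimal sketch, not the repository's actual reward code): the log-softmax used to compute per-token probabilities can under- or overflow in half precision, and a single non-finite value then propagates into every downstream reward. Upcasting the logits to float32 for that step usually keeps the values finite:

```python
import torch
import torch.nn.functional as F

def token_log_probs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities, computed in float32 for stability.

    logits: (batch, seq_len, vocab) model outputs, possibly float16
    labels: (batch, seq_len) target token ids
    """
    # Upcast before log_softmax: in float16 the softmax normalizer can
    # overflow/underflow and turn the whole distribution into NaN.
    log_probs = F.log_softmax(logits.float(), dim=-1)
    return log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
```

If the GRM training itself diverges in float16, dynamic loss scaling (e.g., torch.cuda.amp.GradScaler) is the standard mitigation on V100s, which do not support bf16.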

For the training datasets, we will consider whether to release them.

Hope this answer helps.

liminghao0914 commented 10 months ago

TY, I'll try it.