YangRui2015 / RiC

Code for the ICML 2024 paper "Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment"

bf16 precision #4

Closed Davido111200 closed 3 months ago

Davido111200 commented 3 months ago

Hi,

Thank you for sharing this amazing work! I have a question regarding the GPUs used to run this code. In your paper, you mentioned that V100 GPUs were utilized. However, I encountered an error while using V100 GPUs myself.

ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

I tried setting the precision to fp16, but it results in another error. Is there a way to get around this? Thank you!

YangRui2015 commented 3 months ago

Hi,

Thank you for reporting the issue. Yes, V100 GPUs do not support bf16 training. This issue likely slipped in because I cleaned up the code and tested it on an A6000 GPU before uploading it to GitHub.

To resolve this, please try commenting out line 49 in ric/training.py, where it sets "bf16=True". This should resolve the issue.
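For reference, here is a minimal sketch of what that change amounts to. The `bf16`/`fp16` flags follow the Hugging Face `TrainingArguments` API; the capability check and the `output_dir` value are illustrative assumptions, not part of the repo:

```python
import torch
from transformers import TrainingArguments

# bf16 is only available on Ampere or newer GPUs (compute capability >= 8.0);
# on a V100 (compute capability 7.0) fall back to plain fp32.
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8

training_args = TrainingArguments(
    output_dir="./output",   # placeholder path
    bf16=use_bf16,           # equivalent to commenting out "bf16=True" on V100
    # fp16=True,             # fp16 also caused errors here, so leave it disabled
)
```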

Davido111200 commented 3 months ago

Thank you for your response. I decided to go for A100 GPUs to avoid the precision problem.

Also, could you please specify the torch version you used? With my current torch==2.1.0, there seem to be a lot of issues related to torch.dynamo, so RiC training cannot be run.

YangRui2015 commented 3 months ago

I recently used torch==2.0.1 on an A6000 and torch==2.1.2 on an A100. In addition, a V100 can run in fp32 by default, without specifying "bf16=True" or "fp16=True".

It seems that your CUDA and torch versions may be mismatched. You can reinstall torch and test with torch.cuda.is_available().
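A quick sanity check along those lines, using only standard PyTorch calls (nothing here is specific to this repo):

```python
import torch

print(torch.__version__)                   # e.g. 2.0.1 or 2.1.2
print(torch.version.cuda)                  # CUDA version torch was built against
print(torch.cuda.is_available())           # False usually indicates a torch/CUDA mismatch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "Tesla V100-SXM2-32GB"
    print(torch.cuda.is_bf16_supported())  # False on V100, True on A100/A6000
```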

Davido111200 commented 3 months ago

I double-checked my CUDA compatibility, and it doesn't seem to be the problem. However, I found a way to work around the bug. Thanks for taking a look.