OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0
548 stars 74 forks source link

[BUG] Signal killed caused by Adam Offload #164

Open MayDomine opened 1 year ago

MayDomine commented 1 year ago

Is there an existing issue for this?

Description of the Bug

When I try to run the finetune script of cpm-live-10b , adam offload optmizer caused the signal killed.And i tried to set the cpu level of adam to 0 which system wont use avx anymore.But it doesn't work.

Environment Information

- GCC version:9.4.0
- Torch version:1.13.0
- Linux system version:Ubuntu 20.04
- CUDA version:11.6+
- Torch's CUDA version (as per `torch.cuda.version()`):1.13 + 11.6
- CPU core: 48

To Reproduce

bash CPM-Bee/src/script/finetune_cpm_bee_10b.sh

Expected Behavior

It works normal

Screenshots

No response

Additional Information

No response

Confirmation