charent / ChatLM-mini-Chinese

A 0.2B Chinese dialogue model (ChatLM-Chinese-0.2B). All code for the full pipeline is open source: dataset sources, data cleaning, tokenizer training, model pretraining, SFT instruction fine-tuning, RLHF optimization, and more. Supports SFT fine-tuning for downstream tasks, with a triple information extraction fine-tuning example.
Apache License 2.0

Pretraining with 1.6 million samples (about 2 GB of sentence pairs) on an A40 with 48 GB of VRAM reports OOM regardless of whether 1/2/3/4 GPUs are used #51

Closed JaymzWang closed 3 months ago

JaymzWang commented 5 months ago

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using auto half precision backend
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Running training
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 8
Total optimization steps = 25,458
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:00<?, ?it/s]
/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2692: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
Running training
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 8
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 8
Total optimization steps = 50,918
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:01<?, ?it/s]
{'loss': 10.8039, 'grad_norm': 44.39530563354492, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
0%| | 9/50918 [00:22<35:08:12, 2.48s/it]
Running training
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 4
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 8
Total optimization steps = 101,836
Number of trainable parameters = 223,395,072
0%| | 9/50918 [00:25<40:38:53, 2.87s/it]
{'loss': 10.6074, 'grad_norm': 47.77275085449219, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 10.1781, 'grad_norm': 27.392656326293945, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
0%| | 73/101836 [01:20<50:52:08, 1.80s/it]
Running training
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 2
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 8
Total optimization steps = 203,674
Number of trainable parameters = 223,395,072
0%| | 73/101836 [01:23<32:14:29, 1.14s/it]
{'loss': 9.493, 'grad_norm': 18.48982048034668, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 9.3833, 'grad_norm': 24.369417190551758, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 9.328, 'grad_norm': 31.319684982299805, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
0%| | 146/203674 [01:53<50:35:35, 1.12it/s]
Running training
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 8
Total optimization steps = 407,348
Number of trainable parameters = 223,395,072
0%| | 146/203674 [01:56<45:07:49, 1.25it/s]
{'loss': 9.116, 'grad_norm': 35.11280822753906, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 8.8506, 'grad_norm': 42.10904312133789, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 8.7471, 'grad_norm': 67.85017395019531, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
{'loss': 8.6334, 'grad_norm': 60.81837844848633, 'learning_rate': 1.4648437500000001e-05, 'epoch': 0.0}
{'loss': 8.4838, 'grad_norm': 69.64332580566406, 'learning_rate': 1.953125e-05, 'epoch': 0.0}
{'loss': 8.3629, 'grad_norm': 44.6363525390625, 'learning_rate': 2.44140625e-05, 'epoch': 0.0}
{'loss': 8.2015, 'grad_norm': 52.63124084472656, 'learning_rate': 2.9296875000000002e-05, 'epoch': 0.0}
0%| | 344/407348 [04:03<79:03:03, 1.43it/s]
Traceback (most recent call last):
File "/root/t5/./pre_train.py", line 144, in
pre_train(config)
File "/root/t5/./pre_train.py", line 127, in pre_train
trainer.train(
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/trainer.py", line 1859, in train return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/accelerate/utils/memory.py", line 140, in deco rator
raise RuntimeError("No executable batch size found, reached zero.")

This is the run log from a single A40 (48 GB) card. It looks like the GPU memory is not being released.
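The halving pattern in the log (the per-device batch being adjusted from 8 down to 4, 2 and 1 before the crash) comes from the batch-size auto-reduction in accelerate, which transformers uses when `TrainingArguments(auto_find_batch_size=True)` is set. Below is a minimal sketch of that mechanism, not the project's code; `train_attempt` and its body are placeholders.

```python
# Sketch of the retry loop behind the traceback above. The Trainer wraps its
# inner training loop with accelerate's find_executable_batch_size, which
# halves the batch size after every CUDA OOM and raises once it reaches zero.
from accelerate.utils import find_executable_batch_size


@find_executable_batch_size(starting_batch_size=16)
def train_attempt(batch_size):
    # Placeholder: build the dataloader with `batch_size` and run one training
    # attempt here. If a CUDA out-of-memory error escapes, the decorator frees
    # cached CUDA memory (gc.collect + torch.cuda.empty_cache), halves
    # `batch_size` (16 -> 8 -> 4 -> 2 -> 1) and calls this function again,
    # which is why the log prints a fresh "Running training" block with a
    # smaller batch each time. Once the batch size reaches zero it raises
    # RuntimeError("No executable batch size found, reached zero.").
    print(f"trying batch_size={batch_size}")


train_attempt()
```

So the repeated "Running training" blocks are retries of the same run, and the final RuntimeError simply means every batch size down to 1 still hit OOM.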

charent commented 3 months ago

Check whether your dataset contains empty samples, i.e. entries whose text length is 0. "No executable batch size found" should not normally appear.
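A quick way to run that check, assuming the pretraining data can be loaded with the `datasets` library; the file path and the `prompt`/`response` column names below are placeholders, adjust them to the actual parquet schema.

```python
# Count and drop zero-length samples before pretraining (illustrative sketch;
# "data/my_train.parquet", "prompt" and "response" are placeholder names).
from datasets import load_dataset

dataset = load_dataset("parquet", data_files="data/my_train.parquet", split="train")


def is_non_empty(example):
    # Keep only rows where both text fields exist and are not blank.
    prompt = (example.get("prompt") or "").strip()
    response = (example.get("response") or "").strip()
    return len(prompt) > 0 and len(response) > 0


clean = dataset.filter(is_non_empty)
print(f"total: {len(dataset)}, empty or blank: {len(dataset) - len(clean)}")

# Write the cleaned split back out if any bad rows were found.
if len(clean) < len(dataset):
    clean.to_parquet("data/my_train.clean.parquet")
```

Running this once over each training file before launching pre_train.py makes it easy to confirm whether zero-length rows are present.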