Question: error when training RWKV-4-Pile-3B-20221008-8023

BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

Apache License 2.0


IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.2 torch 1.13.1+cu117
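Before launching training, it may help to confirm that the environment actually matches these pins. The following is only a minimal sketch: the helper names (`installed_version`, `matches_pin`, `check_environment`) are hypothetical and not part of RWKV-LM, and it assumes the CUDA build tag on torch (`+cu117`) can be ignored when comparing version numbers.

```python
from importlib import metadata

# Pins quoted from the note above. The "+cu117" build tag on torch is
# stripped before comparison (an assumption of this sketch).
PINNED = {
    "deepspeed": "0.7.0",
    "pytorch-lightning": "1.9.2",
    "torch": "1.13.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if the installed version equals the pin, ignoring a local '+...' suffix."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


def check_environment(pins=PINNED):
    """Map each pinned package to a tuple (installed_version, matches)."""
    report = {}
    for package, pinned in pins.items():
        found = installed_version(package)
        report[package] = (found, matches_pin(found, pinned))
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={found} pinned={PINNED[package]} [{status}]")
```

Running the script in the same container or conda environment used for training prints one line per package; any MISMATCH line points at a package to reinstall at the pinned version.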

Thanks for the guidance. The problem above is solved, but training now crashes when running on two GPUs. What could be causing this?

Loading extension module fused_adam...
Time to load fused_adam op: 2.221088409423828 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 2.2072091102600098 seconds
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  warnings.warn(
Bus error (core dumped)
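The log does not say what triggered the bus error. One possibility worth ruling out, and this is an assumption rather than a diagnosis from the thread, is that the shared-memory mount is too small for multi-process training; the `/opt/conda` path suggests a container, where `/dev/shm` is often small by default. A minimal sketch for inspecting it follows; the function name `shared_memory_report` and the 1 GiB threshold are hypothetical choices for illustration.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm", min_free_gib=1.0):
    """Report total and free space at `path` in GiB, or None if it is not a directory.

    `min_free_gib` is an arbitrary illustrative threshold, not a value
    taken from RWKV-LM, DeepSpeed, or PyTorch.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    free_gib = usage.free / gib
    return {
        "total_gib": usage.total / gib,
        "free_gib": free_gib,
        "low": free_gib < min_free_gib,
    }


if __name__ == "__main__":
    report = shared_memory_report()
    if report is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm total={report['total_gib']:.2f} GiB "
              f"free={report['free_gib']:.2f} GiB low={report['low']}")
```

If the report shows very little space, enlarging the container's shared memory and rerunning is a cheap experiment; if space is ample, the cause lies elsewhere.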

Question: error when training RWKV-4-Pile-3B-20221008-8023 #209