BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0
11.99k stars 825 forks

Question: error when training RWKV-4-Pile-3B-20221008-8023 #209

Open XxSuper opened 7 months ago

XxSuper commented 7 months ago

Environment: torch==2.0.0+cu118, torchvision==0.15.1+cu118, torchaudio==2.0.1+cu118, deepspeed 0.12.4, pytorch-lightning 2.1.2. Training fails with this error: AttributeError: 'MyDataset' object has no attribute 'global_rank'

BlinkDL commented 7 months ago

IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.2 torch 1.13.1+cu117
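To check whether an environment matches these pins before launching a run, a small script along the following lines could help. It is only a sketch: the pinned versions are copied from the comment above, while the helper names `installed_version` and `find_mismatches` are made up for illustration and are not part of RWKV-LM.

```python
from importlib import metadata

# Versions copied from the maintainer's comment; everything else here is illustrative.
PINNED = {
    "deepspeed": "0.7.0",
    "pytorch-lightning": "1.9.2",
    "torch": "1.13.1+cu117",
}


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: wanted {wanted}, found {found or 'not installed'}")
```

With the versions from the original report (deepspeed 0.12.4, pytorch-lightning 2.1.2, torch 2.0.0+cu118), all three packages would be listed as mismatches.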

XxSuper commented 6 months ago

IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.2 torch 1.13.1+cu117

Thanks for the guidance. The issue above is resolved, but training now crashes when running on two GPUs. What could be causing this?

Loading extension module fused_adam...
Time to load fused_adam op: 2.221088409423828 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 2.2072091102600098 seconds
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
Bus error (core dumped)

BlinkDL commented 6 months ago

I need to see the specific error. Please post the complete output.
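A crash such as `Bus error (core dumped)` ends the process without a Python traceback, so the log alone says little. One way to collect more detail is Python's standard `faulthandler` module, which can dump the Python stack when a fatal signal (including SIGBUS) arrives. The sketch below is an assumption about how this could be wired in, not something from the RWKV-LM code base. The function name `enable_crash_traces` and the log file naming are hypothetical.

```python
import faulthandler
import os


def enable_crash_traces(log_dir="."):
    """Write Python tracebacks to a per-process file when a fatal signal arrives.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    Using the PID in the file name keeps the output of each training process separate.
    """
    path = os.path.join(log_dir, f"crash_trace_{os.getpid()}.log")
    log_file = open(path, "w")
    # The file must stay open for as long as the handler is enabled.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces()` near the top of the training entry script, and keeping the returned file object alive, should leave one trace file per process that can be attached to the issue along with the full console output.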