fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0

Multi-GPU distributed training #461

Open houdawang opened 5 months ago

houdawang commented 5 months ago

Hello. I configured the Trainer with the following parameters:

```python
trainer = Trainer(
    driver="torch",
    train_dataloader=dl["train"],
    evaluate_dataloaders=dl["dev"],
    device=[4, 7],
    callbacks=callback,
    optimizers=optimizer,
    n_epochs=args.epoch,
    accumulation_steps=args.accumulation_steps,
    torch_kwargs={'ddp_kwargs': {'find_unused_parameters': True}},
)
trainer.run()
```

Training does indeed run on both GPUs, but the loss printed during training is NaN, and every metric printed at each epoch is the same value. Where might the problem be?
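For context, a NaN loss with frozen metrics usually means the model diverged in an early step (non-finite inputs, an exploding gradient, or a too-large learning rate), and DDP does not change that diagnosis. Below is a minimal plain-PyTorch debugging sketch, independent of fastNLP's Trainer, for locating where the NaN first appears. It assumes the batch is a dict of tensors and that the model's forward returns a dict with a `loss` entry (as fastNLP-style models typically do); `training_step` and `check_batch` are hypothetical helper names for illustration.

```python
import torch

# Report the backward-pass op that first produced a NaN/Inf.
# Slow; enable for debugging only.
torch.autograd.set_detect_anomaly(True)

def check_batch(batch):
    # Guard against NaN/Inf already present in the input features.
    # Assumes `batch` is a dict mapping field names to tensors.
    for name, tensor in batch.items():
        if torch.is_floating_point(tensor) and not torch.isfinite(tensor).all():
            raise ValueError(f"non-finite values in input field '{name}'")

def training_step(model, batch, optimizer, max_grad_norm=1.0):
    check_batch(batch)
    # Assumes the model returns a dict containing a scalar 'loss'.
    loss = model(**batch)["loss"]
    if not torch.isfinite(loss):
        raise RuntimeError(f"loss became non-finite: {loss.item()}")
    loss.backward()
    # Clip exploding gradients, a common source of NaN divergence.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

If this single-GPU loop trains cleanly on the same data and learning rate, the problem is more likely in how the distributed setup interacts with the model (e.g., the unused parameters that motivated `find_unused_parameters=True`) than in the data itself.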