fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

https://gitee.com/fastnlp/fastNLP

Apache License 2.0

3.07k stars, 448 forks

Multi-GPU training with fastNLP fails with a tensor dimension mismatch. How can I fix it? #331

Closed yysirs closed 4 years ago

yysirs commented 4 years ago

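Training on a single GPU used to work fine, but for some reason I have been hitting OOM over the last two days. I tried switching to multiple GPUs, but at runtime it keeps reporting that the tensor dimensions are wrong. After looking into it, I found that multi-GPU training splits the batch along a dimension, which causes the mismatch. How should I fix this?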

yysirs commented 4 years ago

The error reported: RuntimeError: The size of tensor a (94) must match the size of tensor b (214) at non-singleton dimension 1

xuyige commented 4 years ago

Hello, are you using DistTrainer or Trainer for multi-GPU training? Also, could you share the training code around the part that raises the error?

yysirs commented 4 years ago

    if args.status == 'train':
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model, device_ids=[0, 1])
        model.to(device)
        trainer = Trainer(datasets['train'], model, optimizer, loss, args.batch,
                          n_epochs=args.epoch,
                          dev_data=datasets['dev'],
                          metrics=metrics,
                          callbacks=callbacks, dev_batch_size=args.test_batch,
                          test_use_tqdm=False, check_code_level=-1,
                          update_every=args.update_every,
                          save_path="./model")

        trainer.train()

yysirs commented 4 years ago

I am using Trainer.

xuyige commented 4 years ago

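I suggest removing the `model = nn.DataParallel(...)` part and instead passing `device=[0, 1] if torch.cuda.device_count() > 1 else [0]` when initializing the trainer. The `device` argument of the Trainer controls which devices the model and the data are placed on.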

yysirs commented 4 years ago

    if args.status == 'train':

        model = nn.DataParallel(model, device_ids=[0, 1])

        device = [0, 1] if torch.cuda.device_count() > 1 else [0]
        trainer = Trainer(datasets['train'], model, optimizer, loss, args.batch,
                          n_epochs=args.epoch,
                          dev_data=datasets['dev'],
                          metrics=metrics,
                          callbacks=callbacks, dev_batch_size=args.test_batch,
                          test_use_tqdm=False, check_code_level=-1,
                          update_every=args.update_every,
                          save_path="./model")

        trainer.train()

xuyige commented 4 years ago

Pass `device` as an argument when initializing the Trainer, for example:

    if args.status == 'train':

        model = nn.DataParallel(model, device_ids=[0, 1])

        device = [0, 1] if torch.cuda.device_count() > 1 else [0]
        trainer = Trainer(datasets['train'], model, optimizer, loss, args.batch,
                          n_epochs=args.epoch,
                          dev_data=datasets['dev'],
                          metrics=metrics,
                          callbacks=callbacks, dev_batch_size=args.test_batch,
                          test_use_tqdm=False, check_code_level=-1,
                          update_every=args.update_every,
                          save_path="./model", device=device)

yysirs commented 4 years ago

RuntimeError: The size of tensor a (94) must match the size of tensor b (214) at non-singleton dimension 1. It still doesn't seem to work.

xuyige commented 4 years ago

In that case the problem is probably inside the model's forward pass. You can print the shapes of the relevant tensors yourself to check.

yysirs commented 4 years ago

But training on other corpora with a single GPU works. It only fails once I switch to multiple GPUs.

xuyige commented 4 years ago

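What about training on other corpora with multiple GPUs? Could it be a data preprocessing problem, or a sampler problem? Are you using the sampler that ships with fastNLP?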

yysirs commented 4 years ago

Multi-GPU training on other corpora fails as well, with the same tensor dimension mismatch.

yysirs commented 4 years ago

I am using the sampler that ships with fastNLP.

yhcc commented 4 years ago

Do you pass in a one-dimensional input such as seq_len and then convert it into a mask with something like seq_len_to_mask()?

yysirs commented 4 years ago

Yes, I pass in max_seq_len for the positional encoding.

yhcc commented 4 years ago

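Then that is probably the cause. For example, if max_seq_len is [5, 4, 3, 7, 8, 9] and the batch is split across two GPUs, it becomes [5, 4, 3] and [7, 8, 9]. The longest sequence on the first GPU is then 5, but your words input is padded to length 9, so the lengths on the first GPU no longer match.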
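The split described above can be reproduced without a GPU. The following is a minimal pure-Python simulation, not fastNLP or PyTorch code: `scatter` and `mask_from_seq_len` are hypothetical helpers that stand in for the way a batch is cut along dimension 0 and for a mask whose width is taken from the largest length in the chunk.

```python
def scatter(batch, n_devices):
    """Split a batch along dimension 0 into contiguous chunks, one per device.

    Hypothetical stand-in for the per-device split; not a library function.
    """
    chunk = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


def mask_from_seq_len(seq_len, max_len=None):
    """Build a 0/1 mask; the width defaults to the longest length in this chunk."""
    width = max(seq_len) if max_len is None else max_len
    return [[1 if i < n else 0 for i in range(width)] for n in seq_len]


seq_len = [5, 4, 3, 7, 8, 9]
padded_len = max(seq_len)  # the whole batch was padded to 9 before the split
words = [[1] * n + [0] * (padded_len - n) for n in seq_len]

for dev, (w, s) in enumerate(zip(scatter(words, 2), scatter(seq_len, 2))):
    mask = mask_from_seq_len(s)
    print(f"device {dev}: words width = {len(w[0])}, mask width = {len(mask[0])}")
```

On the first device the words chunk keeps its padded width of 9 while the mask is only 5 wide, which is the kind of mismatch the error message reports. Passing `max_len=9` to the helper makes both widths agree.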

yysirs commented 4 years ago

So how should I fix it? By turning max_seq_len into [[5, 4, 3, 7, 8, 9],[5, 4, 3, 7, 8, 9]]?

yhcc commented 4 years ago

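Well, that depends on the specific model, so you will need to adapt it accordingly. What you wrote probably won't work, though, because the batch dimension would no longer line up.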

yysirs commented 4 years ago

Thanks, I'll try changing it.

yhcc commented 4 years ago

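Usually, taking the sentence length from the input words as the maximum length, rather than the largest number in seq_len, solves this problem. See whether that works for you.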

yysirs commented 4 years ago

I already compute max_seq_len as the maximum taken from the words: `max_seq_len = max(*map(lambda x: max(x['seq_len']), datasets.values()))`

yhcc commented 4 years ago

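I meant inside forward. How exactly to write it depends on the specific model, but the idea is that max_seq_len in forward should be taken from the shape of the words passed in (which should be a batch_size x max_len tensor).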
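A minimal sketch of that idea in plain PyTorch, assuming a model whose forward receives `words` of shape (batch_size, max_len) and a one-dimensional `seq_len`. `ToyEncoder` and `lengths_to_mask` are hypothetical names used only for illustration; they are neither fastNLP APIs nor the poster's model.

```python
import torch
from torch import nn


def lengths_to_mask(seq_len: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a (batch, max_len) bool mask that is True on real tokens."""
    positions = torch.arange(max_len, device=seq_len.device)
    return positions.unsqueeze(0) < seq_len.unsqueeze(1)


class ToyEncoder(nn.Module):
    """Hypothetical model, used only to show where max_len comes from."""

    def __init__(self, vocab_size: int = 100, dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, words: torch.Tensor, seq_len: torch.Tensor) -> torch.Tensor:
        # Take the padded length from the tensor this replica actually received,
        # not from seq_len.max(), which can be smaller after the batch is split.
        max_len = words.size(1)
        mask = lengths_to_mask(seq_len, max_len)
        hidden = self.embed(words)  # (batch, max_len, dim)
        return hidden * mask.unsqueeze(-1).to(hidden.dtype)


model = ToyEncoder()
words = torch.zeros(3, 9, dtype=torch.long)  # one replica's share, padded to 9
seq_len = torch.tensor([5, 4, 3])            # longest sequence here is only 5
out = model(words, seq_len)
print(out.shape)
```

Because the mask width follows `words.size(1)`, it stays at 9 on every replica regardless of which lengths end up in that replica's chunk.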

yysirs commented 4 years ago

OK, I'll think it over some more.

yysirs commented 4 years ago

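Thanks, the problem is solved. After reading the forward code in the model, I found it was indeed a padding issue. With `mask = seq_len_to_mask(seq_len, max_len=max_seq_len)`, that is, setting max_len to max_seq_len, it works.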