I've never tried distributed training and I'm not sure what's going on, but it looks like it's something in the transformers lib itself; everything below trainer.train() is transformers code. Maybe something in the trainer config arguments needs to be set up differently? You might look through the hf_args section of the config file and compare it to a distributed example in the transformers lib. I'd also check the transformers issue tracker to see if anyone has had problems training BART with distributed.
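For comparison, here is a minimal sketch of how an hf_args-style block is typically fed into transformers TrainingArguments; the values and paths are illustrative, not taken from the repo's config files:

```python
# Sketch: mapping an hf_args-style dict onto transformers TrainingArguments.
# Values below are placeholders, not the repo's actual settings.
from transformers import TrainingArguments

hf_args = {
    "output_dir": "runs/parse_xfm_bart_base",   # placeholder path
    "per_device_train_batch_size": 8,
    "num_train_epochs": 16,
    "learning_rate": 1e-4,
    # Settings that often matter for multi-GPU (DDP) runs:
    "ddp_find_unused_parameters": False,
    # "fp16": True,  # mixed precision; requires a GPU
}

training_args = TrainingArguments(**hf_args)
print(training_args.parallel_mode)  # shows whether transformers sees a distributed run
```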
Indeed, the error originated from transformers. I will head over there to look for the cause.
Closing. No activity on issue for 1 month.
Hello,
I tried to train Model_Parse_XFM with a BART-base backbone using torch.distributed.launch with nproc_per_node == 2. The error occurred as follows:
The error showed that the call to torch.embedding() caused the exception. I'm not sure what went wrong with BART in that regard; I also tried to train a T5 model and it went fine, and T5 and BART are mostly similar in terms of model architecture.
Has anyone experienced that issue before?
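For reference, here is a minimal sketch (not code from this repo) of the kind of two-process DDP setup that torch.distributed.launch with nproc_per_node=2 creates around a BART model; the script and config names in the comment are placeholders, the facebook/bart-base checkpoint stands in for whatever the config points at, and the torch.embedding lookup mentioned above happens inside the forward call:

```python
# Sketch only: a stripped-down version of what each process does under
#   python -m torch.distributed.launch --nproc_per_node=2 <train_script> <config>
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import BartForConditionalGeneration, BartTokenizerFast

def main():
    # Assumes the launcher exports LOCAL_RANK (newer launchers do; older ones
    # pass a --local_rank argument instead).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").to(device)
    model = DDP(model, device_ids=[local_rank])

    tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
    batch = tok(["a short test sentence"], return_tensors="pt").to(device)

    # The token/position embedding lookups (torch.embedding) run inside this forward call.
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
    out.loss.backward()
    print(f"rank {dist.get_rank()} loss {out.loss.item():.4f}")

if __name__ == "__main__":
    main()
```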