Closed xinxinxing closed 1 year ago
Hi @xinxinxing, the default parallel mode in fairseq is data parallelism. If you are running on a single node with 4 GPU cards, you don't need to specify `--distributed-world-size`; just remove that argument and run as usual.
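As a sketch, a single-node, 4-GPU data-parallel launch could look like the script below. The dataset path, architecture, and hyperparameters are placeholders, not taken from the original train.sh:

```shell
#!/bin/bash
# Make all 4 local GPUs visible; fairseq-train then applies data
# parallelism across every visible device automatically, so no
# --distributed-world-size argument is needed.
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train data-bin/my-dataset \
    --arch transformer \
    --optimizer adam --lr 0.0005 \
    --max-tokens 4096 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --save-dir checkpoints/my-run
```

Note that with data parallelism the effective batch size scales with the number of GPUs, so results may differ from a single-GPU run unless the learning rate or `--update-freq` is adjusted accordingly.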
How do I use data parallelism in R-Drop? When I use `--distributed-world-size`, the BLEU score is lower than with a single GPU. Here is my train.sh