ereday opened this issue 5 years ago
Those changes should be sufficient to enable multi-gpu training in my experience. Is there any other difference (e.g. batch size) between the two runs?
Nope, I did not change any of the variables in the args dictionary.
This is probably a silly question, but did you try this multiple times and receive the same results?
Yes, I ran the code with the same configuration multiple times. There is no difference across runs.
Sorry, I am not sure why this is happening. I recommend that you try the Simple Transformers library, as it supports multi-GPU training by default, and I have used multi-GPU training with that library without any performance degradation.
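For reference, a minimal sketch of what multi-GPU training with Simple Transformers can look like (the model name and the toy DataFrames below are placeholders, not the setup from this issue):

```python
# Minimal sketch: multi-GPU training with Simple Transformers.
# The model name and toy DataFrames are placeholders.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame([["great movie", 1], ["terrible movie", 0]],
                        columns=["text", "labels"])
eval_df = pd.DataFrame([["pretty good", 1], ["not good", 0]],
                       columns=["text", "labels"])

# "n_gpu" tells Simple Transformers how many GPUs to train on.
model = ClassificationModel("bert", "bert-base-uncased",
                            args={"n_gpu": 2, "overwrite_output_dir": True})

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # for binary classification this includes mcc, tp, tn, fp, fn
```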
Hi,
When I run your code on multi-GPU, performance degrades severely (compared to the single-GPU version). To make the code multi-GPU compatible, I've only added 2 lines of code:
model = torch.nn.DataParallel(model)
between your model = model_class.from_pretrained(args['model_name']) and model.to(device) calls, and
loss = loss.mean()
after the loss = outputs[0] line in the train function. Do you have any idea how I can get the same (or similar) performance in the multi-GPU setting? These are the results I got with these two settings:
With multi-GPU training: evaluate_loss = 0.3928874781464829, fn = 116, fp = 81, mcc = 0.5114751200090137, tn = 1291, tp = 136
With single-GPU training: evaluate_loss = 0.39542119007776766, fn = 82, fp = 126, mcc = 0.5465463104769824, tn = 1246, tp = 170
Although the average loss values are similar, the other metrics (especially MCC) differ substantially.
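For completeness, here is a minimal, self-contained sketch of the two changes described above (the model name, device handling, and dummy batch are placeholders, not the actual training setup from this issue):

```python
# Sketch of the two multi-GPU changes: wrap the model in DataParallel and
# average the per-GPU losses. Model name and dummy batch are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "bert-base-uncased"  # stands in for args['model_name']

# return_dict=False keeps tuple-style outputs, matching the outputs[0] usage above
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # change 1: replicate the model across GPUs
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name)
batch = tokenizer(["an example sentence", "another example"],
                  return_tensors="pt", padding=True).to(device)
labels = torch.tensor([0, 1]).to(device)

outputs = model(**batch, labels=labels)
loss = outputs[0]  # under DataParallel this is a vector with one loss per GPU
if isinstance(model, nn.DataParallel):
    loss = loss.mean()  # change 2: reduce to a scalar before backward()
loss.backward()
```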