ereday opened this issue 5 years ago
Those changes should be sufficient to enable multi-gpu training in my experience. Is there any other difference (e.g. batch size) between the two runs?
Nope, I did not change any of the variables in the args dictionary.
This is probably a silly question, but did you try this multiple times and receive the same results?
Yes, I ran the code with the same configuration multiple times. There is no difference across runs.
Sorry, I am not sure why this is happening. I recommend that you try the Simple Transformers library, as it supports multi-GPU training by default, and I have used multi-GPU training with that library without any performance degradation.
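For reference, a minimal sketch of what multi-GPU training with Simple Transformers can look like (the model name and the toy DataFrames below are placeholders, not the setup from this issue):

```python
# Minimal sketch: multi-GPU training with Simple Transformers.
# The model name and toy DataFrames are placeholders.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame([["great movie", 1], ["terrible movie", 0]],
                        columns=["text", "labels"])
eval_df = pd.DataFrame([["pretty good", 1], ["not good", 0]],
                       columns=["text", "labels"])

# "n_gpu" tells Simple Transformers how many GPUs to train on.
model = ClassificationModel("bert", "bert-base-uncased",
                            args={"n_gpu": 2, "overwrite_output_dir": True})

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # for binary classification this includes mcc, tp, tn, fp, fn
```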
Hi,
When I run your code on multi-GPU, performance degrades severely (compared to the single-GPU version). To make the code multi-GPU compatible, I've only added 2 lines of code:
model = torch.nn.DataParallel(model)
between your model = model_class.from_pretrained(args['model_name']) and model.to(device) calls, and
loss = loss.mean()
after the loss = outputs[0] line in the train function. Do you have any idea how I can get the same (or similar) performance in the multi-GPU setting? These are the results I got with these two settings:
With multi-GPU training: evaluate_loss = 0.3928874781464829, fn = 116, fp = 81, mcc = 0.5114751200090137, tn = 1291, tp = 136
With single-GPU training: evaluate_loss = 0.39542119007776766, fn = 82, fp = 126, mcc = 0.5465463104769824, tn = 1246, tp = 170
Although the average loss values are similar, the other metrics (especially MCC) differ substantially.
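For completeness, here is a minimal, self-contained sketch of the two changes described above (the model name, device handling, and dummy batch are placeholders, not the actual training setup from this issue):

```python
# Sketch of the two multi-GPU changes: wrap the model in DataParallel and
# average the per-GPU losses. Model name and dummy batch are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "bert-base-uncased"  # stands in for args['model_name']

# return_dict=False keeps tuple-style outputs, matching the outputs[0] usage above
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # change 1: replicate the model across GPUs
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name)
batch = tokenizer(["an example sentence", "another example"],
                  return_tensors="pt", padding=True).to(device)
labels = torch.tensor([0, 1]).to(device)

outputs = model(**batch, labels=labels)
loss = outputs[0]  # under DataParallel this is a vector with one loss per GPU
if isinstance(model, nn.DataParallel):
    loss = loss.mean()  # change 2: reduce to a scalar before backward()
loss.backward()
```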