BongkiLee opened this issue 3 years ago (status: Open)
Does the error happen when only a single GPU is used? Also, rather than a multi-GPU support problem, this looks more like a data-format problem. The pit_loss() function takes (time, speaker_num)-shaped tensors as input. Have you checked whether the time dimension of your data matches?
There is no error when using a single GPU. When using multiple GPUs, it seems that nn.DataParallel divides the input data across the GPUs, processes each piece, and then merges the results.
1) With 2 GPUs: ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([250, 2]))
2) With 4 GPUs: ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([125, 2]))
3) With 5 GPUs: ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([100, 2]))
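These reported sizes are consistent with the 500-frame time axis being chunked across devices: nn.DataParallel scatters dim 0 of a tensor, and the first replica receives ceil(time / n_gpu) frames. A torch-free sketch of that arithmetic (a hypothetical illustration, not code from the repo):

```python
import math

# If DataParallel scatters a (time, speakers) tensor along dim 0,
# each replica sees only a ceil(time / n_gpu)-frame slice, while the
# unsplit target keeps all 500 frames -- matching the reported sizes.
def first_replica_frames(total_frames, n_gpu):
    return math.ceil(total_frames / n_gpu)

for n_gpu, reported in [(2, 250), (4, 125), (5, 100)]:
    assert first_replica_frames(500, n_gpu) == reported
```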
It is weird that nn.DataParallel affects the time dimension; it is supposed to only chunk minibatches along the batch dimension. Have you printed the shapes of (ys, ts) in the batch_pit_loss() function? That may help us dig into the problem.
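For instance, a tiny helper along these lines (hypothetical, not part of the repo) could be dropped into batch_pit_loss() to log the offending samples:

```python
def shape_mismatches(ys, ts):
    """Return the indices of (prediction, target) pairs whose shapes differ.

    ys, ts: sequences of array-likes exposing .shape (e.g. torch tensors).
    Illustrative debugging helper only.
    """
    return [i for i, (y, t) in enumerate(zip(ys, ts))
            if tuple(y.shape) != tuple(t.shape)]
```

Printing shape_mismatches(ys, ts) at the top of batch_pit_loss() would show whether the time dimension has already been split before the loss is computed.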
I have attached a screen capture showing the dimensions of ys and ts when using 3 GPUs.
The error message is as follows. ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([167, 2]))
So when the error occurs, do the shapes of ys and ts match? And do you have any idea why the code can run for the first 25 steps? Does the 26th batch have any irregular samples?
When the error occurred, the shape of ys and ts was not the same.
The point where the error occurs differs between runs; in other words, it happens not only at step 25 but at other steps as well. Does this code run without errors in your multi-GPU environment?
Currently I have no access to the server, so I cannot actually test the code. I have only tested the single-GPU pipeline, and since the bottleneck is data processing, which runs on the CPU, multiple GPUs can barely speed up the whole training process.
Have you solved the multi-gpu running problems? I also have the same problem.
No progress yet. I currently have no data or machine to reproduce the problem, so there may not be a fix in the near future. I hope the community can help.
Thank you very much.
This problem is due to unequal tensor sizes; you can change the way the data is extracted, or fix the dataloader, to solve it.
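One way to read that suggestion: nn.DataParallel always scatters along dim 0, so the collate step must put the sample axis (not the time axis) first in whatever tensor reaches the replicas. A torch-free sketch of such padding (the helper name and its interface are illustrative, not from the repo):

```python
def pad_batch_first(seqs, pad_row=None):
    """Pad variable-length sequences (lists of per-frame rows) to a common
    length and stack them sample-first, so a dim-0 scatter splits over
    samples instead of over time. Illustrative sketch only."""
    max_len = max(len(s) for s in seqs)
    pad = pad_row if pad_row is not None else [0.0] * len(seqs[0][0])
    return [list(s) + [pad] * (max_len - len(s)) for s in seqs]

# Two samples of 3 and 2 frames, 2 speakers each:
batch = pad_batch_first([[[1, 0], [1, 0], [0, 1]], [[0, 1], [1, 1]]])
# batch is now shaped (n_samples=2, time=3, speakers=2)
```

With the sample axis first, each replica receives whole sequences, and the per-sample time dimension of predictions and targets stays aligned.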
I set gpu: 4 in train.yaml to train with multiple GPUs, but I got the error below. raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
How can I avoid this error? The full error message is as follows.
Traceback (most recent call last):
  File "eend/bin/train.py", line 63, in <module>
    train(args)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/train.py", line 142, in train
    loss, label = batch_pit_loss(output, t)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in batch_pit_loss
    losses, labels = zip(*loss_w_labels)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in <genexpr>
    losses, labels = zip(*loss_w_labels)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in pit_loss
    losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in <listcomp>
    losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
  File "/home/VI251703/.conda/envs/pytorch_eend/lib/python3.7/site-packages/torch/nn/functional.py", line 2827, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([125, 2]))