Xflick / EEND_PyTorch

A PyTorch implementation of End-to-End Neural Diarization
MIT License

error when using Multi-GPU (nn.DataParallel) #4

Open BongkiLee opened 3 years ago

BongkiLee commented 3 years ago

I set gpu: 4 in train.yaml to train with multiple GPUs, but I got the error below:
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))

How can I avoid the error? The full error message is as follows.

Traceback (most recent call last):
  File "eend/bin/train.py", line 63, in <module>
    train(args)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/train.py", line 142, in train
    loss, label = batch_pit_loss(output, t)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in batch_pit_loss
    losses, labels = zip(*loss_w_labels)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in <listcomp>
    losses, labels = zip(*loss_w_labels)
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in pit_loss
    losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
  File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in <listcomp>
    losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
  File "/home/VI251703/.conda/envs/pytorch_eend/lib/python3.7/site-packages/torch/nn/functional.py", line 2827, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([125, 2]))
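
For context, the ValueError itself comes from F.binary_cross_entropy_with_logits, which requires the prediction and target tensors to have identical shapes. A minimal snippet (shapes copied from the log above) reproduces the same message:

```python
import torch
import torch.nn.functional as F

pred = torch.randn(125, 2)                       # prediction with time = 125
target = torch.randint(0, 2, (500, 2)).float()   # target still has time = 500

# Raises: ValueError: Target size (torch.Size([500, 2])) must be the same
# as input size (torch.Size([125, 2]))
loss = F.binary_cross_entropy_with_logits(pred, target)
```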

Xflick commented 3 years ago

Does the error happen when only a single GPU is used? Also, rather than a multi-GPU support problem, this looks more like a data-format problem: the pit_loss() function takes tensors shaped (time, speaker_num) as input. Have you checked whether the time dimension of your data matches?
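
To make that shape contract concrete, here is a minimal permutation-invariant loss sketch. It is not the repository's loss.py, only an illustration of the (time, speaker_num) expectation:

```python
import itertools
import torch
import torch.nn.functional as F

def pit_loss_sketch(pred, label, label_delay=0):
    """Minimal permutation-invariant BCE sketch (not the repo's exact loss.py).

    pred:  (time, n_speakers) logits
    label: (time, n_speakers) 0/1 speaker activities as floats
    Both tensors must agree on the time dimension (up to label_delay).
    """
    n_speakers = label.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_speakers)):
        permuted = label[:, list(perm)]
        losses.append(F.binary_cross_entropy_with_logits(
            pred[label_delay:], permuted[:permuted.shape[0] - label_delay]))
    # Keep the best speaker assignment.
    return torch.stack(losses).min()
```

With matching time dimensions this runs; with a (125, 2) prediction against a (500, 2) label it raises exactly the ValueError from the traceback.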

BongkiLee commented 3 years ago

There is no error when using a single GPU. When using multiple GPUs, it seems that nn.DataParallel divides the input data by the number of GPUs, processes each chunk, and then merges the results.

1) With the number of GPUs set to 2:
   ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([250, 2]))
2) With the number of GPUs set to 4:
   ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([125, 2]))
3) With the number of GPUs set to 5:
   ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([100, 2]))
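
The 500 → 250 / 125 / 100 pattern is exactly what you would get if nn.DataParallel scattered a (500, 2) tensor along dim 0, its default split dimension. A minimal probe illustrating that behaviour (ShapeProbe is a made-up module, not part of this repository, and the snippet assumes a machine with at least 4 visible GPUs):

```python
import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    """Toy module that just reports the shape each replica receives."""
    def forward(self, x):
        print("replica on", x.device, "got", tuple(x.shape))
        return x

# A single chunk of 500 frames x 2 speakers, with no leading batch dimension.
x = torch.randn(500, 2).cuda()

model = nn.DataParallel(ShapeProbe(), device_ids=[0, 1, 2, 3])
# nn.DataParallel scatters inputs along dim 0, so each replica sees roughly
# (125, 2): the time axis gets split because it happens to be dim 0 here.
y = model(x)
```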

Xflick commented 3 years ago

It is weird that nn.DataParallel affects the time dimension; it is only supposed to chunk minibatches. Have you printed the shapes of (ys, ts) in the batch_pit_loss() function? That may help us dig into the problem.
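
For what it's worth, a small debug helper makes that check easy. dump_shapes is a made-up name, and ys/ts are the variables already passed to batch_pit_loss() in the traceback:

```python
def dump_shapes(ys, ts):
    """Debug helper: print the shape of every (prediction, target) pair."""
    for i, (y, t) in enumerate(zip(ys, ts)):
        note = "" if y.shape[0] == t.shape[0] else "   <-- time mismatch"
        print(f"sample {i}: ys {tuple(y.shape)}  ts {tuple(t.shape)}{note}")
```

Calling dump_shapes(ys, ts) at the top of batch_pit_loss() would show whether the time dimensions already disagree before the loss is computed.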

BongkiLee commented 3 years ago

Attached is a screen capture showing the dimensions of ys and ts when using 3 GPUs. [screenshot]

The error message is as follows:
ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([167, 2]))

Xflick commented 3 years ago

So when the error occurs, the shapes of ys and ts match? And do you have any idea why the code can run for the first 25 steps? Does the 26th batch have some irregular samples?

BongkiLee commented 3 years ago

When the error occurred, the shapes of ys and ts were not the same. [screenshot]

The point where the error occurs differs between runs; in other words, it happens not only at step 25 but also at other steps. Does this code run without errors in your multi-GPU environment?

Xflick commented 3 years ago

Currently I have no access to the server, so I cannot actually test the code. I have only tested the single-GPU pipeline, and since the bottleneck is data processing on the CPU, multiple GPUs can barely speed up the whole training process.

speaker-lover commented 3 years ago

Have you solved the multi-GPU running problem? I have the same problem.

Xflick commented 3 years ago

No progress yet. I currently have no data or machine to reproduce the problem, so there may not be a fix in the near future. I hope the community can help.

speaker-lover commented 3 years ago

Thank you very much.

czecze123 commented 2 years ago

This problem is due to the unequal tensor sizes. You can change the way the data is extracted, or fix the dataloader, to solve it.
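
One possible direction along those lines (a sketch under the assumption that the mismatch comes from nn.DataParallel splitting per-sample (time, ...) tensors along dim 0; it is not a tested fix for this repository): pad each batch so every tensor handed to the model has an explicit leading batch dimension, and mask the padding out of the loss.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_padded(batch):
    """batch: list of (features, labels) pairs, each shaped (time, ...).

    Pads to a common length so the collated tensors are (batch, max_time, ...);
    nn.DataParallel's dim-0 scatter then splits samples instead of frames."""
    xs, ts = zip(*batch)
    lengths = torch.tensor([x.shape[0] for x in xs])
    xs = pad_sequence(list(xs), batch_first=True)   # (B, T_max, feat_dim)
    ts = pad_sequence(list(ts), batch_first=True)   # (B, T_max, n_speakers)
    return xs, ts, lengths
```

Passing collate_fn=collate_padded to the DataLoader is the usual hook for this, and the lengths tensor can then be used to exclude padded frames from the loss.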