A bug in data_prefetcher

QitaoZhao / ContextAware-PoseFormer

The project is an official implementation of our paper "A Single 2D Pose With Context is Worth Hundreds for 3D Human Pose Estimation".

A bug in data_prefetcher #6

Closed what-is-available-for-name closed 7 months ago

what-is-available-for-name commented 7 months ago

Thank you for your excellent work! However, I found an error (possibly a bug) in your implementation. First, you convert the absolute 3D keypoint ground truth to relative coordinates with keypoints_3d_gt[:, :, 1:] -= keypoints_3d_gt[:, :, :1] followed by keypoints_3d_gt[:, :, 0] = 0 in ContextAware-PoseFormer/ContextPose/mvn/datasets/utils.py, line 44, so the 0th keypoint's coordinates are set to 0. This causes an error when evaluating the results after an epoch: in the P_MPJPE loss in ContextAware-PoseFormer/ContextPose/mvn/models/loss.py, you divide X0 by 0, which generates NaN in the keypoint coordinates and raises an error in np.linalg.svd(H). Could you please tell me how to fix this error?
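To make the report concrete, here is a minimal NumPy sketch of the two steps it describes. It is not the repository's code: the array shape (batch, frames, joints, xyz), the 17-joint skeleton, and the helper names to_root_relative and center_and_norm are assumptions for illustration, and the centering step simply follows the usual Procrustes pattern (subtract the per-pose mean, divide by the norm). It suggests that zeroing joint 0 alone does not make the norm vanish, while a pose that is zero everywhere does.

```python
import numpy as np

def to_root_relative(keypoints_3d_gt):
    """Subtract joint 0 from every other joint, then zero joint 0."""
    out = keypoints_3d_gt.copy()
    out[:, :, 1:] -= out[:, :, :1]
    out[:, :, 0] = 0
    return out

def center_and_norm(keypoints_gt):
    """Centering step before a Procrustes fit; keypoints_gt is (N, J, 3)."""
    muX = keypoints_gt.mean(axis=1, keepdims=True)   # mean over joints
    X0 = keypoints_gt - muX
    normX = np.sqrt((X0 ** 2).sum(axis=(1, 2), keepdims=True))
    return X0, normX

rng = np.random.default_rng(0)
absolute = rng.normal(size=(2, 1, 17, 3))   # assumed (batch, frames, joints, xyz)
relative = to_root_relative(absolute)

# A root-relative pose still has spread around its mean, so normX > 0.
X0_ok, normX_ok = center_and_norm(relative[:, 0])

# A pose that is zero at every joint has normX == 0, and 0 / 0 yields NaN.
X0_bad, normX_bad = center_and_norm(np.zeros((2, 17, 3)))
with np.errstate(divide="ignore", invalid="ignore"):
    normalized_bad = X0_bad / normX_bad
```

Under these assumptions, NaN appears only when an entire ground-truth pose is zero, not merely when its root joint is.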

QitaoZhao commented 7 months ago

https://github.com/QitaoZhao/ContextAware-PoseFormer/blob/a2456578e8cd25f9fd99dacdf81d2e3623ca127b/ContextPose/mvn/models/loss.py#L36-L46

X0 = keypoints_gt - muX should not be zero, as muX is the mean over the joint dimension. We previously found that this issue may happen when running with multiple GPUs. Is that the case for you?

QitaoZhao commented 7 months ago

https://github.com/QitaoZhao/ContextAware-PoseFormer/blob/a2456578e8cd25f9fd99dacdf81d2e3623ca127b/ContextPose/train.py#L75-L85

https://github.com/QitaoZhao/ContextAware-PoseFormer/blob/a2456578e8cd25f9fd99dacdf81d2e3623ca127b/ContextPose/train.py#L109-L119

In our previous case, the error you mentioned could happen if torch.utils.data.distributed.DistributedSampler was used in val_dataloader as it is in train_dataloader. We therefore removed it from val_dataloader in our current implementation, which should already fix the error. Could you please also check this?
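The thread does not say why a DistributedSampler on the validation loader triggered the error. As background only, the pure-Python sketch below emulates how that sampler partitions indices when shuffle=False and drop_last=False: each process receives a strided shard, padded with repeated indices so all shards have equal length. The function name shard_indices is hypothetical, and this is a simplified emulation, not PyTorch's code.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Emulate DistributedSampler index selection (shuffle=False, drop_last=False)."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start so the total divides evenly.
    indices += indices[: total_size - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]

shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

With 10 samples and 4 processes, no single process sees the whole validation set, and samples 0 and 1 are evaluated twice. Evaluation code that expects one process to cover every sample would need to account for that.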

what-is-available-for-name commented 7 months ago

> X0 = keypoints_gt - muX should not be zero as muX is the mean over the joint dimension. We previously found that this issue may happen when running with multiple GPUs. Is that the case for you?

No, I hit this with just one GPU, and my dataloader code is consistent with yours. Still, some entries of normX are equal to 0.

what-is-available-for-name commented 7 months ago

While debugging, I ran print((normX == 0).sum()) and it printed 10000.

QitaoZhao commented 7 months ago

I suppose this happens because some entries of keypoints_gt are all zero. You can print keypoints_gt to check whether this is the case. If so, there might be something wrong with the data processing.
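One way to run the suggested check is sketched below. It assumes keypoints_gt is a NumPy array of shape (N, J, 3); the helper name find_degenerate_poses and the eps threshold are hypothetical. It returns the indices of poses whose centered norm is effectively zero, which are the ones that would produce NaN when normalized.

```python
import numpy as np

def find_degenerate_poses(keypoints_gt, eps=1e-8):
    """Return indices of poses whose joints all coincide (centered norm ~ 0)."""
    muX = keypoints_gt.mean(axis=1, keepdims=True)
    normX = np.sqrt(((keypoints_gt - muX) ** 2).sum(axis=(1, 2)))
    return np.flatnonzero(normX < eps)

# Toy data: five poses, two of which were never filled in and remain zero.
rng = np.random.default_rng(0)
keypoints_gt = np.zeros((5, 17, 3))
keypoints_gt[[0, 2, 4]] = rng.normal(size=(3, 17, 3))

bad = find_degenerate_poses(keypoints_gt)
print("degenerate pose indices:", bad.tolist())
```

Knowing which indices are degenerate, rather than only how many, makes it easier to trace them back to a specific batch or preprocessing step.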

what-is-available-for-name commented 4 months ago

Sorry that I forgot to reply. The bug happened because I didn't train over all batches; it went away on its own once I trained for one complete epoch.

Thanks so much!