facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

Error between cpu/gpu version of roi_align #331

Open dnnspark opened 5 years ago

dnnspark commented 5 years ago

🐛 Bug

There is a small but non-trivial discrepancy between the GPU and CPU implementations of roi_align.

To Reproduce

import torch
from maskrcnn_benchmark.layers.roi_align import _ROIAlign

# _ROIAlign is a torch.autograd.Function; call it through .apply
roi_align = _ROIAlign.apply

def get_random_boxes(num_boxes):
    # random (x1, y1, x2, y2) boxes that fit inside a 480x640 image
    H,W = 480, 640
    x1 = torch.rand(num_boxes) * 0.7*W
    y1 = torch.rand(num_boxes) * 0.7*H
    w = torch.rand(num_boxes) * 0.5*W
    h = torch.rand(num_boxes) * 0.5*H
    x2,y2 = x1+w, y1+h

    return torch.stack([x1,y1,x2,y2], dim=1)

def test_roi_align():
    input = torch.randn(1,256,200,272) * 8.
    # each roi row is (batch_index, x1, y1, x2, y2); batch index 0 for the single image
    rois = torch.cat([torch.zeros(1000,1), get_random_boxes(1000)], dim=1)
    output_size = (7,7)
    spatial_scale = .25
    sampling_ratio = 2

    aligned_cpu = roi_align(input, rois, output_size, spatial_scale, sampling_ratio)
    aligned_gpu = roi_align(input.cuda(), rois.cuda(), output_size, spatial_scale, sampling_ratio)

    if not torch.allclose(aligned_cpu, aligned_gpu.cpu()):
        max_diff = torch.abs(aligned_cpu - aligned_gpu.cpu()).max()
        print('error=%.6f' % max_diff.item())
    else:
        print("test_roi_align: OK")

    return

Expected behavior

The maximum absolute error ranges from 0.0004 to 0.006, depending on the random inputs. I expected something less than 1e-5.
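
Since the input values here have magnitude around 8, it may also help to report the error relative to the size of the outputs. This is a hypothetical addition to test_roi_align(), reusing the variable names from the script above:

    # report relative as well as absolute error between the two outputs
    diff = (aligned_cpu - aligned_gpu.cpu()).abs()
    rel_err = diff.max() / aligned_cpu.abs().max().clamp(min=1e-12)
    print('max abs diff = %.6f, relative = %.2e' % (diff.max().item(), rel_err.item()))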

Environment

PyTorch version: 1.0.0.dev20190110
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: Quadro M1200
Nvidia driver version: 410.79
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

fmassa commented 5 years ago

Hi,

Thanks for opening this issue. This indeed looks like a small accuracy issue. Given that our kernels were taken as is from Detectron, I'd expect it to also be present there.

cc @rbgirshick for visibility

rbgirshick commented 5 years ago

Thanks for flagging. I've actually only used the GPU version. cc @wat3rBro who implemented the CPU version and might be interested in investigating.

wat3rBro commented 5 years ago

Hi @dnnspark, there are indeed numerical differences between the CPU and GPU versions; they likely come from a different order of summation. Another test case is https://github.com/pytorch/pytorch/blob/eb15587c993e7ac9e208ec6986addfb74910581a/caffe2/operators/roi_align_op_gpu_test.cc#L262, which also does not require bitwise equivalence. You mentioned that 1e-5 is reasonable; how was it determined? Have you noticed this difference leading to a regression in the end metrics?
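
As an illustration of the summation-order point, here is a minimal sketch (independent of the roi_align kernels themselves) showing that float32 accumulation is order-dependent, which is the kind of discrepancy to expect when the CPU and GPU kernels visit the sampled bilinear values in different orders:

import torch

torch.manual_seed(0)
x = torch.randn(10000, dtype=torch.float32) * 8.

# Summing the same float32 values in a different order generally gives a
# slightly different result; a float64 sum serves as a higher-precision reference.
sum_forward  = x.sum()
sum_shuffled = x[torch.randperm(x.numel())].sum()
sum_float64  = x.double().sum()

print('forward - shuffled:', (sum_forward - sum_shuffled).item())
print('forward - float64 :', (sum_forward.double() - sum_float64).item())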

dnnspark commented 5 years ago

Hi @wat3rBro, 1e-5 is just what I think is the maximum difference that could arise from numerical error with non-algorithmic causes (e.g. the precision of the data type). I will use a more relaxed criterion in my unit tests as well (which use the CPU version).
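
A minimal sketch of the kind of relaxed check meant here; the helper name and tolerance values are illustrative, not taken from the codebase:

import torch

def assert_roi_align_close(aligned_cpu, aligned_gpu, rtol=1e-3, atol=1e-2):
    # Compare on the CPU so both tensors are on the same device; the
    # tolerances are deliberately looser than the torch.allclose defaults.
    max_diff = (aligned_cpu - aligned_gpu.cpu()).abs().max().item()
    assert torch.allclose(aligned_cpu, aligned_gpu.cpu(), rtol=rtol, atol=atol), \
        'CPU/GPU roi_align outputs differ: max abs diff = %.6f' % max_diff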