Open dnnspark opened 5 years ago
Hi,
Thanks for opening this issue. This indeed looks like a small accuracy issue. Given that our kernels were taken as is from Detectron, I'd expect it to also be present there.
cc @rbgirshick for visibility
Thanks for flagging. I've actually only used the GPU version. cc @wat3rBro who implemented the CPU version and might be interested in investigating.
Hi @dnnspark, there are indeed numerical differences between the CPU and GPU versions; they might come from a different order of summation. Another test case is https://github.com/pytorch/pytorch/blob/eb15587c993e7ac9e208ec6986addfb74910581a/caffe2/operators/roi_align_op_gpu_test.cc#L262, which doesn't enforce bitwise equivalence. You mentioned that 1e-5 is reasonable; how was that determined? Have you noticed this difference leading to a regression in the end metrics?
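To illustrate the summation-order point, here is a minimal sketch in plain Python. It is not the roi_align kernel; `f32` and `sum_f32` are hypothetical helpers that emulate float32 accumulation by rounding after every add, and the inputs are made up for the demonstration.

```python
import math
import random
import struct


def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


# Same three numbers, two orders. The float32 spacing near 1e8 is 8,
# so adding 1.0 to 1e8 is rounded away before the cancellation happens.
order_a = sum_f32([1e8, 1.0, -1e8])  # 0.0
order_b = sum_f32([1e8, -1e8, 1.0])  # 1.0

# On random data the two orders usually land on slightly different results.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
forward = sum_f32(values)
backward = sum_f32(list(reversed(values)))
reference = math.fsum(values)  # correctly rounded double-precision sum

print(order_a, order_b)
print(abs(forward - reference), abs(backward - reference))
```

The first pair shows that reordering alone can change a float32 result. Whether this effect accounts for the full CPU/GPU gap reported here is a separate question that this sketch does not answer.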
Hi @wat3rBro, 1e-5 is just my estimate of the maximum difference that can arise from numerical error with non-algorithmic causes (e.g. the precision of the data type). I will use a more relaxed criterion in my unit tests too (which use the CPU version).
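A relaxed criterion of this kind could look like the sketch below. `all_close` is a hypothetical helper written for this example, the base values are arbitrary, and the two offsets are simply the ends of the error range reported in this issue (0.0004 and 0.006), not new measurements.

```python
import math


def all_close(a, b, atol, rtol=0.0):
    """Return True if every pair of elements agrees within the tolerances."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b)
    )


# Arbitrary stand-ins for CPU outputs, with GPU outputs offset by the
# smallest and largest errors mentioned in the report.
cpu_out = [0.5, 1.25, 2.0]
gpu_out = [0.5 + 4e-4, 1.25 - 6e-3, 2.0]

strict = all_close(cpu_out, gpu_out, atol=1e-5)   # fails the 1e-5 criterion
relaxed = all_close(cpu_out, gpu_out, atol=1e-2)  # passes a looser one

print(strict, relaxed)
```

The value `atol=1e-2` is only an assumption chosen to sit above the reported range; picking the right tolerance depends on whether the difference affects the end metrics.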
🐛 Bug
There is a small but non-trivial difference between the outputs of the GPU and CPU implementations of roi_align.
To Reproduce
Expected behavior
The error ranges from 0.0004 to 0.006, varying from run to run due to the randomness. I expected it to be less than 1e-5.
Environment
PyTorch version: 1.0.0.dev20190110
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: Quadro M1200
Nvidia driver version: 410.79
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect