alexandrosstergiou / SoftPool

[ICCV 2021] Code for approximated exponential maximum pooling
MIT License

GPU and CPU get different results #31

Closed JasperKirk closed 3 years ago

JasperKirk commented 3 years ago

Thank you for your excellent work, but I have a question. I trained the model on GPU with the SoftPool2d parameter force_inplace left at its default value of False. However, the results I measure at test time differ between GPU and CPU, and the CPU mAP is much lower than the GPU mAP. If I instead set force_inplace=True at test time, the GPU mAP becomes equal to the CPU mAP, which is much lower than the previous GPU result (force_inplace=False).

alexandrosstergiou commented 3 years ago

Hi @JasperKirk ,

So the soft_pool2d() function includes two implementations at the moment. The first is the one that uses the class CUDA_SOFTPOOL2d and the CUDA code from softpool_cuda. The second one is the in-place version based on avg_pool2d that can be used for CPU or GPU (if you want to run everything in-place). If you are using force_inplace=True, you are essentially using the CPU version implemented with standard PyTorch functions.
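
Roughly, the in-place/pythonic idea is to weight each activation by its softmax within the pooling window, which can be expressed with avg_pool2d alone (a simplified sketch, not the exact code in the repo):

  import torch
  import torch.nn.functional as F

  def soft_pool2d_py(x, kernel_size=2, stride=2):
      # Each activation is weighted by exp(x), normalised over its pooling window.
      e_x = torch.exp(x)
      # avg_pool2d divides numerator and denominator by the same window size,
      # so the normalisation cancels in the ratio.
      num = F.avg_pool2d(x * e_x, kernel_size, stride=stride)
      den = F.avg_pool2d(e_x, kernel_size, stride=stride)
      return num / den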

I have not tested the second implementation, but I am not surprised that there is indeed some difference in performance. There is a residual error between the two; you can see it by lowering the threshold value in line 13 of test.py. I believe this is mainly due to the following:

  1. There is no way of implementing SoftPool in PyTorch (at least at reasonable speed) without changing the functionality of PyTorch functions such as average pooling or convolutions.
  2. The PyTorch python code does not (currently) do any clipping, so the exponential weights could take zero values (a possible guard is sketched after this list).
  3. In general, in-place operations should be avoided, as the new object essentially occupies the same memory location as the previous one. This is generally fine for forward passes; however, I am not entirely sure how gradient calculations are affected by it. It would be possible to make the python operations non-in-place for the CPU version; however, that would mean significantly larger memory usage and running times (since new objects would need to be both created and stored).
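
On point 2, a simple guard would be to clamp the exponential weights away from zero before pooling, for example by replacing the e_x line in the sketch above (the epsilon value is an arbitrary choice, not something the python code currently does):

  # Hypothetical: keep exp(x) from underflowing to zero for very negative activations.
  e_x = torch.clamp(torch.exp(x), min=1e-6)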

To sum up, I highly suggest using the CUDA function, as it is the one primarily intended to be used in networks.

Out of curiosity, how large is the accuracy margin between the two versions?

Best, Alex

JasperKirk commented 3 years ago

Hi, I used the CUDA function (CUDA_SOFTPOOL2d) for training; the data is PASCAL VOC and the task is object detection. Due to time constraints, I chose 128 pictures for testing. All settings of the code are the same and the only difference is the use of CPU or GPU, but the margin between the two is too large.

GPU: mAP@0.5 = 0.918, mAP@0.5:0.95 = 0.622
CPU: mAP@0.5 = 0.363, mAP@0.5:0.95 = 0.091

So, if I use the CUDA function for training, how can I test on CPU?

JasperKirk commented 3 years ago

I did an experiment:

  import torch
  import torch.nn as nn
  from SoftPool import SoftPool2d

  class SP(nn.Module):
      def __init__(self, k=2):
          super(SP, self).__init__()
          # python implementation on both CPU and GPU
          self.m = SoftPool2d(kernel_size=k, stride=2, force_inplace=True)
          # python implementation on CPU, CUDA implementation on GPU
          self.m2 = SoftPool2d(kernel_size=k, stride=2)

      def forward(self, x):
          x1 = self.m(x)
          x2 = self.m2(x)
          return x1, x2

  x = torch.rand(1, 3, 4, 4)
  print("x:", x)

  device = 'cpu'
  x1 = x.to(device)
  print("x1:", x1)

  device1 = 'cuda:0'    # gpu
  x2 = x.to(device1)
  print("x2:", x2)

  m = SP(2)

  y1, y2 = m(x1)
  print(y1, y2)

  y3, y4 = m(x2)
  print(y3, '\n', y4, '\n')

So x1 == x2, but x1 is on the CPU and x2 is on the GPU. The result is y1 == y2, but y3 != y4.

There is a big gap between the two versions (y3 vs. y4).

alexandrosstergiou commented 3 years ago

Since there are differences between the two implementations (see the first comment), I don't think that training with the custom GPU method and testing with the python version is sensible. If you absolutely want to validate your data on the CPU but also do the training on the GPU, you should set the in-place flag force_inplace=True for both training and testing. That way both CPU and GPU will run the same implementation.
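
For example, something along these lines (a sketch only; the import follows the repo README, and the tolerance is an arbitrary choice):

  import torch
  from SoftPool import SoftPool2d

  # Force the python implementation everywhere so CPU and GPU share the same code path.
  pool = SoftPool2d(kernel_size=2, stride=2, force_inplace=True)

  x = torch.rand(1, 3, 8, 8)
  y_cpu = pool(x)
  y_gpu = pool(x.to('cuda:0')).cpu()

  # Both paths should now agree up to floating-point error.
  print(torch.allclose(y_cpu, y_gpu, atol=1e-6))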

On your second comment: yes, that is correct. y1, y2, and y3 use the python implementation; y4 uses the custom CUDA implementation.

                  CPU                          GPU
  inplace=True    python implementation (y1)   python implementation (y3)
  inplace=False   python implementation (y2)   CUDA implementation (y4)

You can have a look at lines 180--198 if you are still unsure.

(Edit: Ren Tianhe has created a repo in which his implementation of SoftPool is based on the python version (link). So the pythonic implementation should also work, given that the CUDA one is not used for either training or testing.)

Best, Alex

JasperKirk commented 3 years ago

I understand what you said; I thought the same way before, so I ran the python version again, and it seems normal at the moment. At present, maybe this is the only way. But it seems that the python implementation is slower than the CUDA custom implementation on GPU (I am not sure). If this is the case, is it possible to provide a faster version on the GPU?

Finally, thank you very much for your answers.

alexandrosstergiou commented 3 years ago

Unfortunately, a python version for PyTorch is bound to use torch's classes and functions (which will be slower than a native CUDA implementation). There is the alternative of using convolutions instead of avg_pooling; however, compared to a CUDA-native implementation it will still definitely run slower and require more memory.
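
A rough sketch of what the convolution-based variant could look like (not code from the repo; a depthwise convolution with a uniform kernel stands in for avg_pool2d):

  import torch
  import torch.nn.functional as F

  def soft_pool2d_conv(x, k=2, s=2):
      c = x.size(1)
      # Depthwise convolution with an all-ones kernel sums each pooling window.
      w = torch.ones(c, 1, k, k, device=x.device, dtype=x.dtype)
      e_x = torch.exp(x)
      num = F.conv2d(x * e_x, w, stride=s, groups=c)
      den = F.conv2d(e_x, w, stride=s, groups=c)
      return num / den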

In any case, I do think the CUDA version makes more sense to use if you are to include SoftPool as part of your model.

Thank you for your interest and hope the answers helped! :)

Best, Alex

JasperKirk commented 3 years ago

Thank you for your answer, it is very helpful to me. I might choose two models: one using SoftPool on the GPU, and the other using a normal pooling layer on the CPU.

alexandrosstergiou commented 3 years ago

That should also work. Thanks again for your interest! I'll close this issue for now, feel free to open a new one if there are any other problems.

Best, Alex