getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Issue with A10 cards: slight miscalculation when running tests #224

Closed: gdurif closed this issue 2 years ago

gdurif commented 2 years ago

From a CI run (see this one for example):

[KeOps] Generating code for formula Max_SumShiftExpWeight_Reduction(Concat((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),Concat(1,Var(2,3,1))),0) ... OK
F
======================================================================
FAIL: test_invkernel (__main__.PytorchUnitTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unit_tests_pytorch.py", line 406, in test_invkernel
    u.cpu().data.numpy().ravel(), u_.cpu().data.numpy().ravel(), atol=1e-4
AssertionError: False is not true

======================================================================
FAIL: test_softmax (__main__.PytorchUnitTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unit_tests_pytorch.py", line 446, in test_softmax
    c.cpu().data.numpy().ravel(), cc.cpu().data.numpy().ravel(), atol=1e-6
AssertionError: False is not true

----------------------------------------------------------------------
Ran 16 tests in 12.409s

FAILED (failures=2)
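
Side note: the bare "AssertionError: False is not true" message suggests these checks wrap numpy.allclose in assertTrue; a minimal sketch of that pattern (placeholder arrays, not the actual test data) is:

import numpy as np
import unittest

class PytorchUnitTestCase(unittest.TestCase):
    def test_invkernel(self):
        # Placeholder arrays standing in for the KeOps result and the reference.
        u = np.array([1.0, 2.0, 3.0])
        u_ = np.array([1.0, 2.0, 3.0 + 5e-5])
        # assertTrue(np.allclose(...)) only reports "False is not true" on failure;
        # np.testing.assert_allclose would print the mismatching values instead.
        self.assertTrue(np.allclose(u.ravel(), u_.ravel(), atol=1e-4))

if __name__ == "__main__":
    unittest.main()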
joanglaunes commented 2 years ago

Hi @gdurif, OK, there are several things involved here!

gdurif commented 2 years ago

Regarding the first point, it is strange indeed. I don't know how unittest works. I'll do some digging.

A solution (but maybe too much work) would be to switch from unittest to pytest, which has nice features for managing multiple tests, dependencies between tests, and test parametrization (which is currently handled by hand in the test files and in the Jenkinsfile).

@joanglaunes @bcharlier what do you think ?
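
As a rough illustration (a sketch only, with a hypothetical test name and a Gaussian-kernel reduction rather than the existing unit_tests_pytorch.py cases), parametrization over dtype and device could look like:

import pytest
import torch
from pykeops.torch import LazyTensor

@pytest.mark.parametrize("dtype", [torch.float32, torch.float64])
@pytest.mark.parametrize("device", ["cpu", "cuda"])
def test_gaussian_reduction(dtype, device):
    # Parametrization replaces the dtype/device loops currently written by hand.
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("no CUDA device available")
    x = torch.randn(100, 3, dtype=dtype, device=device)
    y = torch.randn(200, 3, dtype=dtype, device=device)
    # KeOps symbolic computation of sum_j exp(-|x_i - y_j|^2) ...
    D_ij = ((LazyTensor(x[:, None, :]) - LazyTensor(y[None, :, :])) ** 2).sum(-1)
    res = (-D_ij).exp().sum(dim=1)
    # ... checked against a dense PyTorch baseline.
    D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    ref = (-D).exp().sum(dim=1, keepdim=True)
    atol = 1e-4 if dtype == torch.float32 else 1e-8
    assert torch.allclose(res, ref, atol=atol)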

Regarding the second point, I think the CUDA tests are now run on oban (according to the Jenkins dashboard, oban is the only agent with the tag cuda).

joanglaunes commented 2 years ago

OK, I fixed the first point in master now. About the second point, I identified a weird bug, unrelated to KeOps, when using PyTorch v1.10.2+cu113 (the latest version, I think) with the A10 cards on oban: nothing crashes, but some computations are slightly incorrect. So I discarded the A10 cards in the Jenkinsfile and everything works. This is a temporary fix, but the main problem has nothing to do with KeOps. About using pytest, I agree we should do that!

gdurif commented 2 years ago

I renamed this issue to refer to the A10 card problem.

Regarding the switch to pytest (and other testing, deployment, and CI improvements), I created a dedicated project.

bcharlier commented 2 years ago

The flag torch.backends.cuda.matmul.allow_tf32 = False has been added to the test as documented in https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
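
For context, TF32 is a reduced-precision mode used by Ampere GPUs (such as the A10) for float32 matrix products, and PyTorch enabled it by default for matmul in the versions mentioned above. A small sketch of its effect (error magnitudes are indicative, not measurements from oban):

import torch

# Compare float32 matmul accuracy with TF32 on and off (requires an Ampere GPU).
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = (a.double() @ b.double()).float()  # float64 reference result

torch.backends.cuda.matmul.allow_tf32 = True
err_tf32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False  # the flag now set in the KeOps tests
err_fp32 = (a @ b - ref).abs().max().item()

# On Ampere cards, err_tf32 is typically a few orders of magnitude larger than err_fp32,
# which is consistent with "slightly incorrect" results at tight tolerances.
print(err_tf32, err_fp32)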