getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Issue with A10 cards: slight miscalculation when running tests #224

Closed: gdurif closed this issue 2 years ago

gdurif commented 2 years ago

From a CI run (see this one for example):

[KeOps] Generating code for formula Max_SumShiftExpWeight_Reduction(Concat((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),Concat(1,Var(2,3,1))),0) ... OK
F
======================================================================
FAIL: test_invkernel (__main__.PytorchUnitTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unit_tests_pytorch.py", line 406, in test_invkernel
    u.cpu().data.numpy().ravel(), u_.cpu().data.numpy().ravel(), atol=1e-4
AssertionError: False is not true

======================================================================
FAIL: test_softmax (__main__.PytorchUnitTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "unit_tests_pytorch.py", line 446, in test_softmax
    c.cpu().data.numpy().ravel(), cc.cpu().data.numpy().ravel(), atol=1e-6
AssertionError: False is not true

----------------------------------------------------------------------
Ran 16 tests in 12.409s

FAILED (failures=2)
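
Side note: the bare "AssertionError: False is not true" message suggests these checks wrap numpy.allclose in assertTrue; a minimal sketch of that pattern (placeholder arrays, not the actual test data) is:

import numpy as np
import unittest

class PytorchUnitTestCase(unittest.TestCase):
    def test_invkernel(self):
        # Placeholder arrays standing in for the KeOps result and the reference.
        u = np.array([1.0, 2.0, 3.0])
        u_ = np.array([1.0, 2.0, 3.0 + 5e-5])
        # assertTrue(np.allclose(...)) only reports "False is not true" on failure;
        # np.testing.assert_allclose would print the mismatching values instead.
        self.assertTrue(np.allclose(u.ravel(), u_.ravel(), atol=1e-4))

if __name__ == "__main__":
    unittest.main()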
joanglaunes commented 2 years ago

Hi @gdurif, OK, there are several things involved here!

gdurif commented 2 years ago

Regarding the first point, it is strange indeed. I don't know how unittest works. I'll do some digging.

A solution (but maybe too much work) would be to switch from unittest to pytest, which has nice features for managing multiple tests, dependencies between tests, and test parametrization (which is currently handled by hand in the test files and in the Jenkinsfile).

@joanglaunes @bcharlier what do you think ?
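
As a rough illustration (a sketch only, with a hypothetical test name and a Gaussian-kernel reduction rather than the existing unit_tests_pytorch.py cases), parametrization over dtype and device could look like:

import pytest
import torch
from pykeops.torch import LazyTensor

@pytest.mark.parametrize("dtype", [torch.float32, torch.float64])
@pytest.mark.parametrize("device", ["cpu", "cuda"])
def test_gaussian_reduction(dtype, device):
    # Parametrization replaces the dtype/device loops currently written by hand.
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("no CUDA device available")
    x = torch.randn(100, 3, dtype=dtype, device=device)
    y = torch.randn(200, 3, dtype=dtype, device=device)
    # KeOps symbolic computation of sum_j exp(-|x_i - y_j|^2) ...
    D_ij = ((LazyTensor(x[:, None, :]) - LazyTensor(y[None, :, :])) ** 2).sum(-1)
    res = (-D_ij).exp().sum(dim=1)
    # ... checked against a dense PyTorch baseline.
    D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    ref = (-D).exp().sum(dim=1, keepdim=True)
    atol = 1e-4 if dtype == torch.float32 else 1e-8
    assert torch.allclose(res, ref, atol=atol)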

Regarding the second point, I think the CUDA tests are now run on oban (according to the Jenkins dashboard, oban is the only agent with the tag cuda).

joanglaunes commented 2 years ago

OK, I fixed the first point in master now. About the second point, I identified a weird bug, unrelated to KeOps, when using PyTorch v1.10.2+cu113 (the latest version, I think) with the A10 cards on oban: nothing crashes, but some computations are slightly incorrect. So I discarded the A10 cards in the Jenkinsfile and everything works. This is a temporary fix, but the main problem has nothing to do with KeOps. About using pytest, I agree we should do that!

gdurif commented 2 years ago

I renamed this issue to refer to the A10 card problem.

Regarding the switch to pytest (and other testing, deployment, and CI improvements), I created a dedicated project.

bcharlier commented 2 years ago

The flag torch.backends.cuda.matmul.allow_tf32 = False has been added to the test as documented in https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
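
For context, TF32 is a reduced-precision mode used by Ampere GPUs (such as the A10) for float32 matrix products, and PyTorch enabled it by default for matmul in the versions mentioned above. A small sketch of its effect (error magnitudes are indicative, not measurements from oban):

import torch

# Compare float32 matmul accuracy with TF32 on and off (requires an Ampere GPU).
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = (a.double() @ b.double()).float()  # float64 reference result

torch.backends.cuda.matmul.allow_tf32 = True
err_tf32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False  # the flag now set in the KeOps tests
err_fp32 = (a @ b - ref).abs().max().item()

# On Ampere cards, err_tf32 is typically a few orders of magnitude larger than err_fp32,
# which is consistent with "slightly incorrect" results at tight tolerances.
print(err_tf32, err_fp32)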