Closed by gdurif 2 years ago
Hi @gdurif, OK, there are several things involved here!
`Jenkinsfile` because the overall CI run is marked as passing while some tests fail.
Regarding the first point, it is strange indeed. I don't know how `unittest` works. I'll do some digging.
A solution (but maybe too much work) would be to switch from `unittest` to `pytest`, which has nice features to manage multiple tests, dependencies between tests, and test parametrization (which is now handled by hand in the test files and in the `Jenkinsfile`).
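To make the parametrization point concrete, here is a minimal sketch of how `pytest.mark.parametrize` can replace hand-written loops over test configurations. The function `pairwise_sq_dist_sum` and the parameter names `n` and `scale` are hypothetical stand-ins, not part of KeOps or its test suite.

```python
import math

import pytest


def pairwise_sq_dist_sum(xs, ys):
    """Hypothetical operation under test: sum of (x - y)^2 over all pairs."""
    return sum((x - y) ** 2 for x in xs for y in ys)


# Stacked decorators make pytest run the test once per (n, scale) combination,
# so the configurations no longer need to be enumerated by hand.
@pytest.mark.parametrize("n", [1, 10, 100])
@pytest.mark.parametrize("scale", [0.5, 2.0])
def test_pairwise_sq_dist_sum(n, scale):
    xs = [scale * i for i in range(n)]
    ys = [scale * j for j in range(n)]
    # Closed-form reference obtained by expanding the square:
    # sum_{i,j} (x_i - y_j)^2 = n * sum(x^2) + n * sum(y^2) - 2 * sum(x) * sum(y)
    expected = (
        n * sum(x * x for x in xs)
        + n * sum(y * y for y in ys)
        - 2 * sum(xs) * sum(ys)
    )
    assert math.isclose(
        pairwise_sq_dist_sum(xs, ys), expected, rel_tol=1e-9, abs_tol=1e-9
    )
```

Running `pytest` on a file containing this test reports each parameter combination as a separate test case, so a single failing configuration is visible in the CI output instead of being hidden inside a loop.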
@joanglaunes @bcharlier what do you think?
Regarding the second point, I think the CUDA tests are now run on oban (according to the Jenkins dashboard, oban is the only agent with the tag `cuda`).
OK, I have now fixed the first point in master. As for the second point, I identified a weird bug, unrelated to KeOps, when using PyTorch v1.10.2+cu113 (the latest version, I think) with the A10 cards on oban. Nothing crashes, but some computations are slightly incorrect. So I excluded the A10 cards in the `Jenkinsfile` and everything works. This is a temporary fix, but the main problem has nothing to do with KeOps.
About using `pytest`, I agree we should do that!
I renamed this issue to focus on the A10 card problem.
Regarding the switch to `pytest` (and other testing, deployment, and CI improvements), I created a dedicated project.
The flag `torch.backends.cuda.matmul.allow_tf32 = False` has been added to the test, as documented in https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
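For context on why this flag matters: TF32 keeps float32's exponent range but only 10 mantissa bits, so matrix products computed with it on Ampere cards such as the A10 can differ slightly from full float32 results without anything crashing. The sketch below is a rough pure-Python emulation of that precision loss, not PyTorch's or the hardware's actual implementation. The helper `to_tf32` is hypothetical and assumes plain truncation of the low mantissa bits, whereas real hardware rounding may differ.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 input precision: keep 10 of float32's 23 mantissa bits.

    Assumption: plain truncation; actual hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 lowest mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(xs, ys, cast=float):
    """Dot product with each input passed through `cast` before multiplying."""
    return sum(cast(x) * cast(y) for x, y in zip(xs, ys))


xs = [1.0 + i / 1000.0 for i in range(100)]
ys = [1.0 - i / 1000.0 for i in range(100)]

exact = dot(xs, ys)                  # double-precision reference
reduced = dot(xs, ys, cast=to_tf32)  # inputs truncated to TF32-like precision
rel_err = abs(exact - reduced) / abs(exact)
print(f"relative error with TF32-like inputs: {rel_err:.2e}")
```

An error of this size can be enough to fail a test that compares against a float32 reference with a tight tolerance, which is consistent with the "slightly incorrect" computations reported on the A10 cards.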
From CI run (see this one for example)