Open TPolzer opened 8 years ago
I see tests failing randomly with a Quadro M1000M. I ran mvn test multiple times and the failures are not consistent between runs. Any ideas?
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running com.ibm.gpuenabler.TestJavaCUDASuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 21.578 sec - in com.ibm.gpuenabler.TestJavaCUDASuite
Results :
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- maven-dependency-plugin:2.10:copy-dependencies (strip-native-lib-version-exec-test) @ gpu-enabler_2.10 ---
[INFO] jcuda:libJCudaDriver:so:linux-x86_64:0.7.0a already exists in destination.
[INFO] jcuda:libJCudaRuntime:so:linux-x86_64:0.7.0a already exists in destination.
[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (test) @ gpu-enabler_2.10 ---
Discovery starting.
Discovery completed in 194 milliseconds.
Run starting. Expected test count is: 27
CUDAFunctionSuite:
- Ensure CUDA kernel is serializable
[Stage 0:> (0 + 0) / 4]
- Run count()
- Run identity CUDA kernel on a single primitive column
- Run identity CUDA kernel on a single primitive array column
- Run identity CUDA kernel on a single primitive array in a structure
- Run add CUDA kernel with free variables on a single primitive array column
- Run vectorLength CUDA kernel on 2 col -> 1 col
- Run plusMinus CUDA kernel on 2 col -> 2 col
- Run applyLinearFunction CUDA kernel on 1 col + 2 const arg -> 1 col
- Run blockXOR CUDA kernel on 1 col + 1 const arg -> 1 col on custom dimensions
- Run sum CUDA kernel on 1 col -> 1 col in 2 stages
- Run map on rdds - single partition
- Run map on rdds - multiple partition - test empty partition
- Run reduce on rdds - single partition
- Run map + reduce on rdds - single partition
- Run map on rdds with 100,000 elements - multiple partition
- Run map + reduce on rdds - multiple partitions
[Stage 0:========> (9 + 8) / 64]
[Stage 0:==============> (16 + 8) / 64]
[Stage 0:=====================> (24 + 8) / 64]
[Stage 0:============================> (32 + 8) / 64]
[Stage 0:=================================> (38 + 8) / 64]
[Stage 0:=====================================> (42 + 8) / 64]
[Stage 0:=========================================> (47 + 8) / 64]
[Stage 0:==========================================> (48 + 8) / 64]
[Stage 0:=================================================> (56 + 8) / 64]
[Stage 0:=======================================================> (62 + 2) / 64]
- Run map + reduce on rdds with 100,000,000 elements - multiple partitions *** FAILED ***
-296974186 did not equal 1974919424 (CUDAFunctionSuite.scala:675)
- Run map + map + reduce on rdds - multiple partitions
- Run map + map + map + collect on rdds
- Run map + map + map + reduce on rdds - multiple partitions
- Run map on rdd with a single primitive array column *** FAILED ***
scala.this.Predef.intArrayOps(outputItr.next()).toIndexedSeq.sameElements[Int](scala.this.Predef.intWrapper(0).to(n.-(1))) was false (CUDAFunctionSuite.scala:812)
- Run map with free variables on rdd with a single primitive array column
- Run reduce on rdd with a single primitive array column *** FAILED ***
scala.this.Predef.intArrayOps(output).toIndexedSeq.sameElements[Int](scala.this.Predef.intWrapper(n).to(2.*(n).-(1)).map[Int, scala.collection.immutable.IndexedSeq[Int]](((x$8: Int) => x$8.*(2)))(immutable.this.IndexedSeq.canBuildFrom[Int])) was false (CUDAFunctionSuite.scala:885)
- Run map & reduce on a single primitive array in a structure
- Run logistic regression *** FAILED ***
382.29565646287256 was not less than 1.0E-7 (CUDAFunctionSuite.scala:1027)
- CUDA GPU Cache Testcase
Run completed in 2 minutes, 48 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 23, failed 4, canceled 0, ignored 0, pending 0
*** 4 TESTS FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] gpu-enabler-parent ................................. SUCCESS [ 0.002 s]
[INFO] mavenized-jcuda .................................... SUCCESS [ 1.121 s]
[INFO] gpu-enabler_2.10 ................................... FAILURE [03:21 min]
[INFO] GPU Enabler Examples ............................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:22 min
[INFO] Finished at: 2016-09-15T13:07:34+03:00
[INFO] Final Memory: 60M/945M
[INFO] ------------------------------------------------------------------------
I'm having the same issue. I've tried it on 4 separate computers. 3 have GTX 960s and one has a GTX 1060. I get a different number for the sum every time.
Thanks for your interest in this project.
One issue is related to integer overflow and can be handled quickly. We are looking into the other failures, which are related to x86 GPUs, and will update this post with our findings soon.
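To illustrate the integer overflow point, here is a minimal Java sketch. It is an assumption about what the failing 100,000,000-element test computes, not code taken from CUDAFunctionSuite.scala. The class and method names (OverflowSketch, wrappedSum, exactSum) are hypothetical. The expected value in the log above, 1974919424, happens to equal the sum of 2*i for i in 1..100,000,000 after wrapping to 32 bits, so the expected value itself may already depend on wraparound.

```java
public class OverflowSketch {
    // Sum 2*i for i in 1..n with a 32-bit accumulator.
    // Java int arithmetic wraps silently on overflow (two's complement).
    static int wrappedSum(int n) {
        int acc = 0;
        for (int i = 1; i <= n; i++) {
            acc += 2 * i; // 2*i fits in an int for n = 100,000,000; acc does not
        }
        return acc;
    }

    // Same sum with a 64-bit accumulator, which does not overflow for this n.
    static long exactSum(int n) {
        long acc = 0L;
        for (int i = 1; i <= n; i++) {
            acc += 2L * i;
        }
        return acc;
    }

    public static void main(String[] args) {
        int n = 100_000_000; // assumed element count, taken from the test name
        System.out.println(wrappedSum(n));     // 1974919424
        System.out.println(exactSum(n));       // 10000000100000000
        System.out.println((int) exactSum(n)); // 1974919424 (low 32 bits)
    }
}
```

If the test is meant to check the mathematical sum, a long accumulator on both the host and the GPU side would avoid the wraparound. If wrapping is intended, both sides need to wrap identically.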
I have run the test suite multiple times on a Tesla K20m GPU with CUDA 7.5 and the results look non-deterministic.
The number of failures was always between 3 and 9 (except for two runs where com.ibm.gpuenabler.TestJavaCUDASuite failed and the remaining tests were skipped). Here is a particularly bad run, with 9 failed tests:
One failure even produced a stacktrace (which might be helpful):