Diagnosing test failures

clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL

Apache License 2.0


Diagnosing test failures #273

Open anadon opened 8 years ago

anadon commented 8 years ago

So I took the time to run the complete set of tests with the current devel branch on a Linux system running Nvidia's 352.79 driver. One immediate issue is that I don't know the appropriate way to make the 415MB file available here. Next is dissecting what the series of errors means and what configuration error, testing error, or divine intervention is behind it.

kknox commented 8 years ago

Hi @anadon, we are also tracking errors on the Nvidia platform; can you compare your failures with what we see on this arrayfire dashboard?
Let me know if you see significantly different errors.

One important 'gotcha' when running the clBLAS unit tests: the only confirmed working reference implementations for correctness checking are MKL and Netlib BLAS. Make sure to link to one of those when building test-correctness/test-short.

anadon commented 8 years ago

There are different errors if I'm reading things correctly. What should the next step be?

kknox commented 8 years ago

Check to see if the failure errors are really small; our unit tests expect the results to be bit-exact. That's why the reference implementation should be MKL or Netlib BLAS. On your system, look to see whether the test failures that you see (in addition to the ones on the arrayfire dashboard) differ from the reference by only something on the order of 10e-6. Those are usually floating point rounding errors that we are marking as failures.
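To separate rounding noise from real errors, it can help to measure how far a result is from the reference rather than testing for exact equality. The sketch below is one way to do that, not how the clBLAS tests compare results; `maxRelativeError` and `withinTolerance` are hypothetical names, and the tolerance is whatever the reader chooses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest relative difference between a computed result and a reference result.
// Differences are scaled by max(|ref|, 1) so reference values near zero do not
// blow up the ratio. Only the common prefix is compared if the sizes differ.
inline double maxRelativeError(const std::vector<double>& got,
                               const std::vector<double>& ref) {
    double worst = 0.0;
    const std::size_t n = std::min(got.size(), ref.size());
    for (std::size_t i = 0; i < n; ++i) {
        const double diff = std::fabs(got[i] - ref[i]);
        if (std::isnan(diff)) {
            // A NaN on either side is a real failure, never rounding noise.
            return std::numeric_limits<double>::infinity();
        }
        const double scale = std::max(std::fabs(ref[i]), 1.0);
        worst = std::max(worst, diff / scale);
    }
    return worst;
}

// True when every element is within `tol` of the reference in the sense above.
inline bool withinTolerance(const std::vector<double>& got,
                            const std::vector<double>& ref,
                            double tol) {
    return maxRelativeError(got, ref) <= tol;
}
```

A failure that passes `withinTolerance` with a tolerance around the magnitude mentioned above is probably a rounding difference; one that does not is worth reporting separately.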

anadon commented 8 years ago

/home/campus14/jrmarsha/clBLASorigin/clBLAS/src/tests/correctness/corr-gemm.cpp:183: Failure
Value of: err
Actual: -36
Expected: 0
waitForSuccessfulFinish()
[ FAILED ] SelectedBig_2/GEMM.dgemm/1, where GetParam() = (0, 1, 0, 5777, 5333, 3000, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (75583 ms)

But I'm more concerned about the 200,000+ occurrences of this: Failed to create/enqueue buffer for a matrix.

Almost no tests were actually run.
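The `Actual: -36` in the failure above is easier to reason about with a name attached. The sketch below assumes that `err` holds a raw OpenCL status code, which the thread does not confirm. The numeric values are the ones defined in the Khronos `cl.h` header; `clErrorName` is a hypothetical helper, and the table covers only a few codes related to queues, contexts, and buffer allocation.

```cpp
#include <string>

// Map a handful of OpenCL status codes to their names from the Khronos cl.h header.
// Deliberately partial: only codes related to queues, contexts, and buffers.
inline std::string clErrorName(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -34: return "CL_INVALID_CONTEXT";
        case -36: return "CL_INVALID_COMMAND_QUEUE";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -61: return "CL_INVALID_BUFFER_SIZE";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

Under that assumption, `clErrorName(-36)` gives `CL_INVALID_COMMAND_QUEUE`, which suggests looking at the command queue the test was waiting on as well as at the matrix contents.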

pavanky commented 8 years ago

@anadon what gpu are you running on?

anadon commented 8 years ago

Quadro K620
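Since the repeated message is about creating matrix buffers, a rough size estimate for the failing case may help when comparing against the device limits reported for this GPU. This is only a sketch: it assumes the values 5777, 5333, 3000 in the test parameters are the GEMM dimensions M, N, K, that elements are 8-byte doubles, and that matrices are stored without padding, none of which the thread confirms. `GemmFootprint` and `gemmFootprint` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>

// Byte sizes of the three GEMM operands for C (M x N) = A (M x K) * B (K x N),
// assuming dense storage with no leading-dimension padding or offsets.
struct GemmFootprint {
    std::uint64_t a;  // bytes for A
    std::uint64_t b;  // bytes for B
    std::uint64_t c;  // bytes for C

    std::uint64_t total() const { return a + b + c; }
    std::uint64_t largest() const { return std::max(a, std::max(b, c)); }
};

inline GemmFootprint gemmFootprint(std::uint64_t m, std::uint64_t n,
                                   std::uint64_t k, std::uint64_t elemSize) {
    return GemmFootprint{m * k * elemSize, k * n * elemSize, m * n * elemSize};
}

// True when every single buffer fits under `maxAlloc` and all three together fit
// under `globalMem`. Both limits must come from the device itself, for example
// the values the OpenCL runtime reports for maximum allocation size and global memory.
inline bool fitsOnDevice(const GemmFootprint& f, std::uint64_t maxAlloc,
                         std::uint64_t globalMem) {
    return f.largest() <= maxAlloc && f.total() <= globalMem;
}
```

With those assumptions, `gemmFootprint(5777, 5333, 3000, sizeof(double))` comes to roughly 513 MB in total, with C alone near 246 MB. Whether that exceeds a limit on this card depends on figures the thread does not give.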

tingxingdong commented 8 years ago

Hi Anadon, are you running on Linux? If so, check out the develop branch. With my last pull request, #274, you can verify the correctness of gemm results against the Netlib CBLAS by running the "client" executable, which should be in the /staging/ dir.

anadon commented 8 years ago

Did that -- I'm trying to get a test system that works with any case before I start throwing my experimental code at it. And I did pull from the most recent development branch.

tingxingdong commented 8 years ago

With the client, you can specify any case (matrix size, transpose, ...) on the command line. You do not necessarily need to write code from scratch to see whether the result is correct.

anadon commented 8 years ago

...I just ran the full test correctness program? Is there something else I should have done to test it?

tingxingdong commented 8 years ago

The "client" is a complementary tool for checking correctness. It allows users to quickly check a specific case they are interested in on Linux (see pull request #274).

The "full test correctness program" has many different cases predefined and hard-coded. It is expected to run for a long time.

I recommend you run the client if you are testing gemm/trmm on Linux.

anadon commented 8 years ago

I'm trying to make sure I have a fully working environment before I start messing around with more code. I need to find out what is wrong with my setup or clBLAS, not limit my testing.