facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

CUDA 11 issue with brute force KNN distance calculation #1385

Closed: dantegd closed this issue 4 years ago

dantegd commented 4 years ago

Summary

After issues we ran into in cuML's KNN regressor and classifier (which use FAISS underneath), we found a CUDA 11-specific issue that seems to stem from half-precision optimizations in cublasSgemmEx. The fix is very simple and I'll be glad to contribute a PR for it, but I wanted to create an issue to document it for future reference (and future CUDA versions).

The issue is that one step of the distance calculation shows numeric discrepancies compared to older CUDA versions, specifically this step:

https://github.com/facebookresearch/faiss/blob/c97f8906511dbb97079eaf44329db5aa3216470a/faiss/gpu/impl/Distance.cu#L193

In particular, the issue is easy to see when inspecting the distances returned by bfKnn under CUDA 11. I've attached a simple script along with data of sufficient size to notice the discrepancies (1000 rows, 30 columns, k=10), with the vectors and queries being identical. Here are the first 20 entries of the resulting distances:

0.0195312
22.7321
27.8344
27.9221
28.6282
29.1412
30.1401
30.9479
31.3218
33.1479
0.131836
29.1755
29.7333
33.4289
35.2295
35.2607
36.1398
36.3448
37.0358
37.0606

As opposed to what we would expect (and what CUDA 10.1/10.2 produce, as well as CUDA 11 with the fix):

0
22.7664
27.8207
27.9999
28.6924
29.0683
30.1964
30.9391
31.2164
33.1111
0
29.0369
29.7889
33.3521
35.2543
35.4309
36.0618
36.356
36.9778
37.0444

These numeric differences cause a larger number of discrepancies than we would expect from the GPU bfKnn in many scenarios. This is easily solved by changing CUDA 11's cublasSgemmEx math mode in https://github.com/facebookresearch/faiss/blob/c97f8906511dbb97079eaf44329db5aa3216470a/faiss/gpu/utils/MatrixMult-inl.cuh, with:

#if __CUDACC_VER_MAJOR__ >= 11
    cublasSetMathMode(handle, CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);
#endif
    return cublasSgemmEx(handle, transa, transb, m, n, k,
                         &fAlpha, A, cAT, lda,
                         B, cBT, ldb,

With that in place, the results are back to the expected values, and we've verified that we get correct results in our test suite.

Platform

OS: Linux/CUDA 11

Faiss version: 1.6.3 and the latest commit

Faiss compilation options: with CMake (changing the GPU arch depending on the system I run on):

cmake -B build . -DCMAKE_CUDA_ARCHITECTURES="75" -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DBUILD_TESTING=ON  

with make:

LIBS=-pthread
CPPFLAGS=-w
LDFLAGS=-L${CMAKE_INSTALL_PREFIX}/lib
      ${CMAKE_CURRENT_BINARY_DIR}/faiss/src/faiss/configure
  --prefix=${CMAKE_CURRENT_BINARY_DIR}/faiss
  --with-blas=${BLAS_LIBRARIES}
  --with-cuda=${CUDA_TOOLKIT_ROOT_DIR}
  --with-cuda-arch=70;75;80
  -v

We usually link OpenBLAS for --with-blas, but that should not affect these results since they are cuBLAS-based.

Running on: GPU

Interface: C++

Reproduction instructions

See the attached script. It is a simple C++ program that reads a set of vectors and calls bfKnn; for quick reference, these are the parameters used:

    args.metric = faiss::MetricType::METRIC_L2;
    args.metricArg = 0;
    args.k = 10;
    args.dims = 30;
    args.numVectors = 1000;
    args.numQueries = 1000;

faiss_repro.zip

wickedfoo commented 4 years ago

Oh wow. I have a machine with CUDA 11 and an A100 now, will look into this on Monday and send out a fix. Thanks for finding this!

wickedfoo commented 4 years ago

https://github.com/facebookresearch/faiss/pull/1388

wickedfoo commented 4 years ago

Implemented in #1388.