Sharded index behaves differently on different GPUs when integrated with fairseq code #2297

Open lishaojun412 opened 2 years ago

lishaojun412 commented 2 years ago


OS: Ubuntu18.04

Faiss version: 1.7.2

Installed from: compiled

Faiss compilation options:

    cmake -B build \
      -DFAISS_ENABLE_GPU=ON \
      -DFAISS_ENABLE_C_API=ON \
      -DFAISS_ENABLE_PYTHON=ON \
      -DBUILD_TESTING=ON \
      -DCMAKE_CUDA_FLAGS="-gencode arch=compute_75,code=sm_75" \
      -DPython_EXECUTABLE=/usr/bin/python3.6 \
      .

Running on:


Reproduction instructions

I instrumented the source code to measure where the time goes, and I have 2 questions:

1. The inclusive_scan call takes the most time of all the steps, about 90% of the total. Is this normal?
2. I have a 20 GB index stored on T4 GPUs in 2 shards, and I search it from within fairseq code. The inclusive_scan call behaves differently on the 2 shard GPUs: one GPU, the one that runs the fairseq inference code, takes about 3 ms, while the other takes about 7 ms. The difference is large and strange. Without the fairseq integration, the search time on the 2 shard GPUs is the same, about 3-4 ms.

The inclusive_scan code in IVFUtils.cu, with my timing added:

```
double t0 = getmillisecs();
thrust::inclusive_scan(
        thrust::cuda::par(alloc).on(stream),
        prefixSumOffsets.data(),
        prefixSumOffsets.data() + totalSize,
        prefixSumOffsets.data());
CUDA_TEST_ERROR();
double t1 = getmillisecs();
// Prints "#<elapsed ms> <device id> " for every call.
printf("#%.3f %d ", t1 - t0, getCurrentDevice());
```

The fairseq search code:

```
import faiss

index = faiss.read_index(args.indexfile, faiss.IO_FLAG_ONDISK_SAME_DIR)

co = faiss.GpuMultipleClonerOptions()
co.useFloat16LookupTables = True
co.useFloat16 = True
co.useFloat16CoarseQuantizer = True
co.usePrecomputed = False
co.indicesOptions = 2  # 0: INDICES_CPU, 2: INDICES_32_BIT, 3: INDICES_64_BIT
co.verbose = True
co.shard = True  # the replicas will be made "manually"
co.shard_type = 2

vres, vdev = make_vres_vdev(0, 2)  # helper defined elsewhere (not shown)
index.threaded = True
index = faiss.index_cpu_to_gpus_list(index, co=co, gpus=[0, 1])

# Set nprobe on each shard's sub-index.
faiss.downcast_index(index.at(0)).nprobe = 4
faiss.downcast_index(index.at(1)).nprobe = 4

dists, knns = index.search(queries.detach().cpu().float().numpy(), k)
```
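To compare the two shards from the Python side, independently of the printf inside IVFUtils.cu, a small wall-clock timing helper might be useful. The following is only a sketch: `time_search`, `n_warmup`, and `n_runs` are hypothetical names introduced here, not part of Faiss or fairseq, and it measures end-to-end call time rather than the inclusive_scan step alone.

```python
import statistics
import time


def time_search(search_fn, n_warmup=3, n_runs=20):
    """Return (median_ms, min_ms, max_ms) of wall-clock time for search_fn().

    search_fn is any zero-argument callable; this helper knows nothing
    about Faiss itself (hypothetical helper, for illustration only).
    """
    for _ in range(n_warmup):
        search_fn()  # warm-up calls are executed but not timed

    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        search_fn()
        samples.append((time.perf_counter() - t0) * 1000.0)  # seconds -> ms

    return statistics.median(samples), min(samples), max(samples)
```

Assuming the sharded `index` from the snippet above and a float32 query array `xq` (a hypothetical name), each shard could then be timed on its own with something like `time_search(lambda: faiss.downcast_index(index.at(0)).search(xq, k))`, once inside the fairseq process and once in a standalone script.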
lishaojun412 commented 2 years ago

I ran the bench_gpu_sift1m.py script with the timing print enabled on one T4 GPU, and found that the thrust::inclusive_scan time alternates between two values like this (nprobe=8):

    #ms device_id
    #0.330 0
    #1.074 0
    #0.330 0
    #1.074 0
    #0.330 0
    #1.074 0
    ......
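For longer runs, the printed timings can be grouped per device instead of read by eye. The sketch below assumes the log consists of tokens in the `#%.3f %d ` format produced by the printf above; `timings_by_device` and `summarize` are hypothetical helper names introduced for illustration.

```python
import re
from collections import defaultdict

# Matches one "#<milliseconds> <device id>" record, e.g. "#0.330 0".
LOG_PATTERN = re.compile(r"#(\d+\.\d+) (\d+)")


def timings_by_device(log_text):
    """Group the logged inclusive_scan times (in ms) by device id."""
    per_device = defaultdict(list)
    for ms, dev in LOG_PATTERN.findall(log_text):
        per_device[int(dev)].append(float(ms))
    return dict(per_device)


def summarize(per_device):
    """Return {device: (count, mean_ms, min_ms, max_ms)}."""
    return {
        dev: (len(v), sum(v) / len(v), min(v), max(v))
        for dev, v in per_device.items()
    }


if __name__ == "__main__":
    sample = "#0.330 0 #1.074 0 #0.330 0 #1.074 0 "
    for dev, stats in summarize(timings_by_device(sample)).items():
        print(dev, stats)
```

A header such as `#ms device_id` is skipped automatically because it contains no numeric record.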