facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Memory Errors #831

Closed · Anjum48 closed 5 years ago

Anjum48 commented 5 years ago

Summary

Platform

OS: Ubuntu 19.04

Faiss version: Conda 1.5.1

Faiss compilation options:

Running on: GPU

Interface: Python

Reproduction instructions

I'm getting repeatable memory errors using GPUs with 2x RTX 2080 Tis. Is there an option I can set to get things moving?

```python
def nearest_neighbours(query_vecs, reference_vecs, resources, top_k=100):
    nn_index = faiss.IndexFlatIP(reference_vecs.shape[1])
    nn_index = faiss.index_cpu_to_all_gpus(nn_index)
    nn_index.add(reference_vecs)
    distances, indexes = nn_index.search(x=query_vecs, k=top_k)
    nn_index.reset()
    return distances, indexes
```

```
terminate called after throwing an instance of 'faiss::FaissException'
  what():  Error in void faiss::gpu::allocMemorySpaceV(faiss::gpu::MemorySpace, void**, size_t) at gpu/utils/MemorySpace.cpp:27: Error: 'err == cudaSuccess' failed: failed to cudaMalloc 8998592512 bytes (error 2 out of memory)
```
wickedfoo commented 5 years ago

How much data are you adding in reference_vecs (# of vecs and dimension), and also what's the size of the query_vecs?

Anjum48 commented 5 years ago

Thanks for the reply! reference_vecs = (1067000, 2048) and query_vecs = (113000, 2048).

I have a suspicion that memory is not being released after an index is no longer needed; however, my understanding is that reset() should take care of this. Watching the memory usage in nvidia-smi, I can definitely see the memory drop every time reset() is called, but perhaps not all the way down to the level required for the next operation?
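For what it's worth, a minimal sketch of the cleanup I would expect to release everything, assuming reset() only clears the stored vectors while the cloned per-GPU index objects still hold device resources until the Python reference is dropped:

```python
distances, indexes = nn_index.search(x=query_vecs, k=top_k)
nn_index.reset()  # releases the vectors added to the index
del nn_index      # releases the per-GPU index objects themselves
```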

My pipeline calls this function first, then nearest_neighbours:

```python
import faiss
import numpy as np

def average_query_expansion(query_vecs, reference_vecs, resources, top_k=5):
    """
    Average Query Expansion (AQE)
    Ondrej Chum, et al. "Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval,"
    International Conference of Computer Vision. 2007.
    https://www.robots.ox.ac.uk/~vgg/publications/papers/chum07b.pdf
    https://github.com/leeesangwon/PyTorch-Image-Retrieval/blob/public/inference.py
    """

    query_vecs = query_vecs.astype(np.float32)
    reference_vecs = reference_vecs.astype(np.float32)

    # Query augmentation
    query_aug = faiss.IndexFlatIP(reference_vecs.shape[1])
    query_aug = faiss.index_cpu_to_all_gpus(query_aug)
    query_aug.add(reference_vecs)
    distances, indexes = query_aug.search(x=query_vecs, k=top_k)
    query_aug.reset()

    top_k_ref_mean = np.mean(reference_vecs[indexes], axis=1, dtype=np.float32)
    query_vecs = np.concatenate([query_vecs, top_k_ref_mean], axis=1)

    # Reference augmentation
    ref_aug = faiss.IndexFlatIP(reference_vecs.shape[1])
    ref_aug = faiss.index_cpu_to_all_gpus(ref_aug)
    ref_aug.add(reference_vecs)
    distances, indexes = ref_aug.search(x=reference_vecs, k=top_k + 1)  # +1 because each reference vector's nearest neighbour is itself
    ref_aug.reset()

    top_k_ref_mean = np.mean(reference_vecs[indexes], axis=1, dtype=np.float32)
    reference_vecs = np.concatenate([reference_vecs, top_k_ref_mean], axis=1)

    return query_vecs, reference_vecs
```

average_query_expansion expands the embeddings to (N, 4096) before passing them to nearest_neighbours.

Sometimes I get the error in step 1 (average_query_expansion), but usually in step 2 (nearest_neighbours). With CPU only it works fine (albeit slowly).
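One detail that may matter here (an observation, not something discussed above): index_cpu_to_all_gpus replicates the full index on every GPU by default, so in step 2 each 2080 Ti would have to hold the entire (1067000, 4096) float32 index, roughly 17.5 GB against 11 GB of VRAM. Sharding the index across the GPUs instead is opt-in; a sketch using the cloner options:

```python
# sketch: shard the stored vectors across the GPUs instead of replicating
# them, roughly halving the per-GPU memory footprint of the index
co = faiss.GpuMultipleClonerOptions()
co.shard = True
nn_index = faiss.index_cpu_to_all_gpus(nn_index, co=co)
```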

Anjum48 commented 5 years ago

Things that I have tried that didn't work:

Anjum48 commented 5 years ago

After playing with my toy dataset I think the issue might lie somewhere else in my code. I'll close this for now and reopen if I'm still facing issues

Anjum48 commented 5 years ago

Ok, I'm still getting MemoryErrors with the following toy example, which is based on an example from the docs:

```python
import numpy as np
import faiss
import torch

def swig_ptr_from_FloatTensor(x):
    assert x.is_contiguous()
    assert x.dtype == torch.float32
    return faiss.cast_integer_to_float_ptr(
        x.storage().data_ptr() + x.storage_offset() * 4)

def swig_ptr_from_LongTensor(x):
    assert x.is_contiguous()
    assert x.dtype == torch.int64, 'dtype=%s' % x.dtype
    return faiss.cast_integer_to_long_ptr(
        x.storage().data_ptr() + x.storage_offset() * 8)

def search_knn(res, xb, xq, k, D=None, I=None, metric=faiss.METRIC_INNER_PRODUCT):
    assert xb.device == xq.device

    xq_ptr = swig_ptr_from_FloatTensor(xq)
    nq, d = xq.size()

    xb_ptr = swig_ptr_from_FloatTensor(xb)
    nb, d2 = xb.size()
    assert d2 == d

    if D is None:
        D = torch.empty(nq, k, device=xb.device, dtype=torch.float32)
    else:
        assert D.shape == (nq, k)
        assert D.device == xb.device

    if I is None:
        I = torch.empty(nq, k, device=xb.device, dtype=torch.int64)
    else:
        assert I.shape == (nq, k)
        assert I.device == xb.device

    D_ptr = swig_ptr_from_FloatTensor(D)
    I_ptr = swig_ptr_from_LongTensor(I)

    faiss.bruteForceKnn(res, metric,
                        xb_ptr, nb,
                        xq_ptr, nq,
                        d, k, D_ptr, I_ptr)

    return D, I

index_embeddings = np.random.normal(size=(1000000, 2048)).astype(np.float32)
test_embeddings = np.random.normal(size=(100000, 2048)).astype(np.float32)

# move to pytorch & GPU
index_embeddings = torch.from_numpy(index_embeddings).cuda()
test_embeddings = torch.from_numpy(test_embeddings).cuda()

# resource object, can be re-used over calls
res = faiss.StandardGpuResources()

# put on same stream as pytorch to avoid synchronizing streams
res.setDefaultNullStreamAllDevices()

distances, indexes = search_knn(res, index_embeddings, test_embeddings, k=100)
```

Output:

```
Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/retrieval/faiss error.py", line 66, in <module>
    distances, indexes = search_knn(res, index_embeddings, test_embeddings, k=100)
  File "/home/anjum/PycharmProjects/retrieval/faiss error.py", line 48, in search_knn
    d, k, D_ptr, I_ptr)
RuntimeError: Error in void faiss::gpu::allocMemorySpaceV(faiss::gpu::MemorySpace, void**, size_t) at gpu/utils/MemorySpace.cpp:27: Error: 'err == cudaSuccess' failed: failed to cudaMalloc 1610612736 bytes (error 2 out of memory)
```

Any ideas on how to get things moving? I think I need to somehow batch the data to the GPU, but the example given isn't entirely obvious to me.
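A sketch of one way to batch (the helper name and batch size are illustrative, not from the docs): keep the database tensor resident on the GPU and run search_knn over chunks of the queries, so the distance matrix and temporary buffers stay small:

```python
def search_knn_batched(res, xb, xq, k, batch_size=10000):
    # reuses search_knn from the example above, searching chunk by chunk
    D_parts, I_parts = [], []
    for i0 in range(0, xq.size(0), batch_size):
        D, I = search_knn(res, xb, xq[i0:i0 + batch_size].contiguous(), k)
        D_parts.append(D)
        I_parts.append(I)
    return torch.cat(D_parts), torch.cat(I_parts)

distances, indexes = search_knn_batched(res, index_embeddings, test_embeddings, k=100)
```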

Anjum48 commented 5 years ago

Ok, so it turns out quantization is my friend, and IndexIVFPQ in particular, when it comes to memory issues.

The wiki says "The index types IndexFlat, IndexIVFFlat and IndexIVFPQ are implemented on the GPU, as GpuIndexFlat, GpuIndexIVFFlat and GpuIndexIVFPQ. In addition to their normal arguments, they take a resource object as input, along with index storage configuration options and float16/float32 configuration parameters."

It would be great to have an example somewhere that shows how to set the float16 option.
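A sketch of what that could look like when cloning to multiple GPUs (useFloat16 is a field on GpuMultipleClonerOptions; the variable names are illustrative):

```python
# store the flat index's vectors as float16 on the GPUs, roughly
# halving the memory footprint compared to float32
co = faiss.GpuMultipleClonerOptions()
co.useFloat16 = True
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index, co=co)
```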

ucasiggcas commented 4 years ago

Hi, I have the same problem. Did you have any success? Could you please help me? Thanks.

nobody-cheng commented 4 years ago

> Hi, I have the same problem. Did you have any success? Could you please help me? Thanks.

I have the same problem as well. Did you have any success?

mathlf2015 commented 3 years ago

Hello, do you have any solution? I've met the same problem.

wenjiaXu commented 3 years ago

For those who know Chinese, this blog clearly explains how to solve the "out of memory" problem. For English users, please check the following code:

```python
nlist = 100
m = 8                             # number of bytes per vector
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                  # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k)    # sanity check
print(I)
print(D)
index.nprobe = 10                 # make comparable with experiment above
D, I = index.search(xq, k)        # search
print(I[-5:])
```
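For scale (my arithmetic, not from the comment above): with m = 8, each stored vector is compressed to 8 one-byte codes, so one million 2048-dimensional float32 vectors shrink from roughly 8 GB of raw data to about 8 MB of codes, plus the centroids and the coarse quantizer.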

Yimin-Liu commented 3 years ago

https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

All GPU indexes are built with a StandardGpuResources object (which is an implementation of the abstract class GpuResources). The resource object contains needed resources for each GPU in use, including an allocation of temporary scratch space (by default, about 2 GB on a 12 GB GPU), cuBLAS handles and CUDA streams.

Scratch memory

The temporary scratch space allocated via the GpuResources object is important for speed and for avoiding unnecessary GPU/GPU and GPU/CPU synchronizations via cudaFree. All faiss GPU code strives to be allocation-free on the GPU, assuming temporary state (intermediate results of calculations and the like) can fit into the scratch space. The temporary space reservation can be adjusted to an arbitrary amount of GPU memory, or even set to 0 bytes, via the setTempMemory method.

There are broadly two classes of memory allocations in GPU Faiss: permanent and temporary. Permanent allocations are retained for the lifetime of the index, and are ultimately owned by the index.

Temporary allocations are made out of a memory stack that GpuResources allocates up front, falling back to the heap (cudaMalloc) when the stack is exhausted. These allocations do not live beyond the lifetime of a top-level call to a Faiss index: on the GPU they are ordered with respect to the ordering stream, and once all kernels on that stream have completed, the temporary allocation is no longer needed and can be reused or freed. Generally about 1 GB or so of memory should be reserved in this stack to avoid cudaMalloc/cudaFree calls during many search operations.

If the scratch memory is too small, you may notice slowdowns due to cudaMalloc and cudaFree. The high-water mark of scratch-space usage can be queried from the resources object, so the reservation can be adjusted to suit actual needs.
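As a sketch of that knob (setTempMemory is a method on StandardGpuResources; the 256 MB figure is just an example value):

```python
import faiss

res = faiss.StandardGpuResources()
res.setTempMemory(256 * 1024 * 1024)  # scratch reservation in bytes; 0 disables it
```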

muximus3 commented 2 years ago

> For those who know Chinese, this blog clearly explains how to solve the "out of memory" problem. [...]

The blog referenced is just a translation of the official document.