criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0

Query Result Distances Appear in Descending Order for ANN Search #169

Closed ivishalanand closed 1 year ago

ivishalanand commented 1 year ago

I've been working with an index trained on approximately 3 million 512-dimension embeddings using the following configuration:

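import time

from autofaiss import build_index

# EMBEDDINGS_FOLDER (the folder holding the parquet embeddings) is defined elsewhere.
MAX_INDEX_MEMORY_USAGE = "25G"
CURRENT_MEMORY_AVAILABLE = "32G"
MAX_INDEX_QUERY_TIME_MS = 200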

def build_index_function():
    """
    Build the index using the specified parameters.
    """
    start_time = time.time()
    index, index_infos = build_index(
        embeddings=EMBEDDINGS_FOLDER,
        file_format="parquet",
        embedding_column_name="image_embedding",
        temporary_indices_folder="autofaiss-indices",
        index_path="autofaiss.index",
        index_infos_path="infos.json",
        max_index_memory_usage=MAX_INDEX_MEMORY_USAGE,
        current_memory_available=CURRENT_MEMORY_AVAILABLE,
        max_index_query_time_ms=MAX_INDEX_QUERY_TIME_MS,
    )
    return index, index_infos

Indexing information was reported as follows:

INFO:autofaiss:{
    index_key: HNSW32
    index_param: efSearch=7636
    index_path: autofaiss.index
    size in bytes: 6924951466
    avg_search_speed_ms: 192.53737680040874
    99p_search_speed_ms: 306.6442671418192
    reconstruction error %: 0.0
    nb vectors: 2984710
    vectors dimension: 512
    compression ratio: 0.8827045373548037
}

However, when loading the faiss index and performing an ANN search, the distances seem to be in descending order. Based on my understanding, the distances for the nearest neighbour should be in ascending order.

Example:

import faiss

index = faiss.read_index("autofaiss-full.index")
D, I = index.search(embedding_query_vector, 5)

print(D)

Output: array([[6.240374 , 6.2014914, 6.2014666, 6.1511683, 6.1026025], ... [8.429452 , 8.385169 , 8.345607 , 8.330268 , 8.313236 ]], dtype=float32)

print(I)

array([[2833057, 2256886, 1735613, 2845449, 2776100], ... [ 896596, 820252, 1919448, 2013961, 2935604]])

Is this the expected behavior or is there a possible issue with the order of distances returned?

hitchhicker commented 1 year ago

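Hey! There are 2 similarity functions, selected by metric_type. For "ip" (the default), which is inner product, this is the expected behaviour: a higher inner product means a closer match, so results are returned from highest to lowest score.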
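To illustrate why the order differs between the two metrics, here is a minimal brute-force sketch in plain Python. It does not use Faiss or autofaiss: the helper names (inner_product, squared_l2, top_k) and the toy vectors are made up for this example. It assumes the second metric is "l2" and that distances for it are squared Euclidean distances, which is what Faiss L2 indexes report.

```python
import heapq


def inner_product(a, b):
    """Similarity score: larger means a closer match."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Distance: smaller means a closer match."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k, metric):
    """Hypothetical brute-force search returning (scores, ids), best match first."""
    if metric == "ip":
        scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
        best = heapq.nlargest(k, scored)   # best match = highest score
    elif metric == "l2":
        scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
        best = heapq.nsmallest(k, scored)  # best match = lowest distance
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [s for s, _ in best], [i for _, i in best]


# Toy data, invented for illustration only.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

ip_scores, ip_ids = top_k(query, vectors, 3, "ip")
l2_scores, l2_ids = top_k(query, vectors, 3, "l2")

print("ip:", ip_scores, ip_ids)  # scores go from high to low
print("l2:", l2_scores, l2_ids)  # distances go from low to high
```

In both cases the best match comes first; only the direction of the numbers changes. If ascending distances are preferred, autofaiss's build_index accepts metric_type="l2" when the index is built.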

ivishalanand commented 1 year ago

Okay, got it, thanks!