facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Unable to access index `metric_type` from ClientServer #2614

Open leothomas opened 1 year ago

leothomas commented 1 year ago

Summary

Hey y'all, thanks for putting this library together, and taking a look at this question.

I've got an index split up across multiple servers and a client server that connects to the sub-indexes using the socket library, in order to execute searches on the distributed index (loosely following this demo).

I want to be able to update/re-train the indexes with different parameters, especially switching between Cosine (inner product + normalization) and Euclidean (L2) distance metrics. While the client index, which executes the search, works the same way regardless of which distance metric is used, if the index I'm searching against uses the inner product search metric, I have to first normalize the search vectors.
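For reference, here is a minimal pure-Python sketch of why the normalization step matters: the inner product of two L2-normalized vectors equals their cosine similarity. The helper names (`l2_normalize`, `inner_product`, `cosine_similarity`) are made up for illustration and are not part of faiss; in practice `faiss.normalize_L2` does the normalization in place on a float32 array.

```python
import math


def l2_normalize(v):
    """Return a copy of v scaled to unit L2 norm (assumes v is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computed from raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0]
b = [1.0, 2.0]

# Inner product on normalized vectors matches cosine similarity on raw ones.
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos_raw = cosine_similarity(a, b)
```

Skipping the normalization on the query side would leave the database vectors at unit norm while the queries are not, so the scores would no longer be cosine similarities.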

I would like to access the sub-indexes' metric_type parameter from the ClientIndex object, in order to know whether or not to normalize the search vectors before searching.

Something like this (I've added a minimum reproducible example at the bottom):

import faiss
from faiss.contrib.client_server import ClientIndex

index = ClientIndex(ips_ports) # ips_ports: List[Tuple(machine_host, port)]

if index.metric_type == faiss.METRIC_INNER_PRODUCT: # note: relies on a `get_metric_type()` method added to the `ClientIndex` class
    faiss.normalize_L2(search_vectors)

distances, ids = index.search(search_vectors, 5) 

To expose metric_type on the ClientIndex class, I've added a method (get_metric_type()) that queries the metric_type parameter of each sub-index, validates that they're all the same, and returns the numerical value corresponding to the metric type.


class ClientIndex:
    """manages a set of distance sub-indexes. The sub_indexes search a
    subset of the inverted lists. Searches are merged afterwards
    """

    def __init__(self, machine_ports: List[Tuple[str, int]], v6: bool = False):
        """connect to a series of (host, port) pairs"""
        self.sub_indexes = []
        for machine, port in machine_ports:
            self.sub_indexes.append(Client(machine, port, v6))

        self.ni = len(self.sub_indexes)
        # pool of threads. Each thread manages one sub-index.
        self.pool = ThreadPool(self.ni)
        # test connection...
        self.ntotal = self.get_ntotal()
        self.verbose = False
        self.metric_type = self.get_metric_type()

    ... 

    def get_metric_type(self) -> int:
        """Returns the distance metric of all sub-indexes. Raises an exception if
        not all sub-indexes have the same metric type
        """
        m = list(set(self.pool.map(lambda idx: idx.metric_type, self.sub_indexes)))

        # sub-indexes do not have the same distance metric - this is bad
        if len(m) != 1:
            raise Exception("All sub-indexes must have the same metric_type")

        return m[0]

But neither idx.metric_type nor idx.metric_type() returns an integer as expected:

>>> idx.metric_type 
<function faiss.contrib.rpc.Client.__getattr__.<locals>.<lambda>(*x)>
>>>
>>> idx.metric_type() 
ServerException:   File "/Users/leo/development-seed/similarity-search/similarity-search-deploy/env-similarity-search/lib/python3.8/site-packages/faiss/contrib/rpc.py", line 134, in one_function
    ret = f(*args)
'int' object is not callable
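Both results are consistent with a client whose `__getattr__` turns every attribute name into a remote method call, which is what the `Client.__getattr__.<locals>.<lambda>` repr above suggests. The toy model below reproduces the two symptoms without faiss. `ToyIndex`, `ToyServer`, and `ToyClient` are hypothetical stand-ins, not the actual classes in `faiss.contrib.rpc`, and the real server reports the failure back as a `ServerException` rather than raising it directly.

```python
class ToyIndex:
    """Hypothetical stand-in for an index: metric_type is a plain int."""
    metric_type = 1


class ToyServer:
    """Hypothetical server side: looks a name up on the index and calls it."""

    def __init__(self, index):
        self.index = index

    def call(self, name, args):
        target = getattr(self.index, name)
        # Assumes every requested name is a method; a plain value fails here.
        return target(*args)


class ToyClient:
    """Hypothetical client side: any unknown attribute becomes a remote call."""

    def __init__(self, server):
        self._server = server

    def __getattr__(self, name):
        # Nothing is fetched yet; a function is returned for every name.
        return lambda *args: self._server.call(name, args)


client = ToyClient(ToyServer(ToyIndex()))

# Symptom 1: attribute access yields a function, not the integer.
attribute = client.metric_type

# Symptom 2: calling it makes the server side try to call the int.
try:
    client.metric_type()
except TypeError as exc:
    call_error = str(exc)
```

Under this model, plain data attributes are unreachable through the proxy: only names that resolve to callables on the server side can return a value.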

Platform

macOS Ventura 13.0.1 (22A400)

Faiss version: faiss-cpu v1.7.2

Installed from: pip installed

Faiss compilation options: N/A

Running on:

Interface:

Reproduction instructions

Minimum reproducible example:

server.py

from faiss.contrib.client_server import run_index_server
import faiss

index = faiss.index_factory(128, "IVF4096,Flat")

# run on port 12010
run_index_server(index, 12010, v6=False)

client.py

from faiss.contrib.client_server import ClientIndex
client_index = ClientIndex([('localhost', 12010)])

print(f"sub-index metric type: {client_index.sub_indexes[0].metric_type}")

# Expected outcome: sub-index metric type: 1
# Actual outcome: sub-index metric type: <function Client.__getattr__.<locals>.<lambda> at 0x11a36d1f0>

Note: for the sake of the MRE, I didn't train the index or add vectors to it, but I am getting the same results with a trained and populated index.

Thanks y'all!

mdouze commented 1 year ago

m = list(set(self.pool.map(lambda idx: idx.metric_type, self.sub_indexes)))

Shouldn't it call idx.get_metric_type()?

leothomas commented 1 year ago

I'm not sure there is a get_metric_type() function defined on the index class:

>>> idx.get_metric_type()

ServerException:   File "/Users/leo/development-seed/similarity-search/similarity-search-deploy/env-similarity-search/lib/python3.8/site-packages/faiss/contrib/rpc.py", line 134, in one_function
    ret = f(*args)
local variable 'f' referenced before assignment
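One way to read this exchange: a name such as get_metric_type can only be called remotely if something on the server side defines it. The toy dispatcher below sketches that idea; `ToyIndex`, `ToyDispatcher`, and `ToyDispatcherWithGetter` are hypothetical names, and whether the faiss server class can be extended in the same way is an assumption that has not been verified here.

```python
class ToyIndex:
    """Hypothetical stand-in for an index; metric_type is a plain attribute."""
    metric_type = 1


class ToyDispatcher:
    """Hypothetical server side: resolves a method name, then calls it."""

    def __init__(self, index):
        self.index = index

    def dispatch(self, name, *args):
        method = getattr(self, name, None)
        if method is None:
            # The requested name is not defined anywhere on the server side.
            raise AttributeError(f"unknown method {name}")
        return method(*args)


class ToyDispatcherWithGetter(ToyDispatcher):
    """Same dispatcher, plus an explicit getter for the plain attribute."""

    def get_metric_type(self):
        # Wrap the attribute in a callable so it can be requested by name.
        return int(self.index.metric_type)


# Without the getter, the remote name cannot be resolved.
plain = ToyDispatcher(ToyIndex())
try:
    plain.dispatch("get_metric_type")
except AttributeError as exc:
    missing_error = str(exc)

# With the getter, the same request returns the integer metric type.
extended = ToyDispatcherWithGetter(ToyIndex())
metric = extended.dispatch("get_metric_type")
```

The getter would need to exist on the server side before the client calls it; defining get_metric_type() only on ClientIndex does not make the name resolvable remotely.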