facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Inefficiency of IndexIVFFlatDedup::add_with_ids #1575

Open billyean opened 3 years ago

billyean commented 3 years ago

Summary

We're using IndexIVFFlatDedup::add_with_ids to replace a vector whenever it changes, and the number of changes is large. We intermittently observe what looks like a memory leak. Checking the source (quoted below), add_with_ids calls add_entry, and therefore a resize plus memcpy, for every single entry being added. When a lot of ids are added at once, this becomes a problem. Could all non-duplicate vectors be consolidated before adding, so that add_entries is called only once?

// From IndexIVFFlat.cpp
void IndexIVFFlatDedup::add_with_ids(
           idx_t na, const float* x, const idx_t* xids)
{
    ...
    for (size_t i = 0; i < na; i++) {
        idx_t id = xids ? xids[i] : ntotal + i;
        int64_t list_no = idx [i];

        if (list_no < 0) {
            continue;
        }
        const float* xi = x + i * d;

        // [duplicate search elided: scans the list and sets offset to the
        //  position of a code identical to xi, or -1 if none is found]

        if (offset == -1) { // not found
            invlists->add_entry (list_no, id, (const uint8_t*) xi);
        } else {
            // mark equivalence
            idx_t id2 = invlists->get_single_id (list_no, offset);
            std::pair<idx_t, idx_t> pair (id2, id);
            instances.insert (pair);
            n_dup ++;
        }
        n_add++;
    }
    ...
}
// From InvertedLists.cpp
size_t ArrayInvertedLists::add_entries (
           size_t list_no, size_t n_entry,
           const idx_t* ids_in, const uint8_t *code)
{
    if (n_entry == 0) return 0;
    assert (list_no < nlist);
    size_t o = ids [list_no].size();
    ids [list_no].resize (o + n_entry);
    memcpy (&ids[list_no][o], ids_in, sizeof (ids_in[0]) * n_entry);
    codes [list_no].resize ((o + n_entry) * code_size);
    memcpy (&codes[list_no][o * code_size], code, code_size * n_entry);
    return o;
}
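
To make the consolidation suggested in the summary concrete, here is a rough, hypothetical sketch (not current Faiss code) of a helper that groups already-deduplicated entries by inverted list and calls add_entries once per touched list, so each list is resized and memcpy'd once per batch instead of once per vector. The helper name and its arguments are invented for illustration; the include path also differs between Faiss versions (faiss/InvertedLists.h in older releases, faiss/invlists/InvertedLists.h in newer ones), and on older versions idx_t is spelled faiss::Index::idx_t.

#include <cstdint>
#include <unordered_map>
#include <vector>
#include <faiss/invlists/InvertedLists.h>

// Hypothetical helper: given per-entry list assignments, ids, and encoded
// vectors that already passed the duplicate check, group them by list and
// append each group with a single add_entries call.
void add_entries_batched(faiss::InvertedLists* invlists,
                         size_t n, const int64_t* list_nos,
                         const faiss::idx_t* ids, const uint8_t* codes,
                         size_t code_size) {
    struct Pending {
        std::vector<faiss::idx_t> ids;
        std::vector<uint8_t> codes;    // code_size bytes per entry
    };
    std::unordered_map<int64_t, Pending> pending;

    for (size_t i = 0; i < n; i++) {
        if (list_nos[i] < 0) continue;         // entry assigned to no list
        Pending& p = pending[list_nos[i]];
        p.ids.push_back(ids[i]);
        p.codes.insert(p.codes.end(),
                       codes + i * code_size, codes + (i + 1) * code_size);
    }

    // One resize + memcpy per touched list, rather than one per entry.
    for (auto& kv : pending) {
        invlists->add_entries(kv.first, kv.second.ids.size(),
                              kv.second.ids.data(), kv.second.codes.data());
    }
}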

Platform

OS: CentOS 7

Faiss version:

Installed from:

Faiss compilation options:

Running on:

Interface:

Reproduction instructions

mdouze commented 3 years ago

Did you identify a memory leak then?

Indeed, it could be that this is not very efficient. If you have large batches to replace, it could be faster to do a remove_ids and then add_with_ids.
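
For reference, the batched replace pattern suggested here could look roughly like this in C++. remove_ids, IDSelectorBatch, and add_with_ids are existing Faiss APIs, but the header paths and the location of the idx_t typedef vary by version, and the wrapper function itself is only an illustrative sketch.

#include <faiss/IndexIVFFlat.h>
#include <faiss/impl/AuxIndexStructures.h>   // IDSelectorBatch

// Illustrative sketch: replace a batch of changed vectors by removing the
// stale entries in one pass, then re-adding the new values in one call.
void replace_batch(faiss::IndexIVFFlatDedup& index,
                   size_t n, const faiss::idx_t* ids, const float* x) {
    faiss::IDSelectorBatch sel(n, ids);   // selects exactly these ids
    index.remove_ids(sel);                // drop the old entries
    index.add_with_ids(n, x, ids);        // add the updated vectors in bulk
}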

billyean commented 3 years ago

In our case, we do see that after a couple of days of running, the application eats up all of our memory. We did implement this by calling remove_ids before add_with_ids, but we still observe memory rising after a long run.