idariis commented 2 years ago

Summary

When trying to train a Faiss index, I get a segmentation fault:

zsh: segmentation fault  poetry run python examples/sandbox.py

Platform

OS:

Faiss version:

Installed from:

Faiss compilation options:

Running on:

[x] CPU
[ ] GPU

Interface:

[ ] C++
[x] Python

Reproduction instructions

import os

import datasets
import faiss
import numpy as np
import rich
import torch
from transformers import AutoModel
from transformers import AutoTokenizer

os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
bert_id = "google/bert_uncased_L-2_H-128_A-2"

# load a text dataset
dataset = datasets.load_dataset("ptb_text_only", split="train")

# tokenize and format
tokenizer = AutoTokenizer.from_pretrained(bert_id)
dataset = dataset.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length"),
    batched=True,
)
dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask"],
)
small_dataset = dataset.select(range(1000))
rich.print(small_dataset)

# load the BERT model
model = AutoModel.from_pretrained(bert_id)
model.eval()

# compute vector representations of documents using BERT
with torch.no_grad():
    batch = small_dataset[:]
    batch = tokenizer.pad(batch)
    document_representations = model(**batch).last_hidden_state
    # keep only the CLS vector for each document
    document_representations = document_representations[:, 0, :]

    # convert to contiguous numpy array
    document_representations = document_representations.numpy()
    document_representations = np.ascontiguousarray(document_representations)

rich.print(f"Document representations: {document_representations.shape}")

# faiss parameters + init
ndims = document_representations.shape[-1]
nlist = 3  # number of clusters
m = 8
quantiser = faiss.IndexFlatL2(ndims)
nbits = 8  # bits per PQ code; the fifth positional argument is nbits, not the metric
index = faiss.IndexIVFPQ(quantiser, ndims, nlist, m, nbits, faiss.METRIC_L2)

# attempting to add the vectors to the index
rich.print(f"Index is trained: {index.is_trained}")
index.train(document_representations) # <- this line throws segmentation fault
rich.print(f"Index is trained: {index.is_trained}")
index.add(document_representations)
rich.print(f"Total number of indices: {index.ntotal}")

k = 3
query = tokenizer("he is afraid of getting lung cancer", return_tensors="pt")
with torch.no_grad():
    xq = model(input_ids=query["input_ids"], attention_mask=query["attention_mask"])
xq = xq.last_hidden_state[:, 0, :].numpy()

# Perform search on index
distances, indices = index.search(xq, k)

klgraham commented 2 years ago

I'm getting the same problem. Hugging Face Datasets 1.6.2 (not the latest release) pulls in faiss-cpu 1.7.1.post2. For now, I'm manually installing faiss-cpu 1.6.5 as a workaround.
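Since the Platform fields in the report are empty and the installed faiss version matters here, a small script along the following lines could collect that information. This is only a sketch: the helper names `installed_version` and `environment_report` are hypothetical, and the distribution names queried are assumptions (a conda-installed faiss may not register pip metadata under `faiss-cpu`, in which case it shows as "not installed").

```python
import platform
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names=("faiss-cpu", "faiss-gpu", "datasets", "torch", "numpy")):
    """Build a plain-text summary of the OS, Python version and package versions."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in dist_names:
        version = installed_version(name)
        lines.append(f"{name}: {version or 'not installed'}")
    return "\n".join(lines)


print(environment_report())
```

Pasting the printed output into the Platform section makes it easier to compare the failing and working setups mentioned in this thread.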

mdouze commented 2 years ago

Could you save the document_representations array to disk and write a small script that loads it and trains an index on it?

h-vetinari commented 2 years ago

If you have the time, you could try installing everything from conda-forge, I'd be interested to know if this segfault appears there as well:

conda install -c conda-forge faiss-gpu datasets

MihailMihaylov97 commented 2 years ago

Any updates on the issue?

I have the same issue on macOS Monterey 12.0.1 with Python 3.9.7.

I installed faiss-cpu only via conda install -c pytorch faiss-cpu

I tried with a smaller number of vectors, but even the tutorial at https://github.com/facebookresearch/faiss/blob/main/tutorial/python/1-Flat.py fails at the index.search(xb[:5], k) step with a segmentation fault.

idariis commented 2 years ago

We have used the solution proposed by @klgraham and it works fine.

davidrs commented 11 months ago

If you are reading this issue in ~2023, faiss-cpu 1.6.5 can no longer be installed in some environments. In those environments, we have found that 1.7.0 avoids this issue.

facebookresearch / faiss

zsh: segmentation fault when running faiss on CPU #2099
