facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Possible class conflict between faiss-cpu and pymupdf #3689

Open hairuoguo opened 1 month ago

hairuoguo commented 1 month ago

Summary

Hello,

I am currently using the ColBERT model for a work project, which uses faiss. We also had pymupdf installed in the same conda environment, because we are working with scanned documents as a data source.

ColBERT calls faiss's kmeans.train(), which raised an AssertionError on line 109 of vector_to_array.py (assert classname.endswith('Vector')). When I inspected the input to that function, it was a pymupdf proxy object rather than an instance of one of the expected "[dtype]Vector" classes defined in faiss.
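To see which package an unexpected proxy object comes from, a small inspection helper along these lines may be useful. This is only a sketch: `describe_proxy` is a hypothetical name, not part of faiss or pymupdf, and the name check simply mirrors the assertion quoted above.

```python
def describe_proxy(obj):
    """Report where an object's class comes from.

    Hypothetical debugging helper (not part of faiss or pymupdf).
    The last field mirrors the check quoted in the report:
    classname.endswith('Vector').
    """
    cls = type(obj)
    return {
        "module": cls.__module__,       # package that defines the class
        "classname": cls.__name__,      # name the assertion inspects
        "passes_vector_check": cls.__name__.endswith("Vector"),
    }
```

Calling it on the object passed to the failing function (for example from a debugger) should show whether the class was defined by faiss or by another SWIG-wrapped package.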

This error disappeared after uninstalling pymupdf.

Platform

OS: Ubuntu 20.04.5 LTS (in docker container)

Faiss version: faiss-cpu 1.8.0.post1

Installed from: pip

Faiss compilation options: default flags

Running on:

Interface:

Reproduction instructions

1. Install faiss-cpu and pymupdf in a conda environment using pip.
2. Import fitz (pymupdf) and attempt to train faiss's Kmeans class.

OR

1. Install ColBERT from the ColBERT repo using its instructions.
2. Install pymupdf.
3. Import fitz (pymupdf) in code that runs ColBERT's Indexer class.
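The first variant of the reproduction might look roughly like the script below. It is a sketch under assumptions: the dimensions, cluster count, and random data are arbitrary, the import order (fitz before faiss) is a guess, and whether the AssertionError actually appears may depend on the installed versions. `missing_packages` and `reproduce` are illustrative names.

```python
import importlib.util


def missing_packages(names=("faiss", "fitz", "numpy")):
    """Return the names from `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def reproduce(d=32, k=4, n=1000):
    """Import fitz, then train a faiss Kmeans on random data.

    d, k and n are arbitrary values chosen for illustration.
    """
    import fitz  # noqa: F401  (pymupdf; imported only to load its SWIG module)
    import faiss
    import numpy as np

    x = np.random.rand(n, d).astype("float32")
    kmeans = faiss.Kmeans(d, k)
    kmeans.train(x)  # the report says the AssertionError surfaces in this call
    return kmeans.centroids.shape


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("cannot run reproduction, missing packages:", missing)
    else:
        try:
            print("trained without error, centroids shape:", reproduce())
        except AssertionError as exc:
            print("reproduced AssertionError:", repr(exc))
```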

hairuoguo commented 1 month ago

This may also be a potential security vulnerability depending on what is actually happening under the hood. For example, I could modify the pymupdf vector class to include malicious code in the data() function, and the pymupdf proxy class would inadvertently be used, allowing for the code to be run whenever the .data() method is called.

mdouze commented 1 month ago

This may be because both Faiss and pymupdf are wrapped with SWIG. Let me check if there is a workaround for this case.

mdouze commented 1 month ago

I think we could use SWIG_TYPE_TABLE to make a unique type table for Faiss (https://www.swig.org/Doc4.2/Modules.html#Modules_nn2). It seems that it just makes sure the table holding type names is distinct for Faiss.

junjieqi commented 1 month ago

@hairuoguo could you try installing Faiss through conda (instructions: https://github.com/facebookresearch/faiss/blob/main/INSTALL.md)? Thanks.

hairuoguo commented 1 month ago

will try this out when I have the time (next week or so), thanks

Luffy241 commented 6 days ago

@hairuoguo I faced the same issue when using fitz, but there is no issue when I use PDFplumber. You can try PDFplumber; it might work for you. However, I need to use fitz. Is there any way to do this without using conda?
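One possible workaround that does not depend on conda is to keep fitz out of the process that uses faiss, for example by running the PDF extraction in a child interpreter. The sketch below is an untested assumption rather than a confirmed fix: it relies on the conflict only arising when both SWIG-wrapped modules are loaded in the same process, and `run_isolated` and `extract_pdf_text` are hypothetical helper names.

```python
import subprocess
import sys

# Code executed in a separate interpreter, so fitz is never imported
# into the process that uses faiss.
EXTRACT_SNIPPET = """
import sys
import fitz  # pymupdf

doc = fitz.open(sys.argv[1])
for page in doc:
    sys.stdout.write(page.get_text())
"""


def run_isolated(snippet, *args):
    """Run `snippet` in a fresh Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout


def extract_pdf_text(path):
    """Extract text from a PDF without importing fitz in this process."""
    return run_isolated(EXTRACT_SNIPPET, path)
```

The parent process can then import faiss as usual and call `extract_pdf_text("some.pdf")` for the document text; the cost is one extra interpreter start per call.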