Open hairuoguo opened 1 month ago
This may also be a security vulnerability, depending on what is actually happening under the hood. For example, I could modify the pymupdf vector class to include malicious code in the data() function, and the pymupdf proxy class would inadvertently be used, so that code would run whenever the .data() method is called.
This may be because both Faiss and pymupdf are wrapped with SWIG. Let me check if there is a workaround for this case.
I think we could use SWIG_TYPE_TABLE to make a unique type table for Faiss. https://www.swig.org/Doc4.2/Modules.html#Modules_nn2 It seems that it just makes sure the table holding type names is distinct for Faiss.
@hairuoguo could you try installing Faiss through conda? The instructions are here: https://github.com/facebookresearch/faiss/blob/main/INSTALL.md . Thanks
will try this out when I have the time (next week or so), thanks
@hairuoguo I faced the same issue while using fitz, but with PDFplumber there is no issue. You can try PDFplumber; it might work. However, I need to do this with fitz. Is there any way to do it without using conda?
Summary
Hello,
I am currently using the ColBERT model for a work project, which uses faiss. We had pymupdf installed in the same conda environment, as we are trying to work with scanned documents as a datasource.
ColBERT calls faiss's kmeans.train(), which led to an AssertionError on line 109 in vector_to_array.py (assert classname.endswith('Vector')). When I inspected the input to that function, it was a pymupdf proxy object rather than an instance of one of the expected "[dtype]Vector" classes defined in faiss.
This error disappeared after uninstalling pymupdf.
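To make the failing check concrete, here is a minimal sketch of the kind of inspection described above. It only mirrors the quoted assertion (classname.endswith('Vector')) and is not faiss's actual code; the helper describe_swig_object and the two stand-in classes are hypothetical names used for illustration.

```python
def describe_swig_object(obj):
    """Return (module, class name, passes_check) for any object.

    passes_check mirrors the assertion quoted in the report:
    the class name must end with 'Vector'.
    """
    cls = type(obj)
    classname = cls.__name__
    return cls.__module__, classname, classname.endswith("Vector")


# Stand-in classes for illustration only; neither is a real
# faiss or pymupdf type.
class DemoVector:
    pass


class DemoProxy:
    pass


if __name__ == "__main__":
    print(describe_swig_object(DemoVector()))
    print(describe_swig_object(DemoProxy()))
```

Calling a helper like this on the object that reaches the assertion would show which module its class comes from, which is how a foreign proxy type can be told apart from a faiss one.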
Platform
OS: Ubuntu 20.04.5 LTS (in docker container)
Faiss version: faiss-cpu 1.8.0.post1
Installed from: pip
Faiss compilation options: default flags
Running on:
Interface:
Reproduction instructions
Install faiss-cpu and pymupdf in a conda environment using pip. Import fitz (pymupdf) and attempt to train the faiss kmeans class.
OR
Install ColBERT from the ColBERT repo using its instructions. Install pymupdf. Import fitz (pymupdf) in code that runs ColBERT's Indexer class.
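The first set of steps can be sketched as a short script. This is a hedged sketch, not the exact code from the report: the import order (fitz before faiss), the array sizes, and the function name try_reproduce are assumptions, and the script returns None instead of failing when faiss, pymupdf, or numpy is not installed.

```python
import importlib.util


def try_reproduce(d=16, k=4, n=256):
    """Try to train faiss k-means with fitz imported.

    Returns None if a required package is missing, "ok" if training
    succeeds, or the assertion text if an AssertionError is raised.
    """
    required = ("faiss", "fitz", "numpy")
    if any(importlib.util.find_spec(name) is None for name in required):
        return None

    import numpy as np
    import fitz  # noqa: F401  (pymupdf; import order is an assumption)
    import faiss

    # Random float32 training data of shape (n, d).
    x = np.random.default_rng(0).random((n, d), dtype=np.float32)
    kmeans = faiss.Kmeans(d, k, niter=5)
    try:
        kmeans.train(x)
    except AssertionError as exc:
        return f"AssertionError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(try_reproduce())
```

Running it once with pymupdf installed and once after uninstalling it would show whether the error depends on pymupdf being present in the environment.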