facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

If the path contains Unicode characters, can not read_index and write_index #3073

Open huanggefan opened 9 months ago

huanggefan commented 9 months ago

Summary

faiss.read_index and faiss.write_index fail when the path contains Unicode (non-ASCII) characters.

Platform

OS: Windows 11

Python: Python 3.11.4

Faiss version: 1.7.4

Installed from: pip install faiss-cpu

Running on: CPU

Interface: Python

Reproduction instructions

Here is the code:

import pathlib

import faiss
import numpy
import torch

class FlatL2Index():
    def __init__(self, root: pathlib.Path, dim: int = 1024):
        self.root = root  # path of the index file; used by load() and dump()
        self.dim = dim

        param = "Flat"
        measure = faiss.METRIC_L2

        self.faiss_index = faiss.index_factory(dim, param, measure)

    def load(self):
        f = str(self.root)
        self.faiss_index = faiss.read_index(f)

    def dump(self):
        f = str(self.root)
        faiss.write_index(self.faiss_index, f)

    def train(self, dataset: numpy.ndarray | torch.Tensor = None):
        if dataset is None:
            # No data supplied: train on random vectors instead.
            train_points = max(self.dim * 10, 39936)
            dataset = numpy.random.random((train_points, self.dim))

        if isinstance(dataset, torch.Tensor):
            dataset = dataset.cpu().detach().numpy()

        dataset = dataset.astype(numpy.float32)

        self.faiss_index.train(dataset)

    def append(self, iv: numpy.ndarray | torch.Tensor):
        if isinstance(iv, torch.Tensor):
            iv = iv.cpu().detach().numpy()

        iv = iv.astype(numpy.float32)

        self.faiss_index.add(iv)

    def search(self, query_iv: numpy.ndarray | torch.Tensor, top_k: int = 10):
        if isinstance(query_iv, torch.Tensor):
            query_iv = query_iv.cpu().detach().numpy()

        query_iv = query_iv.astype(numpy.float32)

        return self.faiss_index.search(query_iv, top_k)

if __name__ == "__main__":
    dim = 768
    root = pathlib.Path("Z:\\") / "中文" / "flatL2.index"

    index = FlatL2Index(root, dim)

    # index.load()

    print(type(index.faiss_index))

    train_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    test_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    query_iv: numpy.ndarray = numpy.random.random((1, dim)).astype(numpy.float32)

    index.train(train_dataset)
    index.append(test_dataset)
    index.search(query_iv)

    index.dump()

Calling faiss.read_index or faiss.write_index with a path that contains Unicode characters raises the following error:

RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *)
    at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98: 
        Error: 'f' failed: could not open Z:\中文\flatL2.index 
    for writing: No such file or directory

mdouze commented 9 months ago

This is because there is no unambiguous way of converting unicode to char * in the C++ code.
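
The ambiguity can be illustrated in plain Python. The sketch below assumes, and the thread does not confirm, that the Python wrapper hands the path to C++ as UTF-8 bytes while the Windows C runtime interprets a char * path in the active ANSI code page. GBK is used here only as an example of such a code page.

```python
# Same path, two different byte sequences depending on the encoding.
path = "Z:\\中文\\flatL2.index"

utf8_bytes = path.encode("utf-8")  # assumed: what the wrapper passes to C++
gbk_bytes = path.encode("gbk")     # assumed: what a GBK-locale runtime expects

# The byte strings differ, so a char * alone does not identify the file.
print(utf8_bytes)
print(gbk_bytes)

# Interpreting the UTF-8 bytes as GBK yields a different (garbled) name,
# which would explain a "No such file or directory" error.
misread = utf8_bytes.decode("gbk", errors="replace")
print(misread == path)
```

Under these assumptions, any path restricted to ASCII avoids the problem, because ASCII characters encode to the same bytes in both encodings.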

sulmz commented 8 months ago

How can this problem be solved? Does anyone have an idea?

soonbee commented 4 months ago

I also encountered the same issue. While it's not a fundamental solution, I resolved it by saving the index file to a temporary path and then copying the file. Below is the code example.

import os
import shutil
import tempfile
import faiss
import numpy as np
from pathlib import Path
from uuid import uuid4

def get_temp_dir():
    # windows
    if os.name == "nt":
        return "/Temp"
    # linux, macos
    return "/tmp"

features = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
]

d = len(features[0]) # dimension
index = faiss.IndexFlatL2(d)

for ft in features:
    parsed = np.array([ft], dtype=np.float32)
    index.add(parsed)

dest_path = "/path/to/save/faiss.idx"
temp_dir = get_temp_dir()
if not Path(temp_dir).is_dir():
    Path(temp_dir).mkdir()

with tempfile.TemporaryDirectory(dir=temp_dir) as p:
    temp_file_path = Path(p) / str(uuid4())
    faiss.write_index(index, str(temp_file_path))
    shutil.move(str(temp_file_path), dest_path)

Because the default temp directory can include the OS user name, I specified a separate temp_dir: if the user name contains Unicode characters, the same problem occurs. If the user name is guaranteed not to contain Unicode characters, the dir argument to tempfile.TemporaryDirectory can be omitted.
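
Rather than always hard-coding a separate directory, the default temp directory could be checked first. This is a minimal sketch, not part of the original workaround: pick_ascii_temp_dir and its parameters are hypothetical names, and it treats "ASCII-only" as the safe condition.

```python
import tempfile
from pathlib import Path
from typing import Optional


def pick_ascii_temp_dir(fallback: str, default: Optional[str] = None) -> Path:
    """Return the default temp dir if its path is ASCII-only, else the fallback.

    `default` exists only to make the function easy to test; when omitted,
    the system temp directory reported by tempfile.gettempdir() is used.
    """
    candidate = default if default is not None else tempfile.gettempdir()
    if candidate.isascii():
        return Path(candidate)
    # The default path contains non-ASCII characters (for example, a user
    # name), so fall back to the caller-supplied directory, creating it
    # if necessary.
    fallback_path = Path(fallback)
    fallback_path.mkdir(parents=True, exist_ok=True)
    return fallback_path
```

The returned path can then be passed as the dir argument to tempfile.TemporaryDirectory. The fallback itself must be an ASCII-only path for this to help.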

hansblafoo commented 1 week ago

Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.

Algabeno commented 1 week ago

I ran into the same problem. My use case requires Chinese paths. Has anyone solved this?

soonbee commented 1 week ago

> Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.

The same approach works for reading. Copy the index to be read to a temporary path whose directory and file name contain no Unicode characters, then read that copy with faiss.read_index.
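
The read-side steps described above can be sketched as follows. The helper name read_via_ascii_copy and its parameters are hypothetical. The reader is passed in as a callable so the sketch does not depend on faiss being installed; it assumes the reader loads the file fully into memory before returning, since the temporary copy is deleted afterwards.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_via_ascii_copy(
    src: Path,
    reader: Callable[[str], T],
    temp_root: Optional[str] = None,
) -> T:
    """Copy `src` to an ASCII-named temporary file and call `reader` on it.

    `temp_root` should itself be an ASCII-only directory; when None, the
    system default temp directory is used.
    """
    with tempfile.TemporaryDirectory(dir=temp_root) as tmp:
        tmp_file = Path(tmp) / "index.tmp"  # fixed ASCII file name
        shutil.copyfile(src, tmp_file)      # Python handles the Unicode source path
        # The reader runs before the temporary directory is cleaned up.
        return reader(str(tmp_file))
```

With faiss, this would be called as read_via_ascii_copy(root, faiss.read_index, temp_root=temp_dir), where root is the Unicode path and temp_dir is the ASCII-only directory from the earlier example. The cost is one extra copy of the index file per read.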