[UPD] This repository is no longer actively maintained. We encourage and appreciate all efforts to improve SearchAnything.
"SearchAnything" is a local semantic search engine, powered by various AI models, that lets you search sentences and images by their semantic meaning rather than by exact keywords.
Check out our demo video to see how it works. [Zhihu blog] [Twitter]
https://github.com/Immortalise/SearchAnything/assets/31989262/76ab3ad4-1f57-453f-acfc-dea7a310ae1d
https://github.com/Immortalise/SearchAnything/assets/31989262/8a7e0234-a35f-4cd8-936b-ea1ae9e0089a
First, clone our repository: git clone git@github.com:Immortalise/SearchAnything.git
We recommend using a Conda environment to manage your Python dependencies, since it keeps the project's packages isolated from the rest of your system.
Use the following commands to set up the environment for "SearchAnything":
conda env create -f env.yaml
conda activate anything
Please note that on macOS, executing conda env create -f env.yaml
may fail due to the CUDA packages and some other platform-specific packages. We are currently addressing this issue and working on improving macOS compatibility.
Start the application by running python anything.py in the console.
Upon running, you will see the following instructions:
[nltk_data] Downloading package punkt to /xxx/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Adding text embedding model
Adding image embedding model
SearchAnything v1.0
Type 'exit' to exit.
Type 'insert' to parse file.
Type 'search' to search file.
Type 'delete' to delete file.
Instruction:
Type insert, followed by the file path. The path can be either a single file or a directory; if a directory is specified, all supported files under it will be parsed and saved to the database.
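As a sketch of how an insert on a directory can work (the actual parser and the set of supported file types live in the repository's code; the extension list below is purely illustrative):

```python
import os

# Illustrative extension list; the real set of supported types is
# defined by the repository's parsers.
SUPPORTED_EXTS = {".txt", ".md", ".pdf", ".png", ".jpg"}

def collect_files(path):
    """Return all supported files under `path` (a file or a directory)."""
    if os.path.isfile(path):
        return [path]
    matched = []
    for root, _dirs, names in os.walk(path):
        for name in names:
            if os.path.splitext(name)[1].lower() in SUPPORTED_EXTS:
                matched.append(os.path.join(root, name))
    return matched
```

Each collected file would then be parsed and embedded before being stored in the database.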
When searching files, you can also use a more user-friendly web interface by running:
streamlit run app.py
In this local web interface, you can search files based on their semantic meanings.
We currently support the following file types:
"SearchAnything" primarily involves two steps:
Given text or images as input, each item is first encoded into a vector (an embedding). The AI models used for semantic search come from the sentence-transformers repository:
Semantic search for text: all-mpnet-base-v2
Semantic search for images: clip-ViT-B-32
After generating the embedding for each image and text, we save the embedding along with the file path into a database.
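A minimal sketch of that storage step, using SQLite with raw float32 bytes (the actual schema used by the repository may differ):

```python
import sqlite3
import numpy as np

def init_db(path=":memory:"):
    """Create (or open) a database with a simple path -> vector table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "file_path TEXT PRIMARY KEY, vector BLOB)"
    )
    return conn

def save_embedding(conn, file_path, emb):
    # Serialize the float32 vector to raw bytes for storage.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
        (file_path, np.asarray(emb, dtype=np.float32).tobytes()),
    )
    conn.commit()

def load_embeddings(conn):
    """Return all stored file paths and their vectors as a matrix."""
    rows = conn.execute("SELECT file_path, vector FROM embeddings").fetchall()
    paths = [r[0] for r in rows]
    vecs = np.array([np.frombuffer(r[1], dtype=np.float32) for r in rows])
    return paths, vecs
```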
When given a query and a search type, we encode the query into an embedding $e_q$ and retrieve all stored embeddings $[e_1, e_2, ..., e_n]$ from the database. We then compute the cosine similarity between $e_q$ and each of $[e_1, e_2, ..., e_n]$, sort the results in descending order of similarity, and return them.
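The retrieval step above can be sketched with plain NumPy (the embeddings here stand in for the vectors loaded from the database):

```python
import numpy as np

def search(query_emb, db_embs, db_paths, top_k=3):
    """Rank stored embeddings by cosine similarity to the query."""
    q = np.asarray(query_emb, dtype=float)
    q = q / np.linalg.norm(q)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per stored vector
    order = np.argsort(-sims)[:top_k]  # indices in descending similarity
    return [(db_paths[i], float(sims[i])) for i in order]
```

Normalizing both sides first turns cosine similarity into a single matrix-vector product, which keeps the search fast even with many stored files.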
SearchAnything downloads state-of-the-art AI models and runs them entirely locally, so there's no need to worry about your private data being compromised. Text semantic search requires only about 400 MB of memory, while image semantic search requires around 4 GB. We will add more models in the future so that users with different memory budgets can choose a suitable one.
We're eager to hear your valuable feedback and constructive suggestions!
Here are some features we plan to implement in the future:
Note the recent GitHub project Semantra, which is similar to SearchAnything and also aims to facilitate semantic search over documents. SearchAnything differs from Semantra in the following ways: