w2vgrep is a command-line tool that performs semantic searches on text input using word embeddings. It finds matches that are semantically similar to the query, going beyond simple string matching. It supports multiple languages, and the experience is designed to be similar to grep.
Search for words similar to "death" in Hemingway's "The Old Man and the Sea" with context and line numbers:
curl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' \
| w2vgrep -C 2 -n --threshold=0.55 death
This command:
- Fetches the text of "The Old Man and the Sea" from Project Gutenberg Canada
- Pipes the text to w2vgrep
- Searches for words semantically similar to "death"
- Uses a similarity threshold of 0.55 (--threshold=0.55)
- Displays 2 lines of context before and after each match (-C 2)
- Shows line numbers (-n)
The output will show matches with their similarity scores, highlighted words, context, and line numbers.
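To make the threshold concrete, here is a minimal sketch of threshold-based matching. It assumes cosine similarity as the measure and uses made-up 3-dimensional vectors; the function names (`cosine`, `matchLine`) are illustrative and are not taken from the w2vgrep source.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// cosine returns the cosine similarity of two equal-length vectors.
// It returns 0 if the lengths differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// matchLine returns the words in line whose similarity to the query
// vector is at least threshold. Words missing from the model are skipped.
func matchLine(line string, query []float32, model map[string][]float32, threshold float64) []string {
	var hits []string
	for _, w := range strings.Fields(line) {
		if v, ok := model[w]; ok && cosine(query, v) >= threshold {
			hits = append(hits, w)
		}
	}
	return hits
}

func main() {
	// Toy vectors, invented for illustration only.
	model := map[string][]float32{
		"death": {1, 0, 0},
		"dying": {0.8, 0.6, 0},
		"fish":  {0, 0, 1},
	}
	// "dying" scores about 0.8 against "death"; "fish" scores 0.
	fmt.Println(matchLine("the fish was dying", model["death"], model, 0.55))
}
```

Raising the threshold narrows the matches toward near-synonyms; lowering it admits more loosely related words.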
Two files are absolutely needed: the w2vgrep binary and a word embedding model file.
Using install script:
# clone
git clone https://github.com/arunsupe/semantic-grep.git
cd semantic-grep
# run install:
# compiles using the local go compiler, installs in user/bin,
# downloads the model to $HOME/.config/semantic-grep
# makes config.json
bash install.sh
Binary:
From source (linux/osx):
# clone
git clone https://github.com/arunsupe/semantic-grep.git
cd semantic-grep
# build
go build -o w2vgrep
# download a word2vec model using this helper script (see "Word Embedding Model" below)
bash download-model.sh
Basic usage:
./w2vgrep [options] <query> [file]
If no file is specified, w2vgrep reads from standard input.
-m, --model_path= Path to the Word2Vec model file. Overrides config file
-t, --threshold= Similarity threshold for matching (default: 0.7)
-A, --before-context= Number of lines before matching line
-B, --after-context= Number of lines after matching line
-C, --context= Number of lines before and after matching line
-n, --line-number Print line numbers
-i, --ignore-case Ignore case.
-o, --only-matching Output only matching words
-l, --only-lines Output only matched lines without similarity scores
-f, --file= Match patterns from file, one pattern per line. Like grep -f.
w2vgrep can be configured using a JSON file. By default, it looks for config.json in the current directory, "$HOME/.config/semantic-grep/config.json" and "/etc/semantic-grep/config.json".
w2vgrep requires a word embedding model in binary format. The default model loader uses the model file's extension to determine the type (.bin, .8bit.int). A few compatible model files are provided in this repo (models/). Download one of the .bin files from the models/ directory and update the path in config.json.
Note: git clone will not download the large binary model files unless git-lfs is installed on your machine. If you do not want to install git-lfs, manually download the model .bin file and place it in the correct folder.
Facebook's fasttext group has published word vectors in 157 languages, an amazing resource. I would like to host these files on my GitHub account, but alas, they are too big and too expensive ($$$) to host. Therefore, I have provided a small Go program, fasttext-to-bin, that can make w2vgrep-compatible binary models from them. (Note: use the text files with the ".vec.gz" extension, not the binary ".bin.gz" files.)
# e.g., for a French model:
curl -s 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz' | gunzip -c | ./fasttext-to-bin -input - -output models/fasttext/cc.fr.300.bin
# use it like so:
# curl -s 'https://www.gutenberg.org/cache/epub/17989/pg17989.txt' \
# | w2vgrep -C 2 -n -t 0.55 \
# --model_path models/fasttext/cc.fr.300.bin 'château'
Alternatively, you can use pre-trained models (like Google's Word2Vec) or train your own using tools like gensim. Note, though, that there does not seem to be a standardized binary format (Google's differs from Facebook's fasttext and from gensim's default save()). For w2vgrep, because loading the large model efficiently is key for performance, I have elected to keep the simplest format.
To help troubleshoot the model, I added synonym-finder.go to ./model_processing_utils/. This program finds all words in the model whose similarity to the query word is at or above a given threshold.
# build
cd model_processing_utils
go build synonym-finder.go
# run
./synonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.6 合理性
# Output
Words similar to '合理性' with similarity >= 0.60:
科学性 0.6304
合理性 1.0000
正当性 0.6018
公允性 0.6152
不合理性 0.6094
合法性 0.6219
有效性 0.6374
必要性 0.6499
The model files are large (gigabytes). Each word is typically represented by a 300-dimensional vector of 32-bit floating point values. Reducing dimensionality to 100 or 150 dimensions can produce smaller, faster, more memory-efficient models with minimal loss of accuracy (possibly even better accuracy). The program in model_processing_utils/reduce-model-size reduces model dimensions and can be used to shrink any word2vec binary model used by w2vgrep. Use it like so:
# build
cd model_processing_utils/reduce-model-size
go build .
# run on large GoogleNews-vectors-negative300-SLIM.bin model (346MB) to make smaller
# GoogleNews-vectors-negative100-SLIM.bin model (117MB)
./reduce-pca -input ../../models/googlenews-slim/GoogleNews-vectors-negative300-SLIM.bin -output ../../models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin
# use this smaller model in w2vgrep like so
curl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' \
| bin/w2vgrep.linux.amd64 -n -t 0.5 -m models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin death
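As a rough check on why fewer dimensions shrink the file, the raw vector storage is approximately vocabulary size × dimensions × 4 bytes. The sketch below uses a hypothetical vocabulary of 300,000 words (a round number for illustration, not the actual vocabulary size of any model in this repo) and ignores the header and the word strings.

```go
package main

import "fmt"

// vectorBytes estimates the bytes needed for the raw vectors alone:
// one 32-bit (4-byte) float per dimension per word.
func vectorBytes(vocab, dim int) int64 {
	return int64(vocab) * int64(dim) * 4
}

func main() {
	const vocab = 300_000 // hypothetical vocabulary size
	for _, dim := range []int{300, 150, 100} {
		mb := float64(vectorBytes(vocab, dim)) / 1e6
		fmt.Printf("%d dims: %.0f MB\n", dim, mb)
	}
	// Prints:
	// 300 dims: 360 MB
	// 150 dims: 180 MB
	// 100 dims: 120 MB
}
```

Going from 300 to 100 dimensions cuts vector storage to a third, which is in line with the 346MB to 117MB reduction quoted above.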
Different models define "similarity" differently. However, for practical purposes, they seem equivalent enough.
Contributions are welcome! Please feel free to submit a Pull Request.
The code in this project is licensed under the MIT License.
go-flags package:
The go-flags package, used by the code in this project, is distributed under the BSD-3-Clause license. See the license information at https://github.com/jessevdk/go-flags.
Word2Vec Model:
This project uses a mirrored version of the word2vec-slim model, which is stored in the models/googlenews-slim directory. This model is distributed under the Apache License 2.0. For more information about the model, its original authors, and the license, please see the models/googlenews-slim/ATTRIBUTION.md file.
GloVe word vectors:
This project uses a processed version of the GloVe word vectors, which is stored in the models/glove directory. This work is distributed under the Public Domain Dedication and License v1.0. For more information about the model, its original authors, and the license, please see the models/glove/ATTRIBUTION.md file.
Fasttext word vectors:
This project uses a processed version of the fasttext word vectors, which is stored in the models/fasttext directory. This work is distributed under the Creative Commons Attribution-Share-Alike License 3.0. For more information about the model, its original authors, and the license, please see the models/fasttext/ATTRIBUTION.md file.