arunsupe / semantic-grep

grep for words with similar meaning to the query
MIT License

w2vgrep - Semantic Grep

w2vgrep is a command-line tool that performs semantic searches on text input using word embeddings. It finds words that are semantically similar to the query, going beyond simple string matching. It supports multiple languages, and the experience is designed to be similar to grep.

Example Usage

Search for words similar to "death" in Hemingway's "The Old Man and the Sea" with context and line numbers:

curl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' \
    | w2vgrep -C 2 -n --threshold=0.55 death

This command:

- Fetches the text of "The Old Man and the Sea" from Project Gutenberg Canada
- Pipes the text to w2vgrep
- Searches for words semantically similar to "death"
- Uses a similarity threshold of 0.55 (--threshold=0.55)
- Displays 2 lines of context before and after each match (-C 2)
- Shows line numbers (-n)

The output will show matches with their similarity scores, highlighted words, context, and line numbers.
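The matching step described above can be sketched roughly as follows. This is a minimal illustration, not w2vgrep's actual code: it assumes that similarity means cosine similarity between word vectors (a common choice for word2vec-style models, but not stated here), and the function names `cosine` and `matchLine` and the three-dimensional toy vectors are invented for the example.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// cosine returns the cosine similarity of two equal-length vectors.
// It returns 0 if the lengths differ, or if either vector is empty
// or has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// matchLine returns the words of line whose similarity to query meets
// the threshold. Words that are missing from the model are skipped.
func matchLine(model map[string][]float32, query, line string, threshold float64) []string {
	qv, ok := model[query]
	if !ok {
		return nil
	}
	var hits []string
	for _, w := range strings.Fields(line) {
		if v, ok := model[w]; ok && cosine(qv, v) >= threshold {
			hits = append(hits, w)
		}
	}
	return hits
}

func main() {
	// Toy 3-dimensional vectors, invented for illustration only.
	model := map[string][]float32{
		"death": {1, 0, 0},
		"dying": {0.9, 0.1, 0},
		"fish":  {0, 1, 0},
	}
	fmt.Println(matchLine(model, "death", "the fish was dying", 0.55)) // prints [dying]
}
```

Lowering the threshold admits more loosely related words, and raising it narrows the matches toward near-synonyms.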

Features

Installation

Two files are required, and a third is optional:

  1. the w2vgrep binary
  2. the vector embedding model file
  3. (Optionally, a config.json file to tell w2vgrep where the embedding model is)

Using install script:

# clone
git clone https://github.com/arunsupe/semantic-grep.git
cd semantic-grep

# run install:
#   compiles using the local go compiler, installs in user/bin, 
#   downloads the model to $HOME/.config/semantic-grep
#   makes config.json
bash install.sh

Binary:

  1. Download the latest binary release
  2. Download a vector embedding model (see below)
  3. Optionally, download the config.json to configure model location there (or do this from the command line)

From source (linux/osx):

# clone
git clone https://github.com/arunsupe/semantic-grep.git
cd semantic-grep

# build
go build -o w2vgrep

# download a word2vec model using this helper script (see "Word Embedding Model" below)
bash download-model.sh

Usage

Basic usage:

./w2vgrep [options] [file]

If no file is specified, w2vgrep reads from standard input.

Command-line Options

-m, --model_path=     Path to the Word2Vec model file. Overrides config file
-t, --threshold=      Similarity threshold for matching (default: 0.7)
-A, --before-context= Number of lines before matching line
-B, --after-context=  Number of lines after matching line
-C, --context=        Number of lines before and after matching line
-n, --line-number     Print line numbers
-i, --ignore-case     Ignore case
-o, --only-matching   Output only matching words
-l, --only-lines      Output only matched lines without similarity scores
-f, --file=           Match patterns from file, one pattern per line. Like grep -f.

Configuration

w2vgrep can be configured using a JSON file. By default, it looks for config.json in the current directory, "$HOME/.config/semantic-grep/config.json" and "/etc/semantic-grep/config.json".
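A config lookup over the locations listed above might look like the sketch below. It makes two assumptions that are not confirmed here: that the locations are tried in the order they are listed, and that the JSON key holding the model location is `model_path` (chosen to mirror the --model_path option). The names `Config`, `candidatePaths`, and `loadFirst` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Config mirrors a minimal config.json. The "model_path" key is an
// assumption made for this sketch; check the repository's config.json.
type Config struct {
	ModelPath string `json:"model_path"`
}

// candidatePaths lists the three documented locations, in listed order.
func candidatePaths(home string) []string {
	return []string{
		"config.json",
		filepath.Join(home, ".config", "semantic-grep", "config.json"),
		"/etc/semantic-grep/config.json",
	}
}

// loadFirst parses the first candidate file that can be read.
// A parse error in that file stops the search and is returned.
func loadFirst(paths []string) (Config, string, error) {
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // missing or unreadable: try the next location
		}
		var cfg Config
		if err := json.Unmarshal(data, &cfg); err != nil {
			return Config{}, p, fmt.Errorf("parse %s: %w", p, err)
		}
		return cfg, p, nil
	}
	return Config{}, "", fmt.Errorf("no config.json found")
}

func main() {
	home, _ := os.UserHomeDir()
	cfg, path, err := loadFirst(candidatePaths(home))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("model %q from %s\n", cfg.ModelPath, path)
}
```

Under these assumptions, a config.json would contain a single entry such as `{"model_path": "path/to/model.bin"}`, and passing -m on the command line would take precedence over it.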

Word Embedding Model

Quick start:

w2vgrep requires a word embedding model in binary format. The default model loader uses the model file's extension to determine the type (.bin, .8bit.int). A few compatible model files are provided in this repo (models/). Download one of the .bin files from the models/ directory and update the path in config.json.

Note: git clone will not download the large binary model files unless git-lfs is installed on your machine. If you do not want to install git-lfs, manually download the model .bin file and place it in the correct folder.

Support for multiple languages:

Facebook's fastText group has published word vectors in 157 languages, an amazing resource. I would like to host these files on my GitHub account but, alas, they are too big and too costly to host. Therefore, I have provided a small Go program, fasttext-to-bin, that makes w2vgrep-compatible binary models from them. (Note: use the text files with the ".vec.gz" extension, not the binary ".bin.gz" files.)

# e.g., for a French model:
curl -s 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz' | gunzip -c | ./fasttext-to-bin -input - -output models/fasttext/cc.fr.300.bin

# use it like so:
# curl -s 'https://www.gutenberg.org/cache/epub/17989/pg17989.txt' \
#    | w2vgrep -C 2 -n -t 0.55 \
#           --model_path models/fasttext/cc.fr.300.bin 'château'

Roll your own:

Alternatively, you can use pre-trained models (like Google's Word2Vec) or train your own using tools like gensim. Note, though, that there does not seem to be a standardized binary format (Google's differs from Facebook's fastText format and from gensim's default save()). Because loading the large model efficiently is key to performance, I have elected to keep the simplest format for w2vgrep.

Testing the model by finding synonyms

To help troubleshoot the model, I added synonym-finder.go to ./model_processing_utils/. This program finds the words in the model whose similarity to the query word is at or above a given threshold.

# build
cd model_processing_utils
go build synonym-finder.go

# run
./synonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.6 合理性

# Output
Words similar to '合理性' with similarity >= 0.60:
科学性 0.6304
合理性 1.0000
正当性 0.6018
公允性 0.6152
不合理性 0.6094
合法性 0.6219
有效性 0.6374
必要性 0.6499

Decreasing the size of the model files

The model files are large (gigabytes). Each word is typically represented by a 300-dimensional vector of 32-bit floating-point numbers. Reducing dimensionality to 100 or 150 dimensions can produce smaller, more memory-efficient, faster models with minimal loss of accuracy (accuracy may even improve). In model_processing_utils/reduce-model-size, I have written a program to reduce model dimensions. It can be used to reduce the size of any word2vec binary model used by w2vgrep. Use this like so:

# build
cd model_processing_utils/reduce-model-size
go build .

# run on large GoogleNews-vectors-negative300-SLIM.bin model (346MB) to make smaller
# GoogleNews-vectors-negative100-SLIM.bin model (117MB)
./reduce-pca -input ../../models/googlenews-slim/GoogleNews-vectors-negative300-SLIM.bin -output ../../models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin

# use this smaller model in w2vgrep like so
curl -s 'https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-t.txt' \
    | bin/w2vgrep.linux.amd64 -n -t 0.5 -m models/googlenews-slim/GoogleNews-vectors-negative100-SLIM.bin death
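
To see why fewer dimensions shrink the file roughly in proportion, here is a back-of-the-envelope estimate. It counts only the raw vectors (4 bytes per 32-bit float) and ignores the vocabulary strings and any file header. The vocabulary size of 300,000 words is a hypothetical round number chosen for illustration, and `vectorBytes` is an invented helper name.

```go
package main

import "fmt"

// vectorBytes estimates the bytes needed for the raw vectors alone:
// one 32-bit (4-byte) float per dimension per word. Vocabulary strings
// and any file header are ignored.
func vectorBytes(vocab, dims int) int64 {
	return int64(vocab) * int64(dims) * 4
}

func main() {
	const vocab = 300_000 // hypothetical vocabulary size, for illustration
	for _, dims := range []int{300, 150, 100} {
		mb := float64(vectorBytes(vocab, dims)) / 1e6
		fmt.Printf("%d dims: %.0f MB\n", dims, mb)
	}
	// prints:
	// 300 dims: 360 MB
	// 150 dims: 180 MB
	// 100 dims: 120 MB
}
```

Going from 300 to 100 dimensions cuts the vector storage to one third, which is in line with the 346MB to 117MB reduction quoted in the comments above.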

A word about performance of the different embedding models

Different models define "similarity" differently (explanation). However, for practical purposes, they seem equivalent enough.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License and attribution:

The code in this project is licensed under the MIT License.

go-flags package:

The go-flags package, used by the code in this project, is distributed under the BSD-3-Clause license. Please see the license information at https://github.com/jessevdk/go-flags.

Word2Vec Model:

This project uses a mirrored version of the word2vec-slim model, which is stored in the models/googlenews-slim directory. This model is distributed under the Apache License 2.0. For more information about the model, its original authors, and the license, please see the models/googlenews-slim/ATTRIBUTION.md file.

GloVe word vectors:

This project uses a processed version of the GloVe word vectors, which is stored in the models/glove directory. This work is distributed under the Public Domain Dedication and License v1.0. For more information about the model, its original authors, and the license, please see the models/glove/ATTRIBUTION.md file.

Fasttext word vectors:

This project uses a processed version of the fasttext word vectors, which is stored in the models/fasttext directory. This work is distributed under the Creative Commons Attribution-Share-Alike License 3.0. For more information about the model, its original authors, and the license, please see the models/fasttext/ATTRIBUTION.md file.

Sources of models on the web