Locality Sensitive Hashing in R

LSHR - fast and memory efficient package for near-neighbor search in high-dimensional data. Two LSH schemes implemented at the moment:

Minhashing for jaccard similarity
Sketching (or random projections) for cosine similarity. Most of ideas are based on brilliant Mining of Massive Datasets book.

Materials

Slides (in english) and video (in russian) from my talk at Moscow Data Science meetup.

Quick reference

# devtools::install_github('dselivanov/text2vec')
library(text2vec)
library(LSHR)
data("movie_review")
it <- itoken(movie_review$review, preprocess_function = tolower, tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
dtm = as(dtm, "RsparseMatrix")

hashfun_number = 120
s_curve <- get_s_curve(hashfun_number, n_bands_min = 5, n_rows_per_band_min = 5)
# Examine S-curve.
# Find tradeoff between accuracy and false-positive rate.

S-curves

seed = 1
pairs = get_similar_pairs(dtm, bands_number = 10, rows_per_band = 32, distance = 'cosine', seed = seed)

pairs[order(-N)]

#        id1  id2  N
#    1: 1054 1417 10
#    2: 1084 3462 10
#    3: 1291 1356 10
#    4: 1615 3846 10
#    5: 2805 4763  4
#   ---             
# 2304: 4767 4961  1
# 2305: 4772 4776  1
# 2306: 4810 4859  1
# 2307: 4854 4945  1
# 2308: 4905 4918  1

dselivanov / LSHR

readme

Locality Sensitive Hashing in R

Materials

Quick reference