dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org
Other
850 stars 135 forks source link

Is there a good way to get the sorted similarity of each row of cosine similarity matrix #272

Closed bobbyliujb closed 6 years ago

bobbyliujb commented 6 years ago

I am following the introduction to tf-idf cos-similarity from text2vec. The returned similarity matrix is a sparse symmetric dsCMatrix. What I need is the top K most similar documents so I will need to sort on each row to get the index and similarity value. I am currently using apply to do the job but when the input dataset grows, it started to return error complaining the problem is too large. According to this answer, it is because apply will change the sparse matrix into a dense one.

#### given sim_matrix, get topK from each row and store in list of lists
get_topK_similar_items <- function(sim_matrix, K, min_sim_score) {
  candidates <- apply(sim_matrix, 1, get_topK_in_row, K, min_sim_score)
  return(candidates)
}
get_topK_in_row <- function(r, K, min_sim_score) {
  r <- r[r > min_sim_score]
  r <- sort(r, decreasing = TRUE)
  idx <- as.numeric(names(r))
  limit <- min(length(idx), K)
  if (limit < 1) {
    return(list())
  }
  r <- r[1:limit]
  return(lapply(seq_along(r), get_index_sim, idx, r))
}
get_index_sim <- function(i, idx, x) {
  return(c(idx[[i]], x[[i]]))
}
###################################

I know that using for-loop can definitely do this, but it is just way less efficient than using apply. I was wondering if text2vec has something can do this magic.

zachmayer commented 6 years ago

Try a different data structure, e.g.

rm(list=ls(all=T))
gc(reset=TRUE)

# Random sparse matrix
library(Matrix)
set.seed(42)
n_docs <- 1e4
m <- rsparsematrix(n_docs, n_docs, density=0.10, symmetric=T, rand.x=runif)

# Find top5 most similar docs
library(data.table)
dt <- data.table(summary(m))
setkeyv(dt, c('i', 'x'))  # i is row, j is col
dt[,rank := .N:1, by='i']
dt <- dt[rank <= 5,]
setkeyv(dt, c('i', 'rank'))
head(dt, 20)

If you need to make it even faster, try using sort(...,partial=TRUE) to pick the top 5, ignoring the sort order past 5. You'll need to modify my example somewhat.

bobbyliujb commented 6 years ago

Thanks Zach! After understanding your logic, I was able to get the top K as follows

dt <- data.table(summary(m))
setorder(dt, i, -x)                 # I don't know how to use setkeyv to sort in descending order on the value
dt[,rank := 1:.N, by='i']
head(dt[rank <= 5,], 20)

For the sort(..., partial = TRUE), I did not find any doc mentioning this. Could you elaborate more about it? Thanks!

zachmayer commented 6 years ago

On further consideration, I don’t actually think sort(partial) will help.

Setkey does the ordering and is really fast

Sent from my iPhone

On Jun 27, 2018, at 7:30 PM, bobbyliujb notifications@github.com wrote:

Thanks Zach! After understanding your logic, I was able to get the top K as follows

dt <- data.table(summary(m)) setorder(dt, i, -x) # I don't know how to use setkeyv to sort in descending order on the value dt[,rank := 1:.N, by='i'] head(dt[rank <= 5,], 20) For the sort(..., partial = TRUE), I did not find any doc mentioning this. Could you elaborate more about it? Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

bobbyliujb commented 6 years ago

Thanks Zach! Actually when I tried on larger similarity matrix(40000 x 40000), I am still running out of memory when calling summary(d1_d2_tfidf_cos_sim): Error: vector memory exhausted (limit reached?). I will continue to see if there is anything other than getting the topK that I should improve. P.S. I am on Mac OS High Sierra with 16GB RAM and RStudio 1.1.453.

zachmayer commented 6 years ago

That’s a fairly large matrix!

At this point, doing a for loop is probably your best bet. It will be slow, but since you’ll look at one row at a time it won’t run out of ram.

If you really want to make my approach work, you could try replacing small similarities in your matrix with zeros to make it sparser. Eg m[m<0.10]=0. Then drop zeros from the sparse matrix— it should now be sparse and maybe my code will run.

Sent from my iPhone

On Jun 27, 2018, at 8:53 PM, bobbyliujb notifications@github.com wrote:

Thanks Zach! Actually when I tried on larger similarity matrix(40000 x 40000), I am still running out of memory when calling summary(d1_d2_tfidf_cos_sim): Error: vector memory exhausted (limit reached?). I will continue to see if there is anything other than getting the topK that I should improve. P.S. I am on Mac OS High Sierra with 16GB RAM and RStudio 1.1.453.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.