Closed bobbyliujb closed 6 years ago
Try a different data structure, e.g.
rm(list=ls(all=T))
gc(reset=TRUE)
# Random sparse matrix
library(Matrix)
set.seed(42)
n_docs <- 1e4
m <- rsparsematrix(n_docs, n_docs, density=0.10, symmetric=T, rand.x=runif)
# Find top5 most similar docs
library(data.table)
dt <- data.table(summary(m))
setkeyv(dt, c('i', 'x')) # i is row, j is col
dt[,rank := .N:1, by='i']
dt <- dt[rank <= 5,]
setkeyv(dt, c('i', 'rank'))
head(dt, 20)
If you need to make it even faster, try using sort(...,partial=TRUE) to pick the top 5, ignoring the sort order past 5. You'll need to modify my example somewhat.
Thanks Zach! After understanding your logic, I was able to get the top K as follows
dt <- data.table(summary(m))
setorder(dt, i, -x) # I don't know how to use setkeyv to sort in descending order on the value
dt[,rank := 1:.N, by='i']
head(dt[rank <= 5,], 20)
For the sort(..., partial = TRUE)
, I did not find any doc mentioning this. Could you elaborate more about it? Thanks!
On further consideration, I don’t actually think sort(partial) will help.
Setkey does the ordering and is really fast
Sent from my iPhone
On Jun 27, 2018, at 7:30 PM, bobbyliujb notifications@github.com wrote:
Thanks Zach! After understanding your logic, I was able to get the top K as follows
dt <- data.table(summary(m)) setorder(dt, i, -x) # I don't know how to use setkeyv to sort in descending order on the value dt[,rank := 1:.N, by='i'] head(dt[rank <= 5,], 20) For the sort(..., partial = TRUE), I did not find any doc mentioning this. Could you elaborate more about it? Thanks!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Thanks Zach!
Actually when I tried on larger similarity matrix(40000 x 40000), I am still running out of memory when calling summary(d1_d2_tfidf_cos_sim)
: Error: vector memory exhausted (limit reached?)
.
I will continue to see if there is anything other than getting the topK that I should improve.
P.S. I am on Mac OS High Sierra with 16GB RAM and RStudio 1.1.453.
That’s a fairly large matrix!
At this point, doing a for loop is probably your best bet. It will be slow, but since you’ll look at one row at a time it won’t run out of ram.
If you really want to make my approach work, you could try replacing small similarities in your matrix with zeros to make it sparser. Eg m[m<0.10]=0. Then drop zeros from the sparse matrix— it should now be sparse and maybe my code will run.
Sent from my iPhone
On Jun 27, 2018, at 8:53 PM, bobbyliujb notifications@github.com wrote:
Thanks Zach! Actually when I tried on larger similarity matrix(40000 x 40000), I am still running out of memory when calling summary(d1_d2_tfidf_cos_sim): Error: vector memory exhausted (limit reached?). I will continue to see if there is anything other than getting the topK that I should improve. P.S. I am on Mac OS High Sierra with 16GB RAM and RStudio 1.1.453.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
I am following the introduction to tf-idf cos-similarity from text2vec. The returned similarity matrix is a sparse symmetric dsCMatrix. What I need is the top K most similar documents so I will need to sort on each row to get the index and similarity value. I am currently using apply to do the job but when the input dataset grows, it started to return error complaining the problem is too large. According to this answer, it is because apply will change the sparse matrix into a dense one.
I know that using for-loop can definitely do this, but it is just way less efficient than using apply. I was wondering if text2vec has something can do this magic.