validate.r can't work - Githubissues

pommedeterresautee commented 8 years ago

Several references in validate.r seem to be not in line with the values returned by the other functions:
\- classes expected for `signature_matrix`
\- column names of `candidate_indices` data.table

Kind regards,
Michaël

pommedeterresautee commented 8 years ago

Possible improvement: using mcmapply instead of mapply in validate.r to get multithreading when computing approximative distances

dselivanov commented 8 years ago

This is surprise for me, somebody except me uses this package :-)
validate.R is quite outdated (and actually very inefficient).

Here is function that I use at the moment (m is binary matrix):

validate_jaccard <- function(m, i1, i2) {
  m1 <- m[i1, , drop = FALSE]
  m2 <- m[i2, , drop = FALSE]
  intersect <- rowSums(m1 * m2)
  union <- rowSums( sign( m1 + m2) ) 
  intersect / union
}

library(text2vec)
library(LSHR)
data("movie_review")
options( mc.cores = 4)

it <- itoken(movie_review$review, tolower, word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer()) %>% 
           transform_binary

hashfun_number = 60
sign_mat <- get_signature_matrix(dtm, hashfun_number, 'jaccard', seed = 12L)

candidate_indices <- get_similar_pairs(signature_matrix = sign_mat,
                                       bands_number = 5,
                                       verbose = T)

validate_jaccard(dtm, candidate_indices$id1, candidate_indices$id2)

dselivanov commented 8 years ago

Also keep in mind, subsetting like m[i1, , drop = FALSE] can be not so fast. And if you have many candidates ( > ~1e5), it worth to perform validation iteratively, say in batches of 1e4 candidates (or in parallel with mclapply).

pommedeterresautee commented 8 years ago

The package quality is good (as text2vec :-), it does its job in an efficient way. Only possible issue is the documentation. It may be difficult to use if you don't already know what to expect. Info about how to set parameters (and their meaning) would help (like number of bands, number of random projection...). May be some inspiration: https://github.com/FALCONN-LIB/FALCONN/wiki/LSH-Primer http://jsatml.blogspot.fr/2013/06/cosine-similarity-locality-sensitive.html

pommedeterresautee commented 8 years ago

Another thing for documentation, how threshold is computed (why not provided by user) and how to interpret S graph. I used these LSH algo long time ago and forgot about these two points (ok takes 5mn on Google, but still)...

dselivanov commented 8 years ago

@pommedeterresautee totally agree with you. One problem - need to find some time for writing docs and vignettes :-) thanks for sharing links.

pommedeterresautee commented 8 years ago

This is my project I try to find similar texts (for each text, the ones related). I understand your point about threshold, just with 120 projections and 8 bands, I got > 40 millions of pairs... My main issue is that too many documents have no similar ones (or too few). I ve found manually that up to cosine distance 0.4/0.5, documents share the same legal issue (few words in common, but the one shared make sense). That's why I try to go as low as possible.

pommedeterresautee commented 8 years ago

After testing, it's very hard to paralellize cosine computation as mclapply copy everything for each core. I have tried the methods from R blogger to limit memory footprint of mclapply but so far not enough good. The biggest object, the document term matrix, has to be copied. May be rcpp would be a good choice for this task. 2 threads (max I can get with my Ram), 173 millions of cosine to compute, it's 7 hours on my computer :(

dselivanov commented 8 years ago

Can you provide some code ? Did you tried mc.preschedule=F

pommedeterresautee commented 8 years ago

I can post the code tonight (I am at work now).

But I have not tried to change mc.preschedule (so default -> True). I used 2000 as batch size. May be option mc.preschedule would help multi threading according to its description.

pommedeterresautee commented 8 years ago

@dselivanov my code below


require(LSHR)
require(magrittr)
require(parallel)

# Functions
# Create batch of IDs to analyze
id.splitter <- function(size, group = 2000) split(seq(size), ceiling(seq(size)/group))
# No need to normalize before each call, matrix is already normalized
compute.cosine <- function (m, i1, i2) rowSums(m[i1, , drop = FALSE] * m[i2, , drop = FALSE])

# 250K documents TF IDF already normalized
load(file = "decision.tfidf")
gc()

hashfun_number = 120
sign_mat <- get_signature_matrix(dtm.idf.decision, hashfun_number = hashfun_number, measure = 'cosine', seed = 12L)
gc()
# 173 millions of candidates
candidate_indices <- get_similar_pairs(signature_matrix = sign_mat, bands_number = 10, verbose = T)

t2 <- Sys.time()
result.parallel <- mclapply(X = id.splitter(nrow(candidate_indices)), 
                            FUN = function(indices) compute.cosine(dtm.idf.decision, candidate_indices[indices, id1], candidate_indices[indices, id2]), 
                            mc.cores = 2) %>% unlist # mclapply keeps order according to doc + tests, so unlist is ok
# result <- c()
# for(indices in id.splitter(nrow(candidate_indices))){
#   result <- c(result, compute.cosine(dtm.idf.decision, candidate_indices[indices, id1], candidate_indices[indices, id2]))
#   (tail(indices,1)*100/nrow(candidate_indices)) %>% print
# }
candidate_indices[, cos := result.parallel]
print(difftime(Sys.time(), t2, units = 'sec'))

Links of interest:

http://brainchronicle.blogspot.fr/2012/06/rcpp-vs-r-implementation-of-cosine.html : get an idea of what to expect with RCPP but code doesn't seem optimized
http://dirk.eddelbuettel.com/code/rcpp/html/math_8h_source.html : Rcpp includes cosine function (probably for dense matrix)
http://stackoverflow.com/questions/28029445/nested-for-loops-in-r-using-foreach-function-and-doparallel-library : just show optimized cosine computation with R syntax

dselivanov commented 8 years ago

No need to compute cosine with Rcpp. With fast BLAS (like openBLAS) pure R will be faster than Rcpp.

One quick thought - you can remove get_similar_pairs just after get_similar_pairs() call. This should release a lot of memory.

Also setting mc.preschedule = FALSE + larger group (say 1e5) should help to speedup!

pommedeterresautee commented 8 years ago

@dselivanov you can remove get_similar_pairs after get_similar_pairs() call. This should release a lot of memory. -> what should I remove? Do you mean call garbage collector?

I use openBlas on Ubuntu. I tried Intel MKM on Microsoft R version, but parallelization is only for dense matrix (I knew... but I hopped...).

dselivanov commented 8 years ago

Sorry, remove sign_mat - rm(sign_mat)

pommedeterresautee commented 8 years ago

Just to understand, when mclapply fork the environment, the signature matrix is not taken, right? teh doc says that only modified variables are taken in the copy.

dselivanov commented 8 years ago

From my experience it is hard to predict what will be copied... But according to docs - yes.

pommedeterresautee commented 8 years ago

BTW I made some search, on Pyhton / Java / Scala world, no awesome idea to compute cosine more rapidly. Always tproduct stuff :-( I have tried at lunch using the approximative cosinus guessed by random projection -> too much approximative (I didn't check perf difference). Just currious, what are you doing in NLP with all your packages?

dselivanov commented 8 years ago

Because cosine === dot product, what can be simpler=) IMHO issue is with subsetting of sparse matrix. Also worth to try to sort indices: setkey(indices, i1, i2).

I use these packages at work. But actually it is more like a hobby at the moment. I like to write efficient code. Since R is my primary tool for analysis/machine learning and it lacks good package for text mining, I decided to develop text2vec. Implementation of GloVe was the challenge for myself. I tried to push myself out of comfort zone and try:

get expertise with low-level parallel programming in C++ (something more complex than trivial loop with openMP).
implement some non-trivial algorithm from scratch.

pommedeterresautee commented 8 years ago

You are right, R is missing text mining tools. Hopefully, it will change with you and few other commited authors!

Regarding the cosine computation, below are my tests and the results:

convert DTT matrix to triplet sparse matrix -> very bad idea, increases size and bad impact on performances
remove dimnames + value names of the DTT -> decreaze size from 1.7 Gb to 1 Gb. Very good idea! In any case, I understand why DTT has column names with words (useful for debugging and so on), I am not sure why it has row names, and, worse, why each value has a string name? I don't see how it may be useful -> it increases the size of the sparse matrix. May be, during construction in text2vec, it should be avoided?
increasing btach size: very good effect on performance (may be the most impressive)
computing the indices in a lazy way -> in the code I posted above id.splitter is called before the computation and generates a huge list (600Mb) with all indices to test. Now it just creates a small vector with the top indice of each batch, the whole vector of indices is generated only when required. This change had a big impact on the memory (apparently the whole was copied on each thread)
changing mc.preschedule = FALSE -> this had no apparent effect (may be because the list of indices is generated lazily in the new version of the computation)
sorting indices in data.table with setkey -> no apparent effect on the performances
rm(sign_mat) -> no apparent effect on the performances/memory size (of course deleting decrease memory, but it doesn't seem to be copied for each thread)
calling gc() during the computation on 25% of threads -> doesn't impact performance too badly and helps in not exploding memory

Hope this report will be useful to you.

Kind regards, Michaël

dselivanov commented 8 years ago

Thanks for sharing.

I understand why DTT has column names with words (useful for debugging and so on), I am not sure why it has row names, and, worse, why each value has a string name? I don't see how it may be useful -> it increases the size of the sparse matrix. May be, during construction in text2vec, it should be avoided?

We keep document ids in rownames. It is useful in many cases - subsetting by name, manual inspection, etc. Unfortunately we can keep them only as strings - limitation of dgCMatrix.

pommedeterresautee commented 8 years ago

I was mainly refering to dtm@x names. Just tested on a new dataset and this vector has no name. I should check my own code to see what is happening.

dselivanov commented 8 years ago

If you are talking about names(dtm@x) - it is definitely strange!

pommedeterresautee commented 8 years ago

Yeah, I had names there. But I am doing lots of treatment on the matrix, may be it's one of them... or my custom post processing function. Need to check.

pommedeterresautee commented 8 years ago

Ok names on values is my fault (the way I do the regularization of the matrix).

dselivanov commented 7 years ago

@pommedeterresautee latest version brings a lot of improvements for sparse input, would be interesting to hear from you (in case you will have chance to try on some task). I use it for looking for similar rows in large sparse matrix (~ 1m rows, 100k columns) - works quite nice. For dense data will be noticeably less efficient - will fix in future.

pommedeterresautee commented 7 years ago

Looking at the commit, how much the memory peak has changed?

dselivanov commented 7 years ago

hard to say, I think 10-500x. This is for cosine only. jaccard is the same as was before.

dselivanov / LSHR

validate.r can't work #8