dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Reimplement createJSON() from LDAvis #233

Open dselivanov opened 6 years ago

dselivanov commented 6 years ago

It seems that the LDAvis package isn't actively maintained and won't be updated on CRAN in the near future. In particular, we need an option to not reorder topics, and fixes for NaN in jensenShannon (see https://github.com/cpsievert/LDAvis/issues/56):

  1. https://github.com/cpsievert/LDAvis/pull/77
  2. https://github.com/cpsievert/LDAvis/pull/80

manuelbickel commented 6 years ago

With respect to the Jensen-Shannon divergence, I think the fix proposed by Maren-Eckhoff, which is pending as an open pull request, already solves the problem. See the adapted function and test below.

The last comment in issue 56 mentioned above reported still getting NaN, but did not provide an example. As far as I understand, there should be no NaNs as long as the input data is fine, which it should be at this point. (Please correct me if I am wrong.)

# Adapted jensenShannon
jensenShannon <- function(x, y) {
    m <- 0.5 * (x + y)
    # Fix proposed by Maren-Eckhoff: terms with a zero probability contribute 0
    # instead of evaluating 0 * log(0), which is NaN in R
    # https://github.com/cpsievert/LDAvis/issues/56
    0.5 * (sum(ifelse(x == 0, 0, x * log(x / m))) + sum(ifelse(y == 0, 0, y * log(y / m))))
}
# Create phi for testing
p <-     c(0.25,   0, 0.25, 0,0.5)
q <-     c(   0,0.25, 0.25, 0,0.5)
zeros <- c(   0,   0,    0, 0,  0) # not a valid distribution, since rows should sum to one; included just for demo
phi <- rbind(p, q, qrev = rev(q), prev = rev(p), zeros)
#       [,1] [,2] [,3] [,4] [,5]
# p     0.25 0.00 0.25 0.00 0.50
# q     0.00 0.25 0.25 0.00 0.50
# qrev  0.50 0.00 0.25 0.25 0.00
# prev  0.50 0.00 0.25 0.00 0.25
# zeros 0.00 0.00 0.00 0.00 0.00
dist.mat <- proxy::dist(x = phi, method = jensenShannon)
pca.fit <- stats::cmdscale(dist.mat, k = 2)
#                [,1]       [,2]
# p      4.600278e-02 -0.1037688
# q      2.600304e-01 -0.0176260
# qrev  -2.600304e-01 -0.0176260
# prev  -4.600278e-02 -0.1037688
# zeros  2.073058e-16  0.2427896
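
To make the effect of the guard explicit, here is a minimal base-R sketch that compares the guarded function with an unguarded version on the same test vectors. The names `jensen_shannon`, `jensen_shannon_naive` and the loop-based distance matrix are illustrative assumptions (they avoid the proxy dependency) and are not taken from LDAvis.

```r
# Guarded Jensen-Shannon divergence: zero entries contribute 0.
jensen_shannon <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * (sum(ifelse(x == 0, 0, x * log(x / m))) +
         sum(ifelse(y == 0, 0, y * log(y / m))))
}

# Unguarded version, for comparison only (hypothetical name).
jensen_shannon_naive <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * (sum(x * log(x / m)) + sum(y * log(y / m)))
}

p <- c(0.25, 0, 0.25, 0, 0.5)
q <- c(0, 0.25, 0.25, 0, 0.5)

js_naive   <- jensen_shannon_naive(p, q)  # NaN: 0 * log(0) is 0 * -Inf in R
js_guarded <- jensen_shannon(p, q)        # finite, equals 0.25 * log(2)

# Pairwise distance matrix in base R, checked for missing values.
phi <- rbind(p = p, q = q)
n <- nrow(phi)
d <- matrix(0, n, n, dimnames = list(rownames(phi), rownames(phi)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jensen_shannon(phi[i, ], phi[j, ])
  }
}
stopifnot(!anyNA(d))
```

The result of `as.dist(d)` could then be passed to `stats::cmdscale()` in the same way as the proxy-based distance matrix above.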

dselivanov commented 6 years ago

True, but

  1. The PR has not been merged yet
  2. I doubt the maintainer will upload an updated package to CRAN in the near future

manuelbickel commented 6 years ago

Maybe my comment was misleading, sorry. I agree that LDAvis will have to be reimplemented; I just wanted to confirm that the fix works for this purpose. Hence, as a first step, a modified copy of createJSON might quickly solve the issues raised above with respect to creating the data for visualization. A potential reimplementation of the visualization itself is, of course, a separate matter.
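
As a rough sketch of what the "do not reorder topics" option could look like in such a modified copy: the helper `order_topics` and its `reorder.topics` argument are hypothetical names, and the sketch assumes that topics are otherwise sorted by decreasing corpus-wide proportion, computed from the document-topic matrix `theta` and the document lengths `doc.length`.

```r
# Hypothetical helper: returns the order in which topics would be emitted.
# theta:      documents x topics matrix of per-document topic proportions
# doc.length: number of tokens in each document
order_topics <- function(theta, doc.length, reorder.topics = TRUE) {
  # Token-weighted topic frequency over the whole corpus
  topic.frequency <- colSums(theta * doc.length)
  topic.proportion <- topic.frequency / sum(topic.frequency)
  if (reorder.topics) {
    order(topic.proportion, decreasing = TRUE)
  } else {
    # Keep the topics in the order of the fitted model
    seq_along(topic.proportion)
  }
}

theta <- rbind(c(0.2, 0.8), c(0.4, 0.6))
doc.length <- c(10, 20)
ord_sorted <- order_topics(theta, doc.length)
ord_kept   <- order_topics(theta, doc.length, reorder.topics = FALSE)
```

The returned index vector would then be applied consistently to the rows of phi and the columns of theta, so that topic labels in the visualization match those of the fitted model when reordering is switched off.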