cpsievert / LDAvis

R package for web-based interactive topic model visualization.

dist() with jensenShannon returns NaN #56

Closed lmkirvan closed 6 years ago

lmkirvan commented 8 years ago

I really enjoy this package and appreciate your work on it. I've previously used it successfully, but updated the package recently and now get an error that I previously had not encountered.

I can't quite figure out why (the Jensen-Shannon distance function looks okay), but

`jensenShannon <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x/m)) + 0.5 * sum(y * log(y/m))
}

dist.mat <- proxy::dist(x = parems$phi, method = jensenShannon)`

returns NaN when run on phi.

kshirley commented 8 years ago

Hi - I'm glad you're finding the package useful!

Regarding the NaN - is it possible you have an NA value in phi? Also - a long shot here - did you mean to write params$phi rather than parems$phi, i.e. is it just a typo?

If you want to share the data to make the error reproducible, that might also help us troubleshoot.

-k

lmkirvan commented 8 years ago

I can save the values of phi if that would be helpful. Let me know and I can send it to you via email or upload it to a GitHub repo. The phi I'm using does not include any NA values, and all rows sum to 1. There are several zero values (because of rounding), but I understood that wouldn't be a problem. I think the problem is with the distance function as written.

jsPCA2 <- function(phi) {
  jensenShannon2 <- function(x, y) {
    m <- 0.5 * (x + y)
    0.5 * sum(x * log(x/m)) + 0.5 * sum(y * log(y/m))
  }
  dist.mat <- proxy::dist(x = phi, method = 'Jaccard')
  return(dist.mat)
  pca.fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = pca.fit[, 1], y = pca.fit[, 2])
}

As you can see, I edited the jsPCA function to use another distance metric (chosen at random) and to return the distance matrix directly. With that metric, the result contains no NaN values.

I've also spotted a question on SO that looks like someone is experiencing a similar problem.

http://stackoverflow.com/questions/35830008/r-ldavis-k-2-createjson-error

Let me know if you'd like the phi file.

Thanks for your help.

-L

MarcinKosinski commented 8 years ago

NaN values are returned when the phi matrix contains zeros. That's why you have to add a constant to every value in the phi matrix, as is done in the tutorial.
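To make the point above concrete, here is a minimal sketch that reproduces the NaN with the function quoted earlier in the thread and then applies the add-a-constant workaround. The vectors `p` and `q`, the helper `smooth`, and the value of `eps` are illustrative assumptions, not taken from the tutorial or from the reporter's data.

```r
# Jensen-Shannon divergence as quoted earlier in this thread
jensenShannon <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x / m)) + 0.5 * sum(y * log(y / m))
}

# Two made-up topic rows; the first contains an exact zero
p <- c(0.5, 0.5, 0.0)
q <- c(0.2, 0.3, 0.5)

# log(0) is -Inf and 0 * -Inf is NaN, so the zero entry poisons the sum
js_raw <- jensenShannon(p, q)

# Workaround: add a small constant to every entry, then renormalize
eps <- 1e-8  # assumed value, chosen only for illustration
smooth <- function(v) (v + eps) / sum(v + eps)
js_smooth <- jensenShannon(smooth(p), smooth(q))

print(js_raw)     # NaN
print(js_smooth)  # a finite, positive number
```

Note that the constant changes the distributions slightly, so the resulting distances are approximations of the ones computed on the original phi.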

Maren-Eckhoff commented 7 years ago

Hi Marcin, thanks for the great package. I think the solution should not be to add a constant. The problem appears because R evaluates 0*log(0) to NaN. Mathematically, though, the limit of x log(x) as x approaches 0 is 0, so the corresponding summand in the jensenShannon metric should be 0. For example, you could replace

sum(x * log(x/m))

by

sum(ifelse(x == 0, 0, x * log(x/m)))

Best, Maren
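Put together, the replacement suggested above might look like the following sketch. The name `jensenShannonSafe` and the inner helper `kl` are hypothetical and chosen here for illustration; this is not the code that ships with LDAvis. It assumes both inputs are non-negative vectors of equal length.

```r
# Jensen-Shannon divergence with the convention 0 * log(0) = 0
jensenShannonSafe <- function(x, y) {
  m <- 0.5 * (x + y)
  # Zero entries contribute 0 to the sum instead of NaN
  kl <- function(a) sum(ifelse(a == 0, 0, a * log(a / m)))
  0.5 * kl(x) + 0.5 * kl(y)
}

# Made-up rows, the first with an exact zero
p <- c(0.5, 0.5, 0.0)
q <- c(0.2, 0.3, 0.5)

js_pq <- jensenShannonSafe(p, q)                      # finite despite the zero
js_pp <- jensenShannonSafe(p, p)                      # identical rows give 0
js_disjoint <- jensenShannonSafe(c(1, 0), c(0, 1))    # disjoint support gives log(2)
```

The function can be passed as the `method` argument of `proxy::dist`, in the same way as the original `jensenShannon` in the first post.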

MarcinKosinski commented 7 years ago

@Maren-Eckhoff that's a great solution.

I'm not the owner of the package, but @cpsievert is and might like to know about this improvement.

meftasadat commented 7 years ago

I have encountered the same issue.

I applied the fix suggested above by @Maren-Eckhoff (thanks!). It works in most cases but still fails in some, with jsPCA returning the error "infinite or missing values in 'x'".
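The thread does not establish what causes the remaining failures. One way to narrow it down is to inspect phi before computing distances. The sketch below uses a hypothetical helper, `checkPhi`, which is not part of LDAvis, and a made-up matrix; the tolerance `tol` is likewise an assumption.

```r
# Hypothetical helper: count inputs that the zero-handling fix does not cover
checkPhi <- function(phi, tol = 1e-8) {
  list(
    # NA, NaN, Inf or -Inf entries
    non_finite = sum(!is.finite(phi)),
    # negative entries, which can make log() return NaN
    negative = sum(phi < 0, na.rm = TRUE),
    # indices of rows whose finite entries do not sum to 1
    bad_rows = which(abs(rowSums(phi, na.rm = TRUE) - 1) > tol)
  )
}

# Made-up example: row 2 has an NA, row 3 does not sum to 1
phi <- rbind(c(0.5, 0.5, 0.0),
             c(0.2, NA,  0.8),
             c(0.3, 0.3, 0.3))
res <- checkPhi(phi)
print(res)
```

If all three checks come back clean, the next thing to inspect would be the distance matrix itself, for example with `any(!is.finite(as.matrix(dist.mat)))`.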