bmabey opened 9 years ago
Looks like it is pretty straightforward to fix. A number of selects grab the top element with a certain class, or all elements with that class, without taking the parent visID into account. For example: https://github.com/cpsievert/LDAvis/blob/master/inst/htmljs/ldavis.js#L1005
Thanks for the example. Unfortunately, my initial thought is that it will be hard to do this right. The event handling is programmed with a single visualization in mind (notice how selections for the lower visualizations don't work). For example, suppose I want to link to a selection of topic 12 and a lambda value of 0.53:
http://benmabey.com/LDAvis/#topic=12&lambda=0.53&term=
Should that selection apply to all the visualizations? I suppose you could come up with unique identifiers for each viz, but this might be more work than it's worth (at least for me).
Actually, this might not be as bad as I first thought, but I'm quite busy at the moment. Pull requests are more than welcome!
We ended up fixing this in our own version. Multiple visualizations are now possible on a single page; however, we punted on putting multiple viz states into the URL. Right now any interaction with any viz updates the same global settings. We could prepend the selections with the viz id (which we had to add anyway) but decided not to hassle with that for now.
Anyway, the changes we made live in the python port repo if you are interested in incorporating them. BTW, I finally packaged up the python version so you can check it out:
https://github.com/bmabey/pyLDAvis
http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
Nice! Can your version handle multiple visualizations in a standalone HTML page? I would certainly consider a pull request! That way if any more improvements are made, we can both benefit :)
It should handle multiple visualizations on one page, no problem. Most of the change involved reorganizing the d3 element ids & class names.
Awesome! I would love a pull request!
Hey Ben,
This is awesome; thanks for porting LDAvis to python! Sorry I haven't chimed in until now. Really happy to see this project cross the boundaries of programming languages.
One thing about your demo page:
The second viz seems to have a slight bug. With lambda = 1, the terms should be sorted in decreasing order of probability, i.e. the widths of the red bars should be non-increasing from the top bar down to the bottom bar. But in most of the topics, among the top-30 terms, the red bars are not sorted in this order -- sometimes a narrow red bar is above a wider red bar. Is it possible that the input in this case is not correct? We've seen this a few times in our work, too, and usually the bug is something simple like an incorrect ordering of the terms of the vocab, or some other data processing glitch.
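The check described above can be written as a quick sanity test. This is just a sketch; the function name and numbers are made up for illustration and are not part of either codebase:

```python
# Sanity check for the bug described above: with lambda = 1, terms are
# listed in decreasing order of probability, so the estimated frequencies
# (the red bar widths) should be non-increasing from top to bottom.

def bars_sorted(frequencies):
    """Return True if the bar widths never increase from top to bottom."""
    return all(a >= b for a, b in zip(frequencies, frequencies[1:]))

# A healthy topic: widths shrink (or stay equal) down the list.
print(bars_sorted([400, 250, 250, 90, 12]))   # True

# The symptom in the buggy viz: a narrower bar above a wider one.
print(bars_sorted([400, 90, 250, 12]))        # False
```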
I haven't used gensim before, so until I learn it, I can't see what the problem is for myself, but hopefully it's a trivial fix.
Anyway, overall it looks great!
-kenny
Thanks @kshirley for the kind words and for pointing out this bug.
The good news is that I have verified that the bug is somewhere within the gensim data processing, since LDAvis gives me the same erroneous visualization when given the same input. The bad news is that after spending a few hours investigating this bug I'm still at a loss as to where the issue is. (I've checked the cases you suggested.) I'll continue to investigate it and push a fix once I hunt it down.
Alright, sounds like a doozy...
Is it possible that some of the small probabilities are being represented in scientific notation? Just a wild guess (I think this once caused problems for us in the past).
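For what it's worth, here is one concrete way scientific notation can break an ordering. This is a hypothetical illustration of that guess, not a claim about where this particular bug lives: if small probabilities are ever compared as strings, a lexicographic sort misorders them.

```python
# Hypothetical illustration: probabilities represented as strings,
# some in scientific notation.
probs = ["0.5", "1e-05", "2e-06"]

# Lexicographic sort: "1e-05" lands before "2e-06" even though it is
# the larger value, because '1' < '2' as characters.
print(sorted(probs))             # ['0.5', '1e-05', '2e-06']

# Numeric sort gives the intended ascending order.
print(sorted(probs, key=float))  # ['2e-06', '1e-05', '0.5']
```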
Let me know if you figure it out, and maybe in a few days (after some near-term deadlines) I can have a look myself by checking out your notebook. Good luck!
I've been looking at the gensim code some more. With @cscorley's help I was able to make the extraction of data more straightforward. With the changes I'm even more confident that what I'm doing is correct, and yet the visualization is still showing the irregularities (i.e. the estimated freq. counts and the probabilities are not perfectly correlated). At this point I don't think I'll be investing any more time into it and will wait until a more experienced gensim user can help track down the issue.
Hi @bmabey , @cpsievert,
I found the bug that's causing the problems with Ben's viz -- it's in the createJSON() function of LDAvis.
Here's a quick summary:
(1) In line 156 of R/createJSON.R we compute the topic frequencies:
topic.frequency <- colSums(theta * doc.length)
whose sum is N, the total number of tokens in the (training) data.
(2) Then, on line 181, we multiply phi (the matrix containing topic-term distributions) by the topic frequencies, so that we can visualize counts rather than conditional probabilities (under the theory that the counts provide more info, and people can naturally get a feel for signal vs. noise given the counts; 400 occurrences out of 800 is different from 4 out of 8):
term.topic.frequency <- phi * topic.frequency
A problem arises, though, because now the sums across terms are not equal to the original term frequencies. The terms with high probabilities in rare topics have their frequencies effectively downweighted, and the terms with high probabilities in common topics have their frequencies inflated.
(3) We correct for this right away, in lines 183-185, so that the sums across terms in the frequency matrix (called term.topic.frequency) are adjusted to be equal to the term frequencies supplied by the user in the vector "term.frequency":
err <- as.numeric(term.frequency/colSums(term.topic.frequency))
# http://stackoverflow.com/questions/3643555/multiply-rows-of-matrix-by-vector
term.topic.frequency <- sweep(term.topic.frequency, MARGIN=2, err, `*`)
I thought this was OK, but Ben's example shows the problem in doing this: the order of these adjusted frequencies is now different from the order of phi. It is subtle in most cases, but in Ben's model fit, practically every topic has visible cases where, with lambda = 1, we see that the words are properly listed in decreasing order of relevance (i.e. decreasing order of probability because lambda = 1), but the pink bar widths are computed using the adjusted frequencies, which are not quite in the same order.
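A quick numerical sketch of steps (1)-(3) makes the reordering concrete. This is plain Python with made-up numbers; the variable names loosely follow createJSON.R. It shows how the column-wise rescaling in step (3) can change the within-topic ordering relative to phi:

```python
# Toy example mirroring createJSON.R; all numbers are made up.
phi = [[0.40, 0.35, 0.25],          # topic-term distributions (rows sum to 1)
       [0.05, 0.15, 0.80]]
topic_frequency = [1000.0, 100.0]   # step (1): colSums(theta * doc.length)

# Step (2): phi * topic.frequency -- put phi on the count scale.
term_topic_frequency = [[p * tf for p in row]
                        for row, tf in zip(phi, topic_frequency)]

# The column sums no longer match the observed term frequencies,
# so compute the per-term correction factors.
term_frequency = [300.0, 450.0, 350.0]
col_sums = [sum(col) for col in zip(*term_topic_frequency)]
err = [tf / cs for tf, cs in zip(term_frequency, col_sums)]

# Step (3): sweep(..., MARGIN=2, err, `*`) -- scale each column by err.
adjusted = [[f * e for f, e in zip(row, err)] for row in term_topic_frequency]

def order(xs):
    """Indices of xs in decreasing order of value."""
    return sorted(range(len(xs)), key=lambda i: -xs[i])

# Each term got its own multiplier, so within a topic the adjusted
# frequencies can disagree with phi's ordering -- the narrow-bar-above-
# wide-bar symptom.
print(order(phi[0]))       # [0, 1, 2] : order by probability
print(order(adjusted[0]))  # [1, 0, 2] : order by adjusted frequency
```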
I'll paste it here again. Viz #2 on the page, Topic 3 with lambda = 1 has some especially visible problems:
Solutions: in the short term, we can
...or
In the long term, I think we may have to ask users for the values of the priors (usually denoted alpha and beta) that they used in the model fit, in order to compute a version of phi on a frequency scale that is consistent with both (1) the topic frequencies and (2) the original term frequencies.
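One shape the long-term idea could take, sketched under the assumption of a symmetric Dirichlet prior beta (the function and numbers below are illustrative, not the actual proposal): reconstruct phi directly from the topic-term token counts, so that probabilities and counts agree by construction.

```python
# Sketch: if users supply the smoothing parameter beta, phi can be
# rebuilt from raw counts, keeping counts and probabilities consistent.
# Names and numbers are illustrative only.

def smoothed_phi(topic_term_counts, beta):
    """phi[k][w] = (n_kw + beta) / (n_k + W * beta), symmetric prior."""
    W = len(topic_term_counts[0])
    return [[(n + beta) / (sum(row) + W * beta) for n in row]
            for row in topic_term_counts]

counts = [[400, 350, 250],   # raw topic-term token counts
          [5, 15, 80]]
phi = smoothed_phi(counts, beta=0.1)

# Each row of phi sums to 1, and its ordering matches the counts exactly.
print([round(p, 3) for p in phi[0]])   # [0.4, 0.35, 0.25]
```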
Well... that was a long one. First off - which of the short-term solutions do you prefer?
Let me know what you think, if you have Q's, etc. I can provide more examples. To figure this out I read Ben's python LDA fit into R and just walked through the createJSON() function. After computing the object we call "tinfo" it was apparent.
-k
Thanks @kshirley, I'll have to think about it. I'm swamped this week, but I'll respond next week!
Since this issue deals with what pull request #39 solves, I'm moving the conversation around the new bug to issue #40.
When more than one visualization is displayed on a single page, most of the controls for the bottom visualization affect the top/first vis instead of their own elements.
I or @log0ymxm will submit a PR for this sometime next week unless you beat us to it. :)