czhang03 closed this issue 8 years ago
If we're hoping that release 3.0 will come this summer, we'll need to figure out what exactly has changed for release 2.5. There are some important UI improvements in the works (Bootstrap layout, improved data tables, and a much improved Select interface), but I'm not sure whether we should wait until they are finished for version 2.5 or make them part of version 3.0.
I think once we start 3.0 development, the master branch will become unstable; therefore a release with downloads will give people a stable version with as many features as possible (a dev branch could solve this issue too, but that is kind of complicated).
thanks for raising the issue, cheng; i agree that we should make a v2.x release before the summer work starts; i am spending the next few days at a conference but will devote some time to Lexos (at least at a macroscopic level, e.g., i am studying k-means so i can better "explain" it to others);
but "releasing" makes me alil' nervous for at least 3 items: (0) i really like scott's new work on UI; i know it is not "completely done"; but when are we ever "done"? could we release now (very soon) given the present state? (note i am talking about releasing scott's bootstrap version (https://github.com/scottkleinman/Lexos-Bootstrap) (1) similarity query is just too slow; i have not had time to address this; not sure it should even be included? (2) top-words has a Beta flag on it (rightly so); should this still be included?
I have no problem with Top Words being commented out if needed. That will be you guys' decision.
Or we can just comment out the link and mention that in the release message; if someone really needs it, they can still go to /topwords.
Nevertheless, I think that TopWords has a far more solid mathematical foundation than GreyWord did...
i'm not completely unhappy with the Beta label (so keep it there, with that "in progress" label);
i wish i understood topwords better; i do hope to learn more ...
Just some further thoughts...
```python
from sklearn.metrics.pairwise import cosine_similarity

# dtm is the document-term matrix (rows = documents, columns = term counts)
dist = 1 - cosine_similarity(dtm)  # pairwise cosine distances between documents
dist[1, 3]  # distance between the documents at row indices 1 and 3
```
Someone just needs to integrate it.
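In case it helps, `dtm` in the snippet would be the document-term matrix; a minimal way to build one with scikit-learn (the sample documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Beowulf spoke boldly", "the king spoke", "the dragon slept"]
dtm = CountVectorizer().fit_transform(docs)  # rows = documents, columns = term counts
```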
k-means: i will try to devote some time to k-means this week and next;
similarityQuery: i'll confess to not having looked at how we are presently doing it; this is a reasonable summer assignment for a pair of students, that is, to integrate the faster method and test it
topWords: i agree, topWords is really important and a welcome addition; i am just not up to speed on how it works and how to effectively apply it; i think "Professor Cheng" can set us up to address these concerns early in the summer
Bootstrap-Lexos: an April release is a good (internal) deadline;
LOL.
for similarityQuery, we basically just follow the tutorial in gensim. I think it will be hard to speed it up until we find a new lib that can do this.
for TopWords, I think I can give you a better mathematical explanation now, but I am not sure that the English guys will like it... The methods we use are pretty "intro stats", but that stuff is used over and over again in all kinds of research (psych, medical); that is why I think the math foundation is strong. I wish I could provide a more advanced method, like the original one. Nevertheless, I was just a freshman at that point, so I hope I can do better this time. (one of the main incentives for me to keep working on Math Stats is to build a better TopWords and GreyWord... LOL)
(please don't tell Prof. Khan that...)
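For the curious, my (possibly inaccurate) reading of that "intro stats" machinery is a standard two-proportion z-test: compare a word's share of one document against its share of the rest of the corpus. A toy sketch, not the actual TopWords code:

```python
from math import sqrt

def two_proportion_z(count_a, total_a, count_b, total_b):
    """z-score for the difference between two word-usage proportions."""
    p_a = count_a / total_a                        # word's share of document A
    p_b = count_b / total_b                        # word's share of corpus B
    p = (count_a + count_b) / (total_a + total_b)  # pooled proportion
    return (p_a - p_b) / sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))

# e.g. a word appearing 120 times in 10,000 tokens vs 300 times in 100,000
print(two_proportion_z(120, 10_000, 300, 100_000))  # large positive z => over-used
```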
I just saw Scott's comment.
I personally am pretty skeptical of that solution. I remember Scott brought it up last summer, but to the best of my knowledge that code just does a cosine similarity check, which is already available in clustering analysis; therefore I don't think it is very useful. What SimQuery does beyond that is a Latent Space Transformation (which I don't completely understand; from what I know now, I think it just removes the non-crucial counts from the word counts).
Therefore I am not sure it is a great idea to change SimQuery to a simple cosine-similarity query.
It is true that the Dariah-DE tutorial just does a cosine similarity check. You have the option to cluster the results in the Hierarchy screen, but you could also loop through the documents and output the distances between them as a ranked list without clustering. That's essentially what the output of Similarity Query gives you (along with a cosine similarity measure, which you can't get in Hierarchy). If this is all we need, we don't need gensim.
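For what it's worth, here is a minimal sketch of that loop, building on the scikit-learn snippet above (the function name is just illustrative):

```python
import numpy as np

def rank_by_distance(dist, query_idx):
    """Rank every other document by its cosine distance to the query document."""
    order = np.argsort(dist[query_idx])                  # nearest documents first
    return [(i, dist[query_idx, i]) for i in order if i != query_idx]

for doc_idx, d in rank_by_distance(dist, query_idx=0):
    print(f"doc {doc_idx}: distance {d:.3f}")
```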
The Latent Space Transformation in gensim may explain why it takes so much longer. As I understand it, gensim "transformations" change the count matrix to another kind of matrix, such as tf-idf. If I understand this tutorial correctly (and assuming it's the one used for building the Lexos tool), gensim applies some latent semantic analysis (similar to topic modelling) to allow for queries against individual words or phrases. That is, if you query the words "Beowulf maþelode" (Beowulf spoke), you can get a ranking of the documents that most closely match the query terms. Essentially, it's an information retrieval model. So this may not be necessary for Lexos. In terms of the process, it seems to take into account documents that don't share the same vocabulary (full explanation here), but I haven't had time to read the whole thing.
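If it helps, this is roughly the shape of that pipeline, sketched from the pattern in the gensim tutorial rather than from our actual code (the documents and query string are invented):

```python
from gensim import corpora, models, similarities

# toy corpus; real code would use the workspace documents
texts = [doc.lower().split() for doc in
         ["Beowulf spoke boldly", "the king spoke", "the dragon slept"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

# project the counts into a low-rank "latent" space (the LSA step)
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

vec_bow = dictionary.doc2bow("beowulf spoke".split())      # the query terms
sims = index[lsi[vec_bow]]                                 # cosine similarity in latent space
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # documents ranked by match
```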
If the gensim method is so slow that the browser hangs, it may not be a good option for a web-based tool. On the other hand, if we think the method is important enough, I think it is important to (a) put a warning in the UI and (b) make sure that the method is not presented as a black box. We've somehow got to describe it in simple terms in In the Margins...
I think that is reasonable, and the implementation will be much easier than what we have now; all we need to change is the processor.
Another thing: I think we need to give people a few more options than cosine similarity, like we did in clustering.
Just to clarify your last point, do you mean providing options like (a) rank by Euclidean distance, (b) rank by cosine similarity? I can't think of any other likely metrics, but presumably, we could give the option to transform the dtm counts to proportional or weighted (tf-idf) counts before calculating the distance matrix. That seems to me to be a nice enhancement (and relatively easy to implement). The value would be similar to the offering of multiple methods for clustering--you can compare the results.
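To make the idea concrete, here is a rough sketch of how those options might look with scikit-learn (the parameter names and transform choices are just the obvious candidates, not a design):

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import pairwise_distances

def distance_matrix(dtm, metric="cosine", weighting=None):
    """Pairwise document distances over a dense counts matrix (docs x terms).

    dtm: dense NumPy array, e.g. CountVectorizer().fit_transform(docs).toarray()
    """
    if weighting == "proportional":
        counts = dtm / dtm.sum(axis=1, keepdims=True)   # each row sums to 1
    elif weighting == "tfidf":
        counts = TfidfTransformer().fit_transform(dtm)  # weighted counts
    else:
        counts = dtm                                    # raw counts
    return pairwise_distances(counts, metric=metric)    # "cosine" or "euclidean"
```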
I think this will be the list:
tf-idf is simply too painful, since nobody knows what it does or how to use it.
But for release 2.5 I think we can just stay with cosine. Other options can wait for 3.0.
So switching Similarity Query from gensim to the simpler scikit-learn method is a goal for release 2.5, and integrating multiple metrics is for 3.0. Did I understand that right?
That is exactly what I think.
This is a duplicate; see the more thorough discussion at https://github.com/WheatonCS/Lexos/issues/280
After one year, https://github.com/WheatonCS/Lexos/releases is still a draft, and no download is provided.