WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

I think we may need to release 2.5 #272

Closed: czhang03 closed this issue 8 years ago

czhang03 commented 8 years ago

Looking at https://github.com/WheatonCS/Lexos/releases: after one year, this is still a draft, and no download is provided.

scottkleinman commented 8 years ago

If we're hoping that release 3.0 will come this summer, we'll need to figure out what exactly has changed for release 2.5. There are some important UI improvements in the works (Bootstrap layout, improved data tables, and a much improved Select interface), but I'm not sure whether we should wait until they are finished for version 2.5 or make them part of version 3.0.

czhang03 commented 8 years ago

I think once we start 3.0 development, the master branch will become unstable, so a release with downloads will give people a stable version with as many features as possible. (A dev branch could solve this issue too, but that is kind of complicated.)

mleblanc321 commented 8 years ago

thanks for raising the issue, cheng; i agree that we should make a v2.x release before the summer work starts; i am spending the next few days at a conference and will spend some time on Lexos (at least, at a macroscopic level, e.g., i am studying k-means so i can better "explain" it to others);

but "releasing" makes me alil' nervous for at least 3 items: (0) i really like scott's new work on UI; i know it is not "completely done"; but when are we ever "done"? could we release now (very soon) given the present state? (note i am talking about releasing scott's bootstrap version (https://github.com/scottkleinman/Lexos-Bootstrap) (1) similarity query is just too slow; i have not had time to address this; not sure it should even be included? (2) top-words has a Beta flag on it (rightly so); should this still be included?

czhang03 commented 8 years ago

I have no problem with top-words being commented out if needed. That will be your decision.

czhang03 commented 8 years ago

Or we can just comment out the link and note that in the release message; if someone really needs it, they can go to /topwords. Nevertheless, I think topwords has a far more solid mathematical foundation than greyword did...

mleblanc321 commented 8 years ago

i'm not completely unhappy with the Beta label (so keep it there, with that "in progress" label);
i wish i understood topwords better; i do hope to learn more ...

scottkleinman commented 8 years ago

Just some further thoughts... The Dariah-DE tutorial approach is just a cosine similarity calculation on the document-term matrix:

```python
from sklearn.metrics.pairwise import cosine_similarity

# dtm is a document-term matrix (documents as rows, term counts as columns)
dist = 1 - cosine_similarity(dtm)
dist[1, 3]  # the cosine distance between doc1 and doc3 (zero-indexed)
```

Someone just needs to integrate it.
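For context, here is a minimal sketch of producing the `dtm` that snippet assumes, using scikit-learn's CountVectorizer (the sample documents are hypothetical stand-ins for segmented Lexos texts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents standing in for Lexos texts.
docs = ["beowulf mathelode bearn ecgtheowes",
        "hrothgar mathelode helm scyldinga"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

dist = 1 - cosine_similarity(dtm)     # pairwise cosine distances
print(dist[0, 1])                     # distance between the two documents
```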

mleblanc321 commented 8 years ago

k-means: i will try to devote some time to k-means this week and next;

similarityQuery: i'll confess to not having looked at how we are presently doing it; this is a reasonable summer assignment for a pair of students, that is, to integrate the faster method and test

topWords: i agree, topWords is really important and a welcome addition; i am just not up to speed on how it works and how to effectively apply it; i think "Professor Cheng" can set us up to address these concerns early in the summer

Bootstrap-Lexos: an April release is a good (internal) deadline;

czhang03 commented 8 years ago

LOL.

For similarityQuery, we basically just follow the tutorial in gensim. I think it will be hard to speed it up until we find a new lib that can do this.

For TopWords, I think I can give you a better mathematical explanation now, but I am not sure that the English guys will like that... The methods we use are pretty "intro stats", but that material is used over and over again in all kinds of research (psych, medical), which is why I think the math foundation is strong. I wish I could provide a more advanced method, like the original one. I was just a freshman at that point, so I hope I can do better this time. (One of the main incentives for me to keep working on mathematical statistics is to build a better TopWords and GreyWords... LOL)

(please don't tell Prof. Khan that...

czhang03 commented 8 years ago

I just saw Scott's comment.

I am personally pretty skeptical of that solution. I remember Scott brought it up last summer, but to the best of my knowledge that code just does a cosine similarity check, which is already available in the clustering analysis, so I don't think it is very useful on its own. What SimQuery does beyond that is a latent space transformation (which I don't completely understand; from what I know now, I think it is just removing the non-crucial counts from all the word counts).

Therefore I am not sure it is a great idea to change SimQuery to a simple cosine-similarity query.

scottkleinman commented 8 years ago

It is true that the Dariah-DE tutorial just does a cosine similarity check. You have the option to cluster the results in the Hierarchy screen, but you could also loop through the documents and output the distances between them as a ranked list without clustering. That's essentially what the output of Similarity Query gives you (along with a cosine similarity measure, which you can't get in Hierarchy). If this is all we need, we don't need gensim.
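A minimal sketch of that ranked-list idea, assuming a document-term matrix `dtm` like the one above (the function name and labels are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_distance(dtm, query_index, labels):
    """Rank all other documents by cosine distance to the query document."""
    dist = 1 - cosine_similarity(dtm)
    order = np.argsort(dist[query_index])  # nearest documents first
    return [(labels[i], dist[query_index, i]) for i in order if i != query_index]

# e.g. rank_by_distance(dtm, 0, ["doc0", "doc1", "doc2"])
```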

The Latent Space Transformation in gensim may explain why it takes so much longer. As I understand it, gensim "transformations" change the count matrix to another kind of matrix, such as tf-idf. If I understand this tutorial correctly (and assuming it's the one used for building the Lexos tool), gensim applies some latent semantic analysis (similar to topic modelling) to allow for queries against individual words or phrases. That is, if you query the words "Beowulf maþelode" (Beowulf spoke), you can get a ranking of the documents that most closely match the query terms. Essentially, it's an information retrieval model. So this may not be necessary for Lexos. In terms of the process, it seems to take into account documents that don't share the same vocabulary (full explanation here), but I haven't had time to read the whole thing.
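As a sketch of that pipeline (not necessarily the exact code Lexos uses; the sample documents and num_topics are illustrative), the gensim tutorial flow is roughly: build a bag-of-words corpus, apply a tf-idf transformation, project into a latent (LSI) space, and then query a similarity index:

```python
from gensim import corpora, models, similarities

# Hypothetical documents standing in for Lexos texts.
docs = ["beowulf mathelode bearn ecgtheowes",
        "hrothgar mathelode helm scyldinga"]
texts = [doc.split() for doc in docs]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)  # counts -> tf-idf weights
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)  # latent space
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])

query = dictionary.doc2bow("beowulf mathelode".split())
sims = index[lsi[tfidf[query]]]  # cosine similarity of the query to each document
print(sorted(enumerate(sims), key=lambda pair: -pair[1]))  # ranked documents
```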

If the gensim method is so slow that the browser hangs, it may not be a good option for a web-based tool. On the other hand, if we think the method is important enough, I think it is important to (a) put a warning in the UI and (b) make sure that the method is not presented as a black box. We've somehow got to describe it in simple terms in In the Margins...

czhang03 commented 8 years ago

I think that is reasonable, and the implementation will be much easier than what we have now; all we need to change is the processor.

Another thing: I think we need to give people a few more options than cosine similarity, like we did in clustering.

scottkleinman commented 8 years ago

Just to clarify your last point, do you mean providing options like (a) rank by Euclidean distance, (b) rank by cosine similarity? I can't think of any other likely metrics, but presumably we could give the option to transform the dtm counts to proportional or weighted (tf-idf) counts before calculating the distance matrix. That seems to me a nice enhancement (and relatively easy to implement). The value would be similar to offering multiple methods for clustering: you can compare the results.
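A sketch of what that enhancement might look like, assuming scikit-learn handles the weighting and metrics (the function name and option values are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

def distance_matrix(dtm, metric="cosine", weighting="raw"):
    """Optionally re-weight the counts, then compute a pairwise distance matrix."""
    if weighting == "proportional":
        dtm = normalize(dtm, norm="l1")            # raw counts -> row proportions
    elif weighting == "tfidf":
        dtm = TfidfTransformer().fit_transform(dtm)
    return pairwise_distances(dtm, metric=metric)  # e.g. "cosine" or "euclidean"

# e.g. distance_matrix(dtm, metric="euclidean", weighting="tfidf")
```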

czhang03 commented 8 years ago

I think this will be the list: [screenshot "capture" attached]

tf-idf is simply too painful, since nobody knows what it does or how to use it.

czhang03 commented 8 years ago

But for release 2.5 I think we can just stay with cosine. Other options can wait for 3.0.

scottkleinman commented 8 years ago

So switching Similarity Query from gensim to the simpler scikit-learn method is a goal for release 2.5, and integrating multiple metrics is for 3.0. Did I understand that right?

czhang03 commented 8 years ago

That is exactly what I think.

czhang03 commented 8 years ago

This is a duplicate; see the more thorough discussion at https://github.com/WheatonCS/Lexos/issues/280.