This scikit-learn section is really badly written, so it is hard to see how it might be applied. If I understand correctly, some of the metrics require prior cluster structure to test against, and that will not generally be the case for the kinds of experiments most users will be doing. Last year, I floated some ideas about using entropy to deal with linguistic variation, and it seems like there might be some potential there. See section 2.3.8.4.
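For concreteness, here's roughly the kind of entropy calculation I had in mind (a minimal sketch only; `segment_entropy` is an illustrative name, not anything in Lexos):

```python
# Sketch: Shannon entropy of a segment's word distribution as a rough
# measure of lexical variation. Assumes `counts` is one row of the
# document-term matrix as raw counts; illustrative, not Lexos code.
import numpy as np
from scipy.stats import entropy

def segment_entropy(counts):
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()   # normalise counts to probabilities
    return entropy(probs, base=2)   # Shannon entropy in bits

print(segment_entropy([3, 1, 1, 0, 2]))
```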
Is there any case for adding a function for normalising word counts in the document- (or segment-) term matrix? See the discussion under Comparing Texts here. Or is that already built into the Cosine distance metric in Lexos?
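Something like this is what I mean by a normalising function (a sketch only; for what it's worth, cosine distance already divides by the vector norms, so explicit normalisation would mainly matter for other metrics):

```python
# Sketch: normalise each row of a document-term matrix to relative
# frequencies, so length differences between segments don't dominate
# Euclidean-style metrics. Illustrative only, not existing Lexos code.
import numpy as np

def normalize_rows(dtm):
    dtm = np.asarray(dtm, dtype=float)
    row_sums = dtm.sum(axis=1, keepdims=True)
    return dtm / row_sums           # each row now sums to 1
```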
I agree that we should rethink how we do dendrograms, and I think there are some cool alternatives for how we display them besides a simple png.
However, we should keep in mind that it is insanely beneficial for the research half of Lexomics that we keep the legend embedded in the dendrogram output, so that when they print one they get the other as well.
Scott, what do you mean when you say the scikit-learn section is badly written? Are you referring to a portion of our tool, or to the library itself?
Bryan, I was referring to the cluster performance evaluations for which Tom provided the link.
I agree about the usefulness of the legend, though you never know how people will use the output. It might be worth adding a feature to toggle it on and off.
I also agree about exploring alternatives to png. d3 has some new visualisations that might be useful. In my dreams, the ideal would be a combination of the Drag and Drop, Zoomable, Panning, Collapsible Tree (with auto-sizing) and the tooltip-enabled nodes of the 2013 Federal Budget visualisation (with tooltips containing statistics, or even nodes triggering dialogs showing the segment's text). The main barrier is just massaging the cluster object into a json string or csv accessible by d3.
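That massaging step may not be too bad with scipy. A rough sketch of walking the linkage matrix into the nested node/children structure that d3's tree layouts expect (assumes `Z` comes from `scipy.cluster.hierarchy.linkage` and `labels` names the segments; illustrative only):

```python
# Sketch: convert a scipy linkage matrix into the nested
# {"name": ..., "children": [...]} JSON used by d3 tree layouts.
import json
from scipy.cluster.hierarchy import to_tree

def linkage_to_d3(Z, labels):
    def walk(node):
        if node.is_leaf():
            return {"name": labels[node.id]}
        # internal node: recurse into both branches
        return {"name": "", "children": [walk(node.left), walk(node.right)]}
    return json.dumps(walk(to_tree(Z)))
```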
Cluster evaluation where the user has no label information will be an easy feature to add. This can be trivially concatenated to the legend text. For users or scenarios where class labels are known, we can evaluate the clustering results in more robust ways. Even if those users are rare, we can use known class labels on testing data to show end-to-end performance of the tool.
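To make that concrete, something like this is what I have in mind (the scikit-learn metric functions are real; the toy data just stands in for our document-term matrix and cluster output):

```python
# Sketch: cluster evaluation with and without ground-truth labels.
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
predicted = [0, 0, 1, 1]          # cluster assignments from the tool

# No labels required: internal measure of cohesion and separation,
# easy to concatenate onto the legend text.
print("silhouette: %.3f" % silhouette_score(X, predicted))

# With known class labels (e.g. on testing data): external agreement.
true_labels = [0, 0, 1, 1]
print("adjusted Rand: %.3f" % adjusted_rand_score(true_labels, predicted))
```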
RE: normalization. I'll have to check again, but I don't think there's anything like tf-idf weighting happening or any dimensionality reduction either. Is that something you're interested in, Scott? If so, I'm very much interested in making that happen.
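If we do add it, it would likely be a thin wrapper around scikit-learn, e.g. (a sketch; `TfidfTransformer` is real, but nothing like this currently runs in Lexos):

```python
# Sketch: apply tf-idf weighting to an existing document-term matrix
# of raw counts. Illustrative only.
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_weight(dtm_counts):
    # expects raw term counts; returns a sparse tf-idf weighted matrix
    return TfidfTransformer().fit_transform(dtm_counts)

weighted = tfidf_weight([[3, 0, 1], [2, 0, 0], [0, 5, 2]])
```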
tf-idf weighting. A simple answer is yes, I'd love to have this functionality, and others will demand it. But I think there is a ui/ux issue here. Playing devil's advocate to my earlier question, are functions like these going to overwhelm our users? If we have them, I think it might be best to put them in an "advanced" section. However, there might be a case for developing them as plugins first to see how much usage we actually get out of them. That would have the further advantage of helping us to develop data export/import functionality (right now, you can't download a word counts csv and then upload it back to the system; you have to re-upload the texts and then re-generate the data).
(nice thread) perhaps we can consider leaving the vanilla dendrogram tool as is, but also adding a richer tool of the kind that addresses some of tom's suggestions about cluster evaluation? (that said, i agree that we could attempt a full rewrite of the dendrogram tool with d3-like features)
[scott: we really do want to allow users to upload .csv data, but we should discuss when and where in the workflow, e.g., i hate to mess with the simple opening upload page (for texts), so perhaps the uploading of .csv files could happen on appropriate analysis pages?]
As Tom mentions, it's trivial to tack some cluster evaluation on the end of the legend, assuming that we can run it easily before generating the dendrogram image.
I've never really understood the code that generates an image file from d3 (it's used in the Wordcloud function), and I'm not sure it would be worthwhile for dynamic visualisations like the examples I gave. d3 is really better suited to onscreen exploratory tools in separate functions.
On uploading csv data (and we should probably support json as well), I agree that a lot of discussion about workflow is needed. But we'll also need to think a lot about file management. I've already prototyped one possibility in adapting Multicloud to read Mallet data to produce topic clouds: the csv data is uploaded as a text file, and the topic cloud tool gives you an option to select the data file from your files list. Of course, you could also use Select to disable all the other files. The main problem is that it (currently) does no error checking in case you select a file in the wrong format. In terms of workflow, I think this is a better option than having separate uploads for different tools, but it probably means that we should extend the uploadable file types to .csv and .json and work out a system for ensuring that certain file types are not fed to tools that cannot read them.
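Even a cheap guard would help with the error-checking problem; something along these lines (the tool names and accepted-type mapping here are hypothetical placeholders, not Lexos code):

```python
# Sketch: refuse to feed a file to a tool that can't read it.
# TOOL_TYPES and the tool names are hypothetical.
import os

TOOL_TYPES = {
    "multicloud": {".csv", ".json"},   # e.g. Mallet topic data
    "dendrogram": {".txt", ".xml"},
}

def check_file_for_tool(filename, tool):
    ext = os.path.splitext(filename)[1].lower()
    if ext not in TOOL_TYPES.get(tool, set()):
        raise ValueError("%s files cannot be read by the %s tool" % (ext, tool))
```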
Much of what was discussed here has been implemented, and Tom's original link to cluster performance evaluations has been transferred to the Lexos road map document. So I think we can close this issue.
I'd like to expand the text/descriptive output of the dendrogram to include cluster performance evaluations: dendrograms to start, but potentially other methods later. This is in addition to the legend output included in the figure. We may want to rethink where and in what format this meta information is shared.