WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Word Cloud generating different clouds from the same data #149

Closed scottkleinman closed 8 years ago

scottkleinman commented 9 years ago

I have been sent some data from State of the Union speeches that is not resulting in consistent word clouds in the Word Cloud tool. Each time the graph is rendered, the JSON object in the dataset variable is the same, but the generated SVG code differs in which words are displayed. To give examples:

Graph 1: states: 6911, congress: 5862, etc.
Graph 2: government: 7549, states: 6911, congress: 5862, etc.

As you can see, Graph 1 skipped "government". I have submitted the same data to Jason Davies's original Word Cloud Generator (http://www.jasondavies.com/wordcloud/), and it does not produce the same effect. So something is going on in our rendering of the graph.

scottkleinman commented 9 years ago

Some extra information. I added console.log(cloud.length) after line 105 of scripts_wordcloud.js and then ran Word Cloud in Firefox. It consistently tells me that my data cloud has 27880 objects but it also flashes up a number of repeats, which is inconsistent. From what I can tell, "government" only shows up in the graph if the number of repeats is higher. I'm not sure what the repeats are (not words in the dataset--I've checked that).

scottkleinman commented 9 years ago

I have now identified the problem. High-frequency words are dropped if they cannot fit within the layout. Re-generating the cloud will sometimes reveal the missing words, since each new cloud has a different layout, but not always. Switching the scale from the default log n to √n (sqrt) or n (linear) improves the results but still does not display high-frequency words 100% of the time. I have found that adjusting the size of the word cloud and the scale of the contents fits more words, but even that does not guarantee that every word will be included for every data set. We would also have to build in user-defined size/scale functions.
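To make the effect of the three scales concrete, here is a minimal Python sketch that maps word counts to font sizes. It is not the Lexos or d3 code: the `font_size` helper, the 10 to 100 pixel range, and the low-frequency word `rare` are assumptions for illustration, and only the first three counts come from the data above.

```python
import math


def font_size(count, max_count, scale="log", min_px=10.0, max_px=100.0):
    """Map a word count to a font size in pixels under the chosen scale.

    Hypothetical helper: the pixel range and the exact formulas are
    assumptions, not what scripts_wordcloud.js or d3 actually uses.
    """
    transforms = {
        "log": lambda n: math.log(n + 1),  # +1 keeps a count of 0 defined
        "sqrt": math.sqrt,
        "linear": float,
    }
    transform = transforms[scale]
    ratio = transform(count) / transform(max_count)
    return min_px + ratio * (max_px - min_px)


# The first three counts are from this issue; "rare" is a made-up
# low-frequency word added only to show the contrast between scales.
counts = {"government": 7549, "states": 6911, "congress": 5862, "rare": 10}
top = max(counts.values())

sizes = {
    scale: {word: round(font_size(n, top, scale), 1) for word, n in counts.items()}
    for scale in ("log", "sqrt", "linear")
}

for scale, by_word in sizes.items():
    print(scale, by_word)
```

Under the log scale the low-frequency word stays comparatively large, while under the linear scale it shrinks to nearly the minimum size. That is one plausible reason the linear scale leaves more room for other words, though it does not guarantee that every word is placed.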

I think a better approach is to experiment with ideal scaling for different sized data sets and autodetect the best fit for the data. That would produce consistently better word clouds, but would not entirely solve the problem. In the Margins should discuss the implications of this limitation and direct users to BubbleViz, which is probably better suited to visualising large amounts of data.

Discussion of the issue can be found at the following links:

https://github.com/jasondavies/d3-cloud/issues/36
https://github.com/jasondavies/d3-cloud/issues/19
https://github.com/jasondavies/d3-cloud/issues/17

scottkleinman commented 9 years ago

Belatedly, another idea is to include a small window where the user can scroll through a word count table. If the most frequent word(s) in the table do not appear in the cloud, the user will know immediately and can turn to In the Margins to find out why.

scottkleinman commented 9 years ago

A word counts table has been added to Word Cloud so that the user can see if a word has been omitted. Discussion of this issue and ways to customise the layout need to go in In the Margins. But, short of modifying the d3 algorithm, there's not much more we can do, so I'm closing this issue for now.

mleblanc321 commented 9 years ago

nice job

scottkleinman commented 8 years ago

As far as I can tell, there has been no progress on this in d3.layout.cloud(). We might try implementing one of the workarounds in the discussion at jasondavies/d3-cloud#36. Failing that, it might be good to go one better than my previous solution by automatically displaying a table with the 3 most frequent words and the 3 longest words. That way, the user can easily see if their word cloud is faulty.
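The check table could be sketched roughly as follows, in Python for illustration. The function name `cloud_check_table` and the `placed` list are assumptions. In practice the list of placed words would have to come from the layout itself (for example, the words passed to the layout's "end" callback, assuming it reports only the words it managed to place).

```python
def cloud_check_table(counts, placed, n=3):
    """Build rows for the n most frequent and n longest words.

    Each row records whether the word made it into the cloud, so a
    faulty cloud is easy to spot. Hypothetical helper, not Lexos code.
    """
    placed = set(placed)
    # Ties are broken alphabetically so the table is stable between runs.
    most_frequent = sorted(counts, key=lambda w: (-counts[w], w))[:n]
    longest = sorted(counts, key=lambda w: (-len(w), w))[:n]

    rows = []
    for reason, words in (("most frequent", most_frequent), ("longest", longest)):
        for word in words:
            rows.append({
                "word": word,
                "count": counts[word],
                "reason": reason,
                "in_cloud": word in placed,
            })
    return rows


# Counts are from this issue; `placed` is an assumed layout result that
# mirrors Graph 1, where "government" was dropped.
counts = {"government": 7549, "states": 6911, "congress": 5862}
placed = ["states", "congress"]

rows = cloud_check_table(counts, placed)
missing = sorted({row["word"] for row in rows if not row["in_cloud"]})
print(missing)
```

A word can appear in both halves of the table (here "government" is both the most frequent and the longest), so the list of missing words is deduplicated before it is shown.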

Also, the problem may be more extreme in multicloud since the layout area is much smaller. I'm not exactly sure how we deal with this.

kreddy95 commented 8 years ago

Is this still a bug?

scottkleinman commented 8 years ago

Alas, yes. I've added a new In the Margins label to remind us to document this.

scottkleinman commented 8 years ago

I have added this issue to our In the Margins notes. Since it doesn't seem like this problem will be addressed in d3.js any time soon, I think we can close this issue.