apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.6k stars 13.45k forks

Tokenize text when making a word cloud #9672

Closed jzonthemtn closed 4 years ago

jzonthemtn commented 4 years ago

Is your feature request related to a problem? Please describe.

When the word cloud is used on fields that contain sentences, it treats each field's entire value as a single "word", so the cloud ends up displaying whole sentences.

Describe the solution you'd like

Have an option to tokenize (or simply split on whitespace) the text in a field so that the word cloud can count each individual word in the field.

Describe alternatives you've considered

Changing how the data is ingested into the database, but I don't see a good solution from that angle if the word cloud expects one word per field.

Additional context

None.

villebro commented 4 years ago

This is an interesting idea, @jzonthemtn. To make sure the proposed feature is as generic as possible, do you have any suggestions for tokenization options? I'm wondering how periods, commas, special characters, etc. should be handled.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

jzonthemtn commented 4 years ago

@villebro My recommendation would be to strip punctuation and then split on whitespace. This would work well for my use case. If that is not sufficient for a user, I would suggest they preprocess the text before saving it in their database, so that they control how it is tokenized.
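
For illustration, the "strip punctuation, then split on whitespace" approach could look roughly like the sketch below. This is not Superset code: the function names `tokenize` and `word_frequencies` are hypothetical, and the sketch assumes that "punctuation" means the ASCII characters in Python's `string.punctuation`.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
# Assumption: ASCII punctuation only; Unicode punctuation is left untouched.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Strip punctuation, then split on runs of whitespace."""
    return text.translate(_PUNCT_TABLE).split()


def word_frequencies(fields):
    """Count individual words across field values (hypothetical helper)."""
    counts = Counter()
    for field in fields:
        counts.update(tokenize(field))
    return counts
```

With this sketch, `tokenize("Hello, world! Hello.")` yields `["Hello", "world", "Hello"]`. Two caveats: counting is case-sensitive, and deleting punctuation also merges contractions ("don't" becomes "dont"), which is the kind of detail a user might prefer to handle in their own preprocessing.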

ktmud commented 4 years ago

While this feature would definitely be useful, it’s also pretty easy to create virtual data sources that split strings and explode arrays into rows:

https://stackoverflow.com/questions/51063730/split-one-row-into-multiple-rows-based-on-comma-separated-string-column

https://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows
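
As a rough sketch of the split-into-rows idea, the query below uses a recursive common table expression to turn each space-separated word into its own row and then counts the words. It runs against an in-memory SQLite database through Python's `sqlite3` module purely as a stand-in; the table name `comments`, the column `body`, and the helper `word_counts` are hypothetical, and the exact SQL differs between database engines (many offer dedicated split or unnest functions instead).

```python
import sqlite3

# Recursive CTE: repeatedly peel the first space-delimited word off `rest`.
# Assumption: words are separated by plain spaces; empty pieces are dropped.
SPLIT_WORDS_SQL = """
WITH RECURSIVE split(word, rest) AS (
    SELECT '', body || ' ' FROM comments
    UNION ALL
    SELECT substr(rest, 1, instr(rest, ' ') - 1),
           substr(rest, instr(rest, ' ') + 1)
    FROM split
    WHERE rest <> ''
)
SELECT word, COUNT(*) AS n
FROM split
WHERE word <> ''
GROUP BY word
ORDER BY n DESC, word
"""


def word_counts(rows):
    """Load sentences into a throwaway table and return (word, count) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    conn.executemany("INSERT INTO comments (body) VALUES (?)",
                     [(r,) for r in rows])
    result = conn.execute(SPLIT_WORDS_SQL).fetchall()
    conn.close()
    return result
```

A query of this shape could serve as the definition of a virtual data source, with the word cloud then pointed at the resulting `word` column.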
