amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

Add support for generating the word cloud from array-like of labels #271

Closed soupault closed 7 years ago

soupault commented 7 years ago

That is, something like generate_from_array(array), where array is an array-like of labels, e.g. ('a', 'b', 'c', 'a'), ['a', 'b', 'c', 'a'], or np.array([1, 2, 3, 2]). The counting would run under the hood (using collections.Counter, for example).
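
For illustration only, here is a minimal sketch of what such a helper might look like. The name count_labels is hypothetical and not part of wordcloud. The sketch assumes the labels are hashable and that converting them to strings is acceptable, since a word cloud displays text.

```python
from collections import Counter


def count_labels(array):
    """Count occurrences of each label in an array-like (hypothetical helper)."""
    # str() lets numeric labels such as 1, 2, 3 be displayed as words.
    return dict(Counter(str(label) for label in array))


frequencies = count_labels(["a", "b", "c", "a"])
print(frequencies)
```

The resulting dict could then be handed to WordCloud().generate_from_frequencies(frequencies), which requires the wordcloud package to be installed.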

Please let me know if you would be interested in having this feature. If so, I'll work on the implementation.

P.S. Thank you for the great tool :)

amueller commented 7 years ago

I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way. Can you maybe give an example for your usecase?

soupault commented 7 years ago

@amueller

I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way.

Yes, I'm proposing that wordcloud take care of this for array-like input (which is a common case, I assume). Note that WordCloud().generate_from_text is essentially implemented in a similar way and performs the counting internally.

Can you maybe give an example for your usecase?

Basically, I'm doing multi-label classification and averaging predictions over the test set for visualization purposes. So I run inference on a list of samples, collect the results in a list of lists (each sublist stores the predicted labels for a single sample in order of decreasing confidence), flatten the outer list, count the number of occurrences of each label, and build the WordCloud.

This could also be applied to a multi-class classification problem for recommender systems, where one of the use cases is to explore the top3/top5/topN predictions over the test set.
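
The pipeline described above can be sketched roughly as follows. The prediction data and the helper name top_n_label_counts are made up for illustration; only the flatten-then-count steps come from the description.

```python
from collections import Counter
from itertools import chain


def top_n_label_counts(predictions, n=None):
    """Flatten per-sample label lists and count each label (hypothetical helper).

    Each sublist is assumed to be sorted by decreasing confidence, so
    slicing to the first n entries keeps the top-n predictions per sample.
    With n=None, every predicted label is counted.
    """
    truncated = (labels[:n] for labels in predictions)
    return dict(Counter(chain.from_iterable(truncated)))


# Made-up predictions for three test samples.
predictions = [["cat", "dog", "bird"], ["dog", "cat"], ["dog"]]
all_counts = top_n_label_counts(predictions)
top1_counts = top_n_label_counts(predictions, n=1)
```

Either dict of counts could then be used as the frequencies for the word cloud.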

amueller commented 7 years ago

Ah, so a list with repetitions. You can either do " ".join(array) and pass it to generate_from_text, or call pandas value_counts on it and pass the result to generate_from_frequencies.

amueller commented 7 years ago

I recommend the second option, since it bypasses tokenization and you already know what the tokens are supposed to be.
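
A simplified sketch of why bypassing tokenization can matter. It uses plain str.split as a stand-in for tokenization, which is an assumption: the tokenizer inside generate_from_text is more involved. The labels are made up.

```python
from collections import Counter

# Made-up labels; one of them contains a space.
labels = ["new york", "paris", "new york"]

# Option 1: join into one string, then split it again (stand-in tokenizer).
joined_counts = Counter(" ".join(labels).split())

# Option 2: count the labels directly, keeping each label intact.
direct_counts = Counter(labels)

print(joined_counts)
print(direct_counts)
```

With direct counting, the multi-word label survives as a single entry, whereas the join-and-split route breaks it into separate words.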

soupault commented 7 years ago

@amueller

You can either do " ".join(array) and pass it to generate_from_text, or call pandas value_counts on it and pass the result to generate_from_frequencies.

Of course, but there is overhead in both cases (and, frankly speaking, in my pipeline as well): creating and splitting a potentially large string in the first, and a pandas dependency and its containers in the second.

Going back to the original question :) : would you like to see this kind of input supported by wordcloud out of the box? To me, the current generate_from_text looks like a special case of the proposed generate_from_array, and could be built on top of it.

amueller commented 7 years ago

I'm wary of adding too many interfaces. You can also implement value_counts in your own code in three lines, which is exactly the code you'd add to wordcloud:

from collections import defaultdict
# Count how many times each label occurs in array.
d = defaultdict(int)
for word in array:
    d[word] += 1

What's the problem with adding those to your code? This is what process_tokens does, but it also does other processing that you don't want.

soupault commented 7 years ago

No problem at all, it is already implemented that way :). I was just wondering whether this is a common enough case.

Thank you very much for the feedback! Closing as wontfix.