amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

I can't get "R&D" and "SMEs" in the same word cloud #769

Open aarre opened 1 week ago

aarre commented 1 week ago

Description

I work in a field where "R&D" (research and development) and "SMEs" (small and medium enterprises) are important concepts. If I tokenize the text myself and pass the frequencies to Word Cloud, it displays "r&d" correctly but does not display "smes", even though I have verified that "smes" is prominent in my frequency count. If I let Word Cloud tokenize the text, it displays "smes" correctly but renders "R&D" as "r d" (i.e., with a space where the ampersand should be). Is there anything I can do?

Steps/Code to Reproduce

Example:


    import collections

    import wordcloud

    # Note: text_path and stop_words are not defined in this excerpt.
    with open(text_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Tokenize by hand: split on single spaces and count each token.
    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    # This shows "smes" and "r d" (but no ampersand)
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=stop_words.STOP_WORDS,
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate(text)

    # This shows "r&d" but not "smes"
    cloud = wordcloud.WordCloud(width=1920, height=1080,
                                background_color='white',
                                stopwords=set(),
                                font_path="./assets/fonts/roboto/Roboto-Regular.ttf").generate_from_frequencies(frequencies)

Expected Results

Either way, I should be able to get both "smes" and "r&d" on the same Word Cloud.

Actual Results

As described above, in one case I get "smes" and "r d", and in the other case, I get "r&d" but no "smes".
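
The "r d" result would be consistent with a tokenizer that only matches word characters and apostrophes, so that "&" acts as a separator. The following is a minimal standard-library sketch of that effect, not the library's actual code. It assumes a default-style pattern of the form `\w[\w']*`; the pattern `AMPERSAND_AWARE` and the helper `tokenize` are hypothetical names introduced only for illustration.

```python
import re

# Assumed default-style pattern: a word character followed by
# word characters or apostrophes. "&" is not matched, so it splits tokens.
DEFAULT_STYLE = r"\w[\w']*"

# Hypothetical alternative that also allows "&" inside a token.
AMPERSAND_AWARE = r"\w[\w'&]*"


def tokenize(text, pattern):
    """Lowercase the text and return every match of the pattern."""
    return re.findall(pattern, text.lower())


sample = "R&D spending by SMEs"
default_tokens = tokenize(sample, DEFAULT_STYLE)
ampersand_tokens = tokenize(sample, AMPERSAND_AWARE)

print(default_tokens)    # "r" and "d" come out as separate tokens
print(ampersand_tokens)  # "r&d" stays in one piece
```

`WordCloud` accepts a `regexp` argument for tokenizing the input text, so a pattern like the second one could be tried there. Since "r d" looks like a two-word phrase, the `collocations` argument may also be relevant. Neither has been verified against this report.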

Versions

Windows-11-10.0.22631-SP0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
NumPy 1.26.4
matplotlib 3.9.0
wordcloud 1.9.3

aarre commented 6 days ago

Here is a fully functional example. The version where I calculate my own frequencies works fine, but it is clear that there is an issue with the version where I send the full text to Word Cloud.

    #!venv/bin/python

    import collections
    import matplotlib.pyplot as plt
    import wordcloud

    if __name__ == "__main__":

        # Ten copies of each term, separated by single spaces.
        text = " ".join(["sme"] * 10 + ["r&d"] * 10)

        # Count tokens by hand, splitting on spaces only.
        frequencies = collections.Counter()
        for word in text.split(" "):
            frequencies[word] += 1
        frequencies = dict(frequencies)

        # Case 1: let Word Cloud tokenize the raw text.
        cloud = wordcloud.WordCloud().generate(text)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('text')
        # bbox_inches is a savefig option, not a show option, so it is dropped here.
        plt.show()

        # Case 2: pass the precomputed frequencies.
        cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
        plt.imshow(cloud)
        plt.tight_layout(pad=0)
        plt.axis('off')
        plt.title('frequencies')
        plt.show()

[Image: img_text (word cloud titled "text")]

[Image: img_frequencies (word cloud titled "frequencies")]
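
For the hand-tokenized path, here is a sketch of how frequencies could be counted with a regular expression instead of `text.split(" ")`. Splitting on single spaces leaves punctuation and newlines attached to tokens and keeps case variants separate, which can dilute the count for a word such as "SMEs". Whether that explains the missing "smes" in the original report is not confirmed here. `TOKEN_PATTERN` and `count_tokens` are hypothetical names, and the sample sentence is made up for illustration.

```python
import collections
import re

# Hypothetical token pattern: word characters, apostrophes, and "&".
TOKEN_PATTERN = r"\w[\w'&]*"


def count_tokens(text):
    """Lowercase the text, extract tokens with the pattern, and count them."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return dict(collections.Counter(tokens))


# Illustrative sample with punctuation and a newline.
sample = "SMEs fund R&D.\nR&D helps SMEs, and SMEs grow."

# Naive counting: "SMEs," and "R&D.\nR&D" end up as their own tokens.
naive = dict(collections.Counter(sample.split(" ")))

# Regex-based counting: punctuation and newlines no longer stick to tokens.
frequencies = count_tokens(sample)

print(naive)
print(frequencies)
```

A dictionary built this way has the same shape as the `frequencies` argument passed to `generate_from_frequencies` in the examples above.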