Open aarre opened 1 week ago
Here is a fully functional example. The version where I calculate my own frequencies works fine, but it is clear that there is an issue with the version where I send the full text to Word Cloud.
#!venv/bin/python
import collections

import matplotlib.pyplot as plt
import wordcloud

if __name__ == "__main__":
    text = " ".join(["sme"] * 10 + ["r&d"] * 10)

    # Count word frequencies by hand, splitting on spaces only.
    frequencies = collections.Counter()
    for word in text.split(" "):
        frequencies[word] += 1
    frequencies = dict(frequencies)

    # Version 1: let Word Cloud tokenize the raw text.
    cloud = wordcloud.WordCloud().generate(text)
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('text')
    plt.show()  # bbox_inches is a savefig option, not a show option

    # Version 2: pass the precomputed frequencies.
    cloud = wordcloud.WordCloud().generate_from_frequencies(frequencies)
    plt.imshow(cloud)
    plt.tight_layout(pad=0)
    plt.axis('off')
    plt.title('frequencies')
    plt.show()
Description
I work in a field where "R&D" (research and development) and "SMEs" (small and medium enterprises) are important concepts. If I tokenize myself, then Word Cloud displays "r&d" correctly but does not display "smes" even though I have verified they are prominent in my frequency count. If I let Word Cloud tokenize, then it displays "smes" correctly but renders "R&D" as "r d" (i.e., with a space where there should be an ampersand). Is there anything I can do?
Steps/Code to Reproduce
Example:
Expected Results
Either way, I should be able to get both "smes" and "r&d" on the same Word Cloud.
Actual Results
As described above, in one case I get "smes" and "r d", and in the other case, I get "r&d" but no "smes".
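The "r d" result can be reproduced without Word Cloud at all. The following is a minimal sketch using only `re`, under the assumption that Word Cloud's default token pattern is roughly `\w[\w']*` (I have not confirmed this for every version). The names `DEFAULT_PATTERN`, `AMPERSAND_PATTERN`, and `tokenize` are my own, chosen for illustration only.

```python
import re

# Assumption: approximates Word Cloud's default token pattern.
# Check the installed version before relying on it.
DEFAULT_PATTERN = r"\w[\w']*"

# Hypothetical alternative that also allows "&" inside a token.
AMPERSAND_PATTERN = r"\w[\w'&]*"


def tokenize(text, pattern):
    """Lowercase the text and return every match of the pattern."""
    return re.findall(pattern, text.lower())


sample = "SMEs invest in R&D"

# "&" is not a word character, so "r&d" is split into "r" and "d".
default_tokens = tokenize(sample, DEFAULT_PATTERN)

# Allowing "&" inside a token keeps "r&d" in one piece.
ampersand_tokens = tokenize(sample, AMPERSAND_PATTERN)

print(default_tokens)
print(ampersand_tokens)
```

`WordCloud` accepts a `regexp` argument, so a pattern like `AMPERSAND_PATTERN` could presumably be passed as `wordcloud.WordCloud(regexp=AMPERSAND_PATTERN)`. I have not verified that this resolves the rendering above, and it does not explain the missing "smes" in the frequencies version.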
Versions
Windows-11-10.0.22631-SP0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]
NumPy 1.26.4
matplotlib 3.9.0
wordcloud 1.9.3