amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License
10.15k stars, 2.32k forks

Error when including space into regex #485

Open KonradHoeffner opened 5 years ago

KonradHoeffner commented 5 years ago

Description

word_cloud crashes when the regular expression includes a space.

Steps/Code to Reproduce

Create test.txt with:

common phrase|uncommon phrase|rare phrase
common phrase|uncommon phrase
common phrase
common phrase 

  1. wordcloud_cli --imagefile test.png --regexp "\w[\w]+" --text test.txt works fine.
  2. wordcloud_cli --imagefile test.png --regexp "\w[\w ]+" --text test.txt crashes with:

Traceback (most recent call last):
  File "/usr/bin/wordcloud_cli", line 11, in <module>
    load_entry_point('wordcloud==1.5.0', 'console_scripts', 'wordcloud_cli')()
  File "/usr/lib/python3.7/site-packages/wordcloud/__main__.py", line 33, in main
    wordcloud_cli_main(*wordcloud_cli_parse_args(sys.argv[1:]))
  File "/usr/lib/python3.7/site-packages/wordcloud/wordcloud_cli.py", line 89, in main
    wordcloud.generate(text)
  File "/usr/lib/python3.7/site-packages/wordcloud/wordcloud.py", line 605, in generate
    return self.generate_from_text(text)
  File "/usr/lib/python3.7/site-packages/wordcloud/wordcloud.py", line 586, in generate_from_text
    words = self.process_text(text)
  File "/usr/lib/python3.7/site-packages/wordcloud/wordcloud.py", line 563, in process_text
    word_counts = unigrams_and_bigrams(words, self.normalize_plurals)
  File "/usr/lib/python3.7/site-packages/wordcloud/tokenization.py", line 55, in unigrams_and_bigrams
    word1 = standard_form[bigram[0].lower()]
KeyError: 'common'
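
The difference between the two patterns can be seen without word_cloud at all. The following is a minimal sketch that assumes the pattern is applied to the text with `re.findall`; that is an assumption about the library's internals and is not visible in the traceback. The variable names are illustrative.

```python
import re

# Contents of test.txt from the report (the last line ends with a space).
text = (
    "common phrase|uncommon phrase|rare phrase\n"
    "common phrase|uncommon phrase\n"
    "common phrase\n"
    "common phrase \n"
)

# Without a space in the character class, every token is a single word.
single_word_tokens = re.findall(r"\w[\w]+", text)

# With a space in the character class, a token can span several words.
phrase_tokens = re.findall(r"\w[\w ]+", text)

print(single_word_tokens[:3])  # ['common', 'phrase', 'uncommon']
print(phrase_tokens[:3])       # ['common phrase', 'uncommon phrase', 'rare phrase']
```

With the second pattern the string 'common' never appears as a token on its own, which is consistent with the `KeyError: 'common'` above.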

KonradHoeffner commented 5 years ago

P.S.: I found a workaround: using the invisible character U+2000 (En Quad) instead of a space. However, my phrases now all have the same size.

amueller commented 5 years ago

Can you please explain the "same size" issue?

Can you try the development version of wordcloud? I think we came across this issue before, and I hope we fixed it. If not, I'll look into it.

twsl commented 4 years ago

I still get the same error. I tried the following regex to include compound nouns:

regexp = r"(?<=')(?:\w+.?\w*)(?=')|(?:\w[\w']+)"

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-11-3ef17eac3749> in <module>
     15                       random_state=42,
     16                       regexp=regexp,
---> 17                      ).generate(str(result))
     18 
     19 print(wordcloud)

~\.conda\envs\ml\lib\site-packages\wordcloud\wordcloud.py in generate(self, text)
    617         self
    618         """
--> 619         return self.generate_from_text(text)
    620 
    621     def _check_generated(self):

~\.conda\envs\ml\lib\site-packages\wordcloud\wordcloud.py in generate_from_text(self, text)
    598         self
    599         """
--> 600         words = self.process_text(text)
    601         self.generate_from_frequencies(words)
    602         return self

~\.conda\envs\ml\lib\site-packages\wordcloud\wordcloud.py in process_text(self, text)
    575 
    576         if self.collocations:
--> 577             word_counts = unigrams_and_bigrams(words, self.normalize_plurals)
    578         else:
    579             word_counts, _ = process_tokens(words, self.normalize_plurals)

~\.conda\envs\ml\lib\site-packages\wordcloud\tokenization.py in unigrams_and_bigrams(words, normalize_plurals)
     54         # collocation detection (30 is arbitrary):
     55         word1 = standard_form[bigram[0].lower()]
---> 56         word2 = standard_form[bigram[1].lower()]
     57 
     58         if score(count, counts[word1], counts[word2], n_words) > 30:

KeyError: 'testing'

amueller commented 4 years ago

Thanks for the report. I won't have time to work on this for now, but feel free to investigate and send a PR.

tirth78 commented 2 years ago

I would like to work on this issue. I am a Master's student at BITS Pilani. It would be really helpful if I could get approval from the owner/author.

amueller commented 2 years ago

@tirth78 sure, go for it!

matiaso commented 11 months ago

To make it work, you need to set collocations=False, because collocation detection assumes that spaces are used to separate words.
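
A minimal sketch of why that assumption breaks, using only the standard library. It assumes, based on the tracebacks in this thread, that collocation detection joins neighbouring tokens with a space and later splits the joined string on spaces to look each word up again. The names `pairwise`, `known_tokens`, and `missing` are illustrative and are not taken from word_cloud.

```python
from itertools import tee


def pairwise(iterable):
    """Yield consecutive pairs: (t0, t1), (t1, t2), ..."""
    first, second = tee(iterable)
    next(second, None)
    return zip(first, second)


# Tokens as produced by a pattern that allows spaces, e.g. "\w[\w ]+".
tokens = ["common phrase", "uncommon phrase", "rare phrase"]

# Stand-in for the lookup table keyed by lower-cased token (assumption).
known_tokens = {token.lower() for token in tokens}

# Join each pair with a space, then split on spaces again.
joined_bigrams = [" ".join(pair) for pair in pairwise(tokens)]
first_words = [bigram.split(" ")[0] for bigram in joined_bigrams]

# The split no longer recovers the original tokens, so the lookup would fail.
missing = [word for word in first_words if word not in known_tokens]
print(missing)  # ['common', 'uncommon']
```

Because the tokens themselves contain spaces, splitting yields fragments such as 'common' that were never counted as tokens, which matches the reported KeyError.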

KonradHoeffner commented 11 months ago

Passing collocations=False on the command line results in "error: unrecognized arguments: collocations=False"; however, --no_collocations works! I will keep this issue open, though, and let @amueller decide whether this counts as solved. Ideally, this option would be enabled automatically when the regular expression includes a space, rather than crashing. Tested with the newest version, 1.9.2, from the Arch Linux python-wordcloud package.

matiaso commented 11 months ago

Indeed: with the CLI it is --no_collocations, and with the library it is collocations=False. Thanks for the validation.
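
The CLI behaviour can be reproduced with a small `argparse` sketch. The parser below is a hypothetical stand-in, not the real wordcloud_cli parser; it assumes the flag is declared as a `store_false` option that writes to a `collocations` attribute.

```python
import argparse

# Hypothetical stand-in for the CLI parser; only the flag discussed here.
parser = argparse.ArgumentParser(prog="wordcloud_cli_sketch")
parser.add_argument(
    "--no_collocations",
    action="store_false",   # presence of the flag sets collocations to False
    dest="collocations",
)

# No flag: collocations stay enabled (store_false defaults to True).
default_args, _ = parser.parse_known_args([])

# With the flag: collocations are disabled.
flag_args, _ = parser.parse_known_args(["--no_collocations"])

# A Python-style keyword is not a known option and is left over;
# parse_args() would reject it as an unrecognized argument.
_, leftover = parser.parse_known_args(["collocations=False"])

print(default_args.collocations, flag_args.collocations, leftover)
```

In library code the same setting is passed as a keyword argument, as in `WordCloud(collocations=False)`.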