amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

relative_scaling=1 and repeat=True results in large space between words #522

Open changbowen opened 4 years ago

changbowen commented 4 years ago

Description

I'm new to word cloud. I want to generate one from a few strings and be able to customize the weight of each string. Following the documentation, I used generate_from_frequencies with relative_scaling=1 and repeat=True. However, this leaves large spaces between words. Removing relative_scaling=1 works fine, but then the font sizes do not match the weight / frequency values.

Steps/Code to Reproduce

import wordcloud

wc = wordcloud.WordCloud(
    relative_scaling=1,
    max_words=200,
    height=800,
    width=800,
    prefer_horizontal=1,
    repeat=True,
    ).generate_from_frequencies({
        'Web Hotel': 3,
        'NGINX': 1,
        'hosting': 1,
        'Apache': 1,
        'Docker': 1,
        'Kubernetes': 1,
        'IIS': 1,
        'Node.js': 1,
        'Tomcat': 1,
        'MySQL': 1,
        'MongoDB': 1,
        'SQL Server': 1,
        'PostgreSQL': 1,
    })

wc.to_file('test.png')

Expected Results

An image with words filled in tightly.

Actual Results

[image: test]

Versions

Linux-4.4.0-18362-Microsoft-x86_64-with-Ubuntu-18.04-bionic Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0] NumPy 1.18.1 matplotlib 3.1.2 wordcloud 1.6.0.post14+g1fc6868

amueller commented 4 years ago

Can you try increasing max_font_size? Sorry for the slow reply.

changbowen commented 4 years ago

I tried max_font_size=2000. The fonts are larger, but there is still a gap between words. Here is the generated test image: [image: test]

amueller commented 4 years ago

Thanks, I understand the issue now. I guess the current algorithm for repeat is a bit strange. I'm honestly not sure if repeat and relative_scaling=1 make sense together. Would you like all appearances of Web Hotel to have the same size?

Right now what happens is that it renders everything without repetition while respecting the relative frequencies. If there's space left, it then decreases the font size and starts over. That way it's impossible to fill the space with relative_scaling=1. Honestly I'm not sure there is a way to fill the whole space with these requirements. How would you do it?
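A rough sketch of the behaviour described above, in plain Python with no dependency on wordcloud. It is a simplified model, not the library's code: the function name, the `passes` count, and the `shrink` factor are hypothetical, chosen only to show that every pass keeps the frequency ratios while the base size drops.

```python
def sizes_with_repeat(frequencies, max_font_size, passes=3, shrink=0.5):
    """Simplified model of repeat=True combined with relative_scaling=1.

    `passes` and `shrink` are hypothetical knobs for illustration only;
    the library decides these internally from the space that is left.
    """
    top = max(frequencies.values())
    # Most frequent word first, as in a normal word cloud.
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    placed = []
    base = float(max_font_size)
    for _ in range(passes):
        # Render every word once, sized in proportion to its frequency.
        for word, freq in ordered:
            placed.append((word, base * freq / top))
        # Then lower the base size and start over with the full word list.
        base *= shrink
    return placed


for word, size in sizes_with_repeat({"Web Hotel": 3, "NGINX": 1}, 120, passes=2):
    print(f"{word}: {size:g}")
```

Because each pass starts again from the largest word, the gaps left by the first pass can only be filled by words that are small enough in a later pass, which is why the image does not pack tightly.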

changbowen commented 4 years ago

What I was trying to do is to "manually" specify some words to have higher "weight" (so they can appear larger) because I don't have a long source text for it to generate from. Is there any way to achieve that?

amueller commented 4 years ago

Well the question is what the size of the word with the highest weight should be when it's plotted the second time.

Let's say you only have two words, say 'two' and 'words', and you want 'two' to be twice as big as 'words'. If the original size of 'two' is 40, the size of 'words' will be 20. Do you now want 'two' to appear several times with size 40 and 'words' with size 20?

I guess that would be possible, though it would be quite different from what wordcloud is doing now. And you'll still end up with a bunch of holes, and the bigger the initial font is, the more holes you'd get.
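The fixed-size option from the two-word example could be sketched like this. It is only an illustration of the arithmetic; `fixed_size_repeats` and its parameters are made-up names, not part of wordcloud.

```python
def fixed_size_repeats(weights, top_size, copies):
    """Repeat every word `copies` times, always at the same size.

    Hypothetical helper: sizes are proportional to the weights and
    never shrink between repetitions.
    """
    top = max(weights.values())
    sizes = {word: top_size * weight / top for word, weight in weights.items()}
    return [(word, size) for _ in range(copies) for word, size in sizes.items()]


# 'two' has twice the weight of 'words', so it is drawn at twice the size.
print(fixed_size_repeats({"two": 2, "words": 1}, top_size=40, copies=2))
```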

I assumed you'd want the words to get progressively smaller, mostly because that's what wordcloud does right now. In that case it's unclear to me how to attach sizes to the first repeated word.

amueller commented 4 years ago

That would look like this: image

amueller commented 4 years ago

Or like this: image

I'm not sure if that's the desired outcome? But I feel like we should probably allow passing lists of frequencies again, which was supported at some point :-/

changbowen commented 4 years ago

Thanks for taking the time to test these. Your assumption is right: I do think the fonts should get progressively smaller and thus fill the holes. Something like this: Word Art

The image was generated with wordart.com, a powerful tool, but only when you pay :)

amueller commented 4 years ago

This doesn't preserve the ratios as you can see, so you can already get this ;)
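To make the ratio trade-off concrete, here is a small model of how relative_scaling could blend rank-based and frequency-based sizing. The interpolation below is an assumption written to match the documented endpoints (0 means only rank matters, 1 means a word twice as frequent is twice as large); `next_font_size` is a hypothetical name, not a wordcloud function.

```python
def next_font_size(prev_size, freq, prev_freq, relative_scaling):
    """Size of the next word given the previous word's size and frequency.

    Assumed model: linear blend between "keep the previous size"
    (relative_scaling=0) and "scale by the frequency ratio"
    (relative_scaling=1).
    """
    ratio = freq / prev_freq
    return prev_size * (relative_scaling * ratio + (1 - relative_scaling))


for rs in (0, 0.5, 1):
    # Previous word: size 40, frequency 2. Next word: frequency 1.
    print(rs, next_font_size(40, 1, 2, rs))
```

In this model only relative_scaling=1 keeps the 2:1 ratio. With lower values the sizes are free to shrink just enough to fit, which packs tightly but no longer reflects the weights exactly.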