how to get reproducible wordclouds (identical, except for image dimensions)

abubelinha commented 1 year ago

Description

I would like to reproduce identical wordcloud images (same word positioning & colors) at different image dimensions. I don't want to control word positioning or colors: I just want the resulting positions & colors to be repeatable across different runs of my script (or when someone else runs my code), at different image sizes.

I would expect changing width & height plus setting a random_state should be enough.

Or maybe I just misunderstood what random_state is intended for. I couldn't find it explained in the documentation.

Steps/Code to Reproduce

from wordcloud import WordCloud


def wordcloud_minimal_example(width=500, height=250):
    freqdict = {'Word1': 25, 'Word2': 34, 'Word3': 12,
                'Word4': 44, 'Word5': 34, 'Word6': 12,
                'Word7': 11, 'Word8': 15}
    # Same seed on every call, only the canvas size changes.
    wc = WordCloud(width=width, height=height, random_state=1)
    wc.generate_from_frequencies(freqdict)
    wc.to_file("wordcloud_{}x{}.png".format(width, height))


# Render the same frequencies at 200x100 and 300x150.
for factor in range(2, 4):
    width = 100 * factor
    height = 50 * factor
    wordcloud_minimal_example(width, height)

Expected Results

Two identical wordclouds, except for image width and height, i.e. the same result as if I first produced the big image and then used image processing software to reduce its dimensions.

Actual Results

For a given size, the same image output is achieved (same colors and positions as in previous script runs).

But comparing different sized images, their colors and positions are different.

Versions

Windows-7-6.1.7601-SP1
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)]
NumPy 1.22.4
matplotlib 3.6.3
wordcloud 1.9.1.1

amueller commented 1 year ago

If you want identical output, change scale, which should return identical results. If you run with different sizes, the results will differ because the font size calculations are mostly in absolute terms, not relative terms. max_font_size is the main factor determining the look of the image, and there's a heuristic that tries to adjust it, but depending on the value of relative_scaling, font_step is used, which also works in absolute terms. In other words, I made no attempt to make the layout stable with respect to changing the image size. If you simply want different sized outputs, use scale. It might be possible to get consistent output by changing all of the variables relating to font size in a consistent way, but I'm not sure what the purpose of that would be (by definition, if you succeed you'd get the same result as with scale).

Controlling random_state should give you consistent results with the same sizes, but if you change the input to the algorithm, the output changes accordingly.
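
A minimal sketch of the scale approach described above, assuming the frequencies from the original report. The helper names output_size and render_scaled are made up for illustration and are not part of wordcloud. The idea is to keep width, height and random_state fixed so the layout is computed once per the same inputs, and to vary only scale to change the size of the written image.

```python
def output_size(base_width, base_height, scale):
    """Pixel size of the rendered image: the layout canvas multiplied by scale."""
    return int(base_width * scale), int(base_height * scale)


def render_scaled(freqs, base_width=100, base_height=50, scales=(2, 3)):
    """Render the same layout at several output sizes by varying only `scale`.

    Hypothetical helper: width, height and random_state stay fixed,
    so the inputs to the layout step are the same for every scale.
    """
    # Imported here so output_size() can be used without wordcloud installed.
    from wordcloud import WordCloud

    paths = []
    for scale in scales:
        wc = WordCloud(width=base_width, height=base_height,
                       scale=scale, random_state=1)
        wc.generate_from_frequencies(freqs)
        out_w, out_h = output_size(base_width, base_height, scale)
        path = "wordcloud_{}x{}.png".format(out_w, out_h)
        wc.to_file(path)
        paths.append(path)
    return paths
```

Calling render_scaled(freqdict) with the dictionary from the report would write wordcloud_200x100.png and wordcloud_300x150.png, the same two sizes as the original loop. The layout is fitted on the base canvas (100x50 here), so a very small base may give a coarser fit; a larger base with smaller scale factors is an equally valid choice.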

abubelinha commented 1 year ago

Thanks for the explanation. I'll try that.

amueller / word_cloud