amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

Font construction slowing down the library #366

Open zhengpingzhou opened 6 years ago

zhengpingzhou commented 6 years ago

Thanks for developing such an amazing library.

I ran into a performance issue while using it for poster generation (1500 x 2400). Generating a poster with a mask takes about 30 seconds. After setting scale = 12 (the largest scale that still yields an acceptable result), it renders in about 4 seconds, which is still too slow for a web application with (probably impatient) users.

I profiled my code (shown below) with python -m cProfile --sort cumulative test.py >profile to show the most time-consuming functions and their callers:

import time

from PIL import Image
import jieba
import numpy as np
from wordcloud import WordCloud

inputs = open('words.txt').read().decode('utf-8')
words = '\n'.join([word for word in jieba.cut(inputs)])
scale = 12
mask = Image.open('mask.png')
width, height = mask.size
mask = np.array(mask.resize((int(mask.size[0] / float(scale)), int(mask.size[1] / float(scale)))))
wordcloud = WordCloud(scale=scale, background_color='white', width=int(width/float(scale)), height=int(height/float(scale)), margin=2, font_path='msyh.ttc', mask=mask).generate(words)
wordcloud.to_file('out.png')

... And got:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013    4.087    4.087 test.py:1(<module>)
      271    0.002    0.000    1.453    0.005 ImageFont.py:216(truetype)
      271    0.002    0.000    1.451    0.005 ImageFont.py:119(__init__)
      271    1.448    0.005    1.448    0.005 {PIL._imagingft.getfont}
        1    0.000    0.000    1.415    1.415 wordcloud.py:556(generate)
        1    0.000    0.000    1.415    1.415 wordcloud.py:535(generate_from_text)
      2/1    0.058    0.029    1.392    1.392 wordcloud.py:331(generate_from_frequencies)
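
For reference, a comparable report can be collected in-process with the standard-library cProfile and pstats modules. This is only a minimal sketch, assuming Python 3; `work` is a hypothetical stand-in for the word cloud script above, and `profile_cumulative` is a helper name made up for illustration.

```python
import cProfile
import io
import pstats


def work():
    # Hypothetical stand-in for the word cloud script being profiled.
    return sum(i * i for i in range(100000))


def profile_cumulative(func, limit=10):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" matches the --sort cumulative flag used on the command line.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_cumulative(work)
print(report)
```

Calling `print_callers()` on the same `pstats.Stats` object lists the callers of each function.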

I was surprised that font construction was so time-consuming. Looking at the source of generate_from_frequencies in wordcloud.py, the cost comes from this while-loop:

while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)

The loop first constructs a font object, then uses it to measure the text size, then checks whether a position is available. The problem is that even if no valid position is found, a font object still has to be constructed just to compute the text size (as required by draw.textsize). The run time is similar for fonts of different languages.

So I'm wondering whether there is a better way to get the text size in this loop. I believe this would be a worthwhile optimization.
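
One possible mitigation, which is not something the library is known to do in this thread, is to memoize font objects per (font_path, font_size) so that a repeated size is only loaded once. The sketch below assumes Python 3 (for functools.lru_cache); `make_font_cache` and `pillow_loader` are hypothetical helper names, and only `ImageFont.truetype` is an actual Pillow call.

```python
from functools import lru_cache


def make_font_cache(load_font, maxsize=256):
    """Wrap a font loader so each (path, size) pair is loaded only once."""

    @lru_cache(maxsize=maxsize)
    def cached(font_path, font_size):
        return load_font(font_path, font_size)

    return cached


def pillow_loader(font_path, font_size):
    # Imported lazily so the caching helper itself does not require Pillow.
    from PIL import ImageFont
    return ImageFont.truetype(font_path, font_size)


# Hypothetical drop-in for the ImageFont.truetype call inside the loop:
#     font = get_font(self.font_path, font_size)
get_font = make_font_cache(pillow_loader)
```

A cache like this only helps when the same size is requested more than once, for example when the loop retries a word in the other orientation or when several words end up at the same size. If almost every iteration uses a new size, the gain is likely to be small.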

amueller commented 6 years ago

Thanks for the analysis. Can you run lineprof to show the lines that actually take the time?

zhengpingzhou commented 6 years ago

This time I set scale = 1 (which takes more time) to make the analysis clearer:

  1. Output of cProfile:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013   24.774   24.774 test.py:1(<module>)
        1    0.000    0.000   20.865   20.865 wordcloud.py:556(generate)
        1    0.000    0.000   20.865   20.865 wordcloud.py:535(generate_from_text)
      2/1    0.777    0.388   20.842   20.842 wordcloud.py:331(generate_from_frequencies)
     2654    0.014    0.000   14.895    0.006 ImageFont.py:216(truetype)
     2654    0.024    0.000   14.881    0.006 ImageFont.py:119(__init__)
     2654   14.849    0.006   14.849    0.006 {PIL._imagingft.getfont}
  2. Output of line_profiler (this can be reproduced by first adding @profile to WordCloud.generate_from_frequencies and WordCloud.generate_from_text in wordcloud.py, then running python line_profiler/kernprof.py -v -l test.py):
Total time: 20.3123 s
File: /usr/local/lib/python2.7/dist-packages/wordcloud/wordcloud.py
Function: generate_from_text at line 536

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   536                                               @profile
   537                                               def generate_from_text(self, text):
   554         1       8353.0   8353.0      0.0          words = self.process_text(text)
   555         1   20303948.0 20303948.0    100.0          self.generate_from_frequencies(words)
   556         1          3.0      3.0      0.0          return self
Total time: 20.2322 s
File: /usr/local/lib/python2.7/dist-packages/wordcloud/wordcloud.py
Function: generate_from_frequencies at line 331

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   331                                               @profile
   332                                               def generate_from_frequencies(self, frequencies, max_font_size=None):
   333                                                   """Create a word_cloud from words and frequencies.
   334
   335                                                   Parameters
   336                                                   ----------
   337                                                   frequencies : dict from string to float
   338                                                       A contains words and associated frequency.
   339
   340                                                   max_font_size : int
   341                                                       Use this font-size instead of self.max_font_size
   342
   343                                                   Returns
   344                                                   -------
   345                                                   self
   346
   347                                                   """
   348                                                   # make sure frequencies are sorted and normalized
   349         2         65.0     32.5      0.0          frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
   350         2          4.0      2.0      0.0          if len(frequencies) <= 0:
   351                                                       raise ValueError("We need at least 1 word to plot a word cloud, "
   352                                                                        "got %d." % len(frequencies))
   353         2          5.0      2.5      0.0          frequencies = frequencies[:self.max_words]
   354
   355                                                   # largest entry will be 1
   356         2          5.0      2.5      0.0          max_frequency = float(frequencies[0][1])
   357
   358         2          3.0      1.5      0.0          frequencies = [(word, freq / max_frequency)
   359       204        262.0      1.3      0.0                         for word, freq in frequencies]
   360
   361         2          2.0      1.0      0.0          if self.random_state is not None:
   362                                                       random_state = self.random_state
   363                                                   else:
   364         2       1095.0    547.5      0.0              random_state = Random()
   365
   366         2          4.0      2.0      0.0          if self.mask is not None:
   367         2          3.0      1.5      0.0              mask = self.mask
   368         2          8.0      4.0      0.0              width = mask.shape[1]
   369         2          4.0      2.0      0.0              height = mask.shape[0]
   370         2          6.0      3.0      0.0              if mask.dtype.kind == 'f':
   371                                                           warnings.warn("mask image should be unsigned byte between 0"
   372                                                                         " and 255. Got a float array")
   373         2          3.0      1.5      0.0              if mask.ndim == 2:
   374                                                           boolean_mask = mask == 255
   375         2          2.0      1.0      0.0              elif mask.ndim == 3:
   376                                                           # if all channels are white, mask out
   377         2      85074.0  42537.0      0.4                  boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
   378                                                       else:
   379                                                           raise ValueError("Got mask of invalid shape: %s"
   380                                                                            % str(mask.shape))
   381                                                   else:
   382                                                       boolean_mask = None
   383                                                       height, width = self.height, self.width
   384         2     245183.0 122591.5      1.2          occupancy = IntegralOccupancyMap(height, width, boolean_mask)
   385
   386                                                   # create image
   387         2        754.0    377.0      0.0          img_grey = Image.new("L", (width, height))
   388         2         70.0     35.0      0.0          draw = ImageDraw.Draw(img_grey)
   389         2       2131.0   1065.5      0.0          img_array = np.asarray(img_grey)
   390         2          6.0      3.0      0.0          font_sizes, positions, orientations, colors = [], [], [], []
   391
   392         2          3.0      1.5      0.0          last_freq = 1.
   393
   394         2          4.0      2.0      0.0          if max_font_size is None:
   395                                                       # if not provided use default font_size
   396         1          3.0      3.0      0.0              max_font_size = self.max_font_size
   397
   398         2          3.0      1.5      0.0          if max_font_size is None:
   399                                                       # figure out a good font size by trying to draw with
   400                                                       # just the first two words
   401         1          2.0      2.0      0.0              if len(frequencies) == 1:
   402                                                           # we only have one word. We make it big!
   403                                                           font_size = self.height
   404                                                       else:
   405         1          6.0      6.0      0.0                  self.generate_from_frequencies(dict(frequencies[:2]),
   406         1          7.0      7.0      0.0                                                 max_font_size=self.height)
   407                                                           # find font sizes
   408         3          8.0      2.7      0.0                  sizes = [x[1] for x in self.layout_]
   409         1          1.0      1.0      0.0                  try:
   410         1          2.0      2.0      0.0                      font_size = int(2 * sizes[0] * sizes[1]
   411         1          4.0      4.0      0.0                                      / (sizes[0] + sizes[1]))
   412                                                           # quick fix for if self.layout_ contains less than 2 values
   413                                                           # on very small images it can be empty
   414                                                           except IndexError:
   415                                                               try:
   416                                                                   font_size = sizes[0]
   417                                                               except IndexError:
   418                                                                   raise ValueError('canvas size is too small')
   419                                                   else:
   420         1          2.0      2.0      0.0              font_size = max_font_size
   421
   422                                                   # we set self.words_ here because we called generate_from_frequencies
   423                                                   # above... hurray for good design?
   424         2         38.0     19.0      0.0          self.words_ = dict(frequencies)
   425
   426                                                   # start drawing grey image
   427       204        453.0      2.2      0.0          for word, freq in frequencies:
   428                                                       # select the font size
   429       202        440.0      2.2      0.0              rs = self.relative_scaling
   430       202        451.0      2.2      0.0              if rs != 0:
   431       202       1126.0      5.6      0.0                  font_size = int(round((rs * (freq / float(last_freq))
   432       202        945.0      4.7      0.0                                         + (1 - rs)) * font_size))
   433       202        707.0      3.5      0.0              if random_state.random() < self.prefer_horizontal:
   434       189        250.0      1.3      0.0                  orientation = None
   435                                                       else:
   436        13         48.0      3.7      0.0                  orientation = Image.ROTATE_90
   437       202        312.0      1.5      0.0              tried_other_orientation = False
   438      2461       3809.0      1.5      0.0              while True:
   439                                                           # try to find a position
   440      2461   13237443.0   5378.9     65.4                  font = ImageFont.truetype(self.font_path, font_size)
   441                                                           # transpose font optionally
   442      2461      12059.0      4.9      0.1                  transposed_font = ImageFont.TransposedFont(
   443      2461     503452.0    204.6      2.5                      font, orientation=orientation)
   444                                                           # get size of resulting text
   445      2461     589085.0    239.4      2.9                  box_size = draw.textsize(word, font=transposed_font)
   446                                                           # find possible places using integral image:
   447      2461       7461.0      3.0      0.0                  result = occupancy.sample_position(box_size[1] + self.margin,
   448      2461       3702.0      1.5      0.0                                                     box_size[0] + self.margin,
   449      2461    1774825.0    721.2      8.8                                                     random_state)
   450      2461       5002.0      2.0      0.0                  if result is not None or font_size < self.min_font_size:
   451                                                               # either we found a place or font-size went too small
   452       202        302.0      1.5      0.0                      break
   453                                                           # if we didn't find a place, make font smaller
   454                                                           # but first try to rotate!
   455      2259       4179.0      1.8      0.0                  if not tried_other_orientation and self.prefer_horizontal < 1:
   456        25         56.0      2.2      0.0                      orientation = (Image.ROTATE_90 if orientation is None else
   457         1          2.0      2.0      0.0                                     Image.ROTATE_90)
   458        25         42.0      1.7      0.0                      tried_other_orientation = True
   459                                                           else:
   460      2234       4025.0      1.8      0.0                      font_size -= self.font_step
   461      2234       3322.0      1.5      0.0                      orientation = None
   462
   463       202        517.0      2.6      0.0              if font_size < self.min_font_size:
   464                                                           # we were unable to draw any more
   465                                                           break
   466
   467       202       7432.0     36.8      0.0              x, y = np.array(result) + self.margin // 2
   468                                                       # actually draw the text
   469       202      57098.0    282.7      0.3              draw.text((y, x), word, fill="white", font=transposed_font)
   470       202        581.0      2.9      0.0              positions.append((x, y))
   471       202        355.0      1.8      0.0              orientations.append(orientation)
   472       202        313.0      1.5      0.0              font_sizes.append(font_size)
   473       202        387.0      1.9      0.0              colors.append(self.color_func(word, font_size=font_size,
   474       202        283.0      1.4      0.0                                            position=(x, y),
   475       202        260.0      1.3      0.0                                            orientation=orientation,
   476       202        263.0      1.3      0.0                                            random_state=random_state,
   477       202      28027.0    138.7      0.1                                            font_path=self.font_path))
   478                                                       # recompute integral image
   479       202        434.0      2.1      0.0              if self.mask is None:
   480                                                           img_array = np.asarray(img_grey)
   481                                                       else:
   482       202     368152.0   1822.5      1.8                  img_array = np.asarray(img_grey) + boolean_mask
   483                                                       # recompute bottom right
   484                                                       # the order of the cumsum's is important for speed ?!
   485       202    3279129.0  16233.3     16.2              occupancy.update(img_array, x, y)
   486       202        694.0      3.4      0.0              last_freq = freq
   487
   488         2          3.0      1.5      0.0          self.layout_ = list(zip(frequencies, font_sizes, positions,
   489         2         46.0     23.0      0.0                                  orientations, colors))
   490         2          4.0      2.0      0.0          return self

The line font = ImageFont.truetype(self.font_path, font_size) in the while-loop above takes about 65% of the run time of generate_from_frequencies, making it the major time cost in this case.
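
As a quick check of that figure, the share and the per-call cost can be recomputed from the line_profiler output. The sketch below copies three numbers from the table and assumes times are in microseconds, which is consistent with the reported totals.

```python
# Figures copied from the line_profiler output for line 440
# (times in microseconds, line_profiler's default timer unit).
truetype_time_us = 13237443.0
truetype_hits = 2461
total_time_s = 20.2322  # "Total time" for generate_from_frequencies

# Fraction of the function's run time spent constructing fonts.
share = truetype_time_us / (total_time_s * 1e6)
# Average cost of a single ImageFont.truetype call, in milliseconds.
per_call_ms = truetype_time_us / truetype_hits / 1000.0

print("share of run time: %.1f%%" % (100 * share))
print("per call: %.2f ms" % per_call_ms)
```

The per-call cost of roughly 5.4 ms is in line with the 0.006 s per call that cProfile reports for PIL._imagingft.getfont.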

amueller commented 6 years ago

It looks like most of this time is actually spent in getfont in https://github.com/python-pillow/Pillow/blob/4936b447f004446e309291901b0779528f7d94d6/src/_imagingft.c. That makes it harder to profile or refactor. I thought we might be able to avoid rereading the file and just change the size, but I don't see how to do that. We could try to implement the textsize operation ourselves in Cython, but that seems a bit painful. Feel free to try it, though.
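
Short of changing Pillow, the number of getfont calls can also be reduced from the caller's side, since the loop shrinks the font by self.font_step after each failed attempt. The following is a rough model only: `estimated_font_loads` is a hypothetical helper, it ignores orientation retries, and the sizes 2400 and 400 are illustrative values rather than measurements.

```python
import math


def estimated_font_loads(start_size, fitting_size, font_step=1):
    """Estimate ImageFont.truetype calls for one word in the shrink loop.

    Simplified model: the loop tries start_size, then subtracts font_step
    until the size is <= fitting_size. Orientation retries are ignored.
    """
    if font_step < 1:
        raise ValueError("font_step must be at least 1")
    if start_size <= fitting_size:
        return 1
    steps = int(math.ceil((start_size - fitting_size) / float(font_step)))
    return steps + 1


# Illustrative numbers: start at the poster height, first fit at size 400.
for step in (1, 2, 4, 8):
    print(step, estimated_font_loads(2400, 400, font_step=step))
```

In WordCloud terms this corresponds to passing a larger font_step, which the library documents as possibly faster but with a worse fit. Passing an explicit max_font_size should also skip the initial sizing pass that starts from the image height, as the source listing above shows.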

mikesmith1611 commented 6 years ago

@Paula15 Have you managed to get any improvement on this?

amueller commented 6 years ago

I don't think so, please feel free to jump in.
