Open zhengpingzhou opened 6 years ago
Thanks for the analysis. Can you run lineprof to show the lines that actually take the time?
Sent from phone. Please excuse spelling and brevity.
On Thu, Apr 26, 2018, 00:02 Paula15 notifications@github.com wrote:
Thanks for developing such an amazing library.
I encountered some performance issues. I'm using it for poster generation (1500 x 2400). It takes about 30 seconds to generate a poster with a mask. After setting scale = 12 (the maximum scale that yields an acceptable result), it renders in 4 seconds, which is still too slow for a web application with (probably impatient) users.
I profiled my code (shown below) with python -m cProfile --sort cumulative test.py >profile to show the most time-consuming functions and their callers:
    import time
    from PIL import Image
    import jieba
    import numpy as np
    from wordcloud import WordCloud

    inputs = open('words.txt').read().decode('utf-8')
    words = '\n'.join([word for word in jieba.cut(inputs)])
    scale = 12
    mask = Image.open('mask.png')
    width, height = mask.size
    mask = np.array(mask.resize((int(mask.size[0] / float(scale)),
                                 int(mask.size[1] / float(scale)))))
    wordcloud = WordCloud(scale=scale, background_color='white',
                          width=int(width/float(scale)),
                          height=int(height/float(scale)),
                          margin=2, font_path='msyh.ttc',
                          mask=mask).generate(words)
    wordcloud.to_file('out.png')
... And got:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 4.087 4.087 test.py:1(<module>)
271 0.002 0.000 1.453 0.005 ImageFont.py:216(truetype)
271 0.002 0.000 1.451 0.005 ImageFont.py:119(__init__)
271 1.448 0.005 1.448 0.005 {PIL._imagingft.getfont}
1 0.000 0.000 1.415 1.415 wordcloud.py:556(generate)
1 0.000 0.000 1.415 1.415 wordcloud.py:535(generate_from_text)
2/1 0.058 0.029 1.392 1.392 wordcloud.py:331(generate_from_frequencies)
I was surprised that font construction was so time-consuming. The source of generate_from_frequencies in wordcloud.py contains this while-loop:
    while True:
        # try to find a position
        font = ImageFont.truetype(self.font_path, font_size)
        # transpose font optionally
        transposed_font = ImageFont.TransposedFont(
            font, orientation=orientation)
        # get size of resulting text
        box_size = draw.textsize(word, font=transposed_font)
        # find possible places using integral image:
        result = occupancy.sample_position(box_size[1] + self.margin,
                                           box_size[0] + self.margin,
                                           random_state)
It first constructs a font object, then uses it to measure the text, then checks whether there is a possible location. The problem is that even if no valid location is found, we still have to construct a font object to compute the text size (as required by the draw.textsize function). The run time tends to be similar for fonts of different languages.
So, I'm wondering if there is a better way to get the text size in this loop. I believe this would be a worthwhile optimization.
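One way to avoid repeated font construction, sketched here under assumptions, is to cache font objects by (path, size). The helper name make_font_cache is hypothetical and not part of wordcloud or Pillow; with Pillow the loader passed in would be ImageFont.truetype. A stand-in loader is used so the sketch runs without a font file. Note that within a single layout run the font size mostly shrinks step by step, so a cache like this is likely to pay off mainly when the same font is reused across many renders, as in a web application.

```python
from functools import lru_cache


def make_font_cache(load_font, maxsize=512):
    """Wrap a font loader so each (path, size) pair is loaded at most once.

    `load_font` is any callable taking (font_path, font_size). With Pillow
    it would be ImageFont.truetype. This helper is hypothetical.
    """
    @lru_cache(maxsize=maxsize)
    def get_font(font_path, font_size):
        return load_font(font_path, font_size)

    return get_font


# Stand-in loader so the sketch runs without a font file. It records each
# call so we can see how often the expensive step actually happens.
calls = []


def fake_load_font(font_path, font_size):
    calls.append((font_path, font_size))
    return ("font", font_path, font_size)


get_font = make_font_cache(fake_load_font)

# Simulate a shrinking-size search that runs twice with the same font file,
# as would happen when two posters are rendered one after the other.
for _ in range(2):
    for size in range(40, 30, -1):
        get_font("msyh.ttc", size)

print(len(calls))  # 10 loads instead of 20
```

Whether cached font objects can be shared safely between threads or requests is not something this thread establishes, so that would need checking before use in a server.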
This time I set scale = 1 (which takes more time) to make the analysis clearer:
cProfile:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 24.774 24.774 test.py:1(<module>)
1 0.000 0.000 20.865 20.865 wordcloud.py:556(generate)
1 0.000 0.000 20.865 20.865 wordcloud.py:535(generate_from_text)
2/1 0.777 0.388 20.842 20.842 wordcloud.py:331(generate_from_frequencies)
2654 0.014 0.000 14.895 0.006 ImageFont.py:216(truetype)
2654 0.024 0.000 14.881 0.006 ImageFont.py:119(__init__)
2654 14.849 0.006 14.849 0.006 {PIL._imagingft.getfont}
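For reference, the same kind of report, including the callers of a hot function, can also be produced from inside a script with the standard library's cProfile and pstats modules. This is a minimal Python 3 sketch; slow_leaf and layout_loop are made-up stand-ins for the font construction and the layout loop, not functions from wordcloud.

```python
import cProfile
import io
import pstats


def slow_leaf():
    # Stand-in for an expensive call such as font construction.
    return sum(i * i for i in range(20000))


def layout_loop():
    # Stand-in for a loop that repeats the expensive call.
    return [slow_leaf() for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
layout_loop()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative")
stats.print_stats(5)              # top entries by cumulative time
stats.print_callers("slow_leaf")  # which functions called slow_leaf
report = buffer.getvalue()
print(report)
```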
line_profiler:
(This can be reproduced by first adding @profile to WordCloud.generate_from_frequencies and generate_from_text in wordcloud.py, then running python line_profiler/kernprof.py -v -l test.py.)
Total time: 20.3123 s
File: /usr/local/lib/python2.7/dist-packages/wordcloud/wordcloud.py
Function: generate_from_text at line 536
Line # Hits Time Per Hit % Time Line Contents
==============================================================
536 @profile
537 def generate_from_text(self, text):
554 1 8353.0 8353.0 0.0 words = self.process_text(text)
555 1 20303948.0 20303948.0 100.0 self.generate_from_frequencies(words)
556 1 3.0 3.0 0.0 return self
Total time: 20.2322 s
File: /usr/local/lib/python2.7/dist-packages/wordcloud/wordcloud.py
Function: generate_from_frequencies at line 331
Line # Hits Time Per Hit % Time Line Contents
==============================================================
331 @profile
332 def generate_from_frequencies(self, frequencies, max_font_size=None):
333 """Create a word_cloud from words and frequencies.
334
335 Parameters
336 ----------
337 frequencies : dict from string to float
338 A contains words and associated frequency.
339
340 max_font_size : int
341 Use this font-size instead of self.max_font_size
342
343 Returns
344 -------
345 self
346
347 """
348 # make sure frequencies are sorted and normalized
349 2 65.0 32.5 0.0 frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
350 2 4.0 2.0 0.0 if len(frequencies) <= 0:
351 raise ValueError("We need at least 1 word to plot a word cloud, "
352 "got %d." % len(frequencies))
353 2 5.0 2.5 0.0 frequencies = frequencies[:self.max_words]
354
355 # largest entry will be 1
356 2 5.0 2.5 0.0 max_frequency = float(frequencies[0][1])
357
358 2 3.0 1.5 0.0 frequencies = [(word, freq / max_frequency)
359 204 262.0 1.3 0.0 for word, freq in frequencies]
360
361 2 2.0 1.0 0.0 if self.random_state is not None:
362 random_state = self.random_state
363 else:
364 2 1095.0 547.5 0.0 random_state = Random()
365
366 2 4.0 2.0 0.0 if self.mask is not None:
367 2 3.0 1.5 0.0 mask = self.mask
368 2 8.0 4.0 0.0 width = mask.shape[1]
369 2 4.0 2.0 0.0 height = mask.shape[0]
370 2 6.0 3.0 0.0 if mask.dtype.kind == 'f':
371 warnings.warn("mask image should be unsigned byte between 0"
372 " and 255. Got a float array")
373 2 3.0 1.5 0.0 if mask.ndim == 2:
374 boolean_mask = mask == 255
375 2 2.0 1.0 0.0 elif mask.ndim == 3:
376 # if all channels are white, mask out
377 2 85074.0 42537.0 0.4 boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
378 else:
379 raise ValueError("Got mask of invalid shape: %s"
380 % str(mask.shape))
381 else:
382 boolean_mask = None
383 height, width = self.height, self.width
384 2 245183.0 122591.5 1.2 occupancy = IntegralOccupancyMap(height, width, boolean_mask)
385
386 # create image
387 2 754.0 377.0 0.0 img_grey = Image.new("L", (width, height))
388 2 70.0 35.0 0.0 draw = ImageDraw.Draw(img_grey)
389 2 2131.0 1065.5 0.0 img_array = np.asarray(img_grey)
390 2 6.0 3.0 0.0 font_sizes, positions, orientations, colors = [], [], [], []
391
392 2 3.0 1.5 0.0 last_freq = 1.
393
394 2 4.0 2.0 0.0 if max_font_size is None:
395 # if not provided use default font_size
396 1 3.0 3.0 0.0 max_font_size = self.max_font_size
397
398 2 3.0 1.5 0.0 if max_font_size is None:
399 # figure out a good font size by trying to draw with
400 # just the first two words
401 1 2.0 2.0 0.0 if len(frequencies) == 1:
402 # we only have one word. We make it big!
403 font_size = self.height
404 else:
405 1 6.0 6.0 0.0 self.generate_from_frequencies(dict(frequencies[:2]),
406 1 7.0 7.0 0.0 max_font_size=self.height)
407 # find font sizes
408 3 8.0 2.7 0.0 sizes = [x[1] for x in self.layout_]
409 1 1.0 1.0 0.0 try:
410 1 2.0 2.0 0.0 font_size = int(2 * sizes[0] * sizes[1]
411 1 4.0 4.0 0.0 / (sizes[0] + sizes[1]))
412 # quick fix for if self.layout_ contains less than 2 values
413 # on very small images it can be empty
414 except IndexError:
415 try:
416 font_size = sizes[0]
417 except IndexError:
418 raise ValueError('canvas size is too small')
419 else:
420 1 2.0 2.0 0.0 font_size = max_font_size
421
422 # we set self.words_ here because we called generate_from_frequencies
423 # above... hurray for good design?
424 2 38.0 19.0 0.0 self.words_ = dict(frequencies)
425
426 # start drawing grey image
427 204 453.0 2.2 0.0 for word, freq in frequencies:
428 # select the font size
429 202 440.0 2.2 0.0 rs = self.relative_scaling
430 202 451.0 2.2 0.0 if rs != 0:
431 202 1126.0 5.6 0.0 font_size = int(round((rs * (freq / float(last_freq))
432 202 945.0 4.7 0.0 + (1 - rs)) * font_size))
433 202 707.0 3.5 0.0 if random_state.random() < self.prefer_horizontal:
434 189 250.0 1.3 0.0 orientation = None
435 else:
436 13 48.0 3.7 0.0 orientation = Image.ROTATE_90
437 202 312.0 1.5 0.0 tried_other_orientation = False
438 2461 3809.0 1.5 0.0 while True:
439 # try to find a position
440 2461 13237443.0 5378.9 65.4 font = ImageFont.truetype(self.font_path, font_size)
441 # transpose font optionally
442 2461 12059.0 4.9 0.1 transposed_font = ImageFont.TransposedFont(
443 2461 503452.0 204.6 2.5 font, orientation=orientation)
444 # get size of resulting text
445 2461 589085.0 239.4 2.9 box_size = draw.textsize(word, font=transposed_font)
446 # find possible places using integral image:
447 2461 7461.0 3.0 0.0 result = occupancy.sample_position(box_size[1] + self.margin,
448 2461 3702.0 1.5 0.0 box_size[0] + self.margin,
449 2461 1774825.0 721.2 8.8 random_state)
450 2461 5002.0 2.0 0.0 if result is not None or font_size < self.min_font_size:
451 # either we found a place or font-size went too small
452 202 302.0 1.5 0.0 break
453 # if we didn't find a place, make font smaller
454 # but first try to rotate!
455 2259 4179.0 1.8 0.0 if not tried_other_orientation and self.prefer_horizontal < 1:
456 25 56.0 2.2 0.0 orientation = (Image.ROTATE_90 if orientation is None else
457 1 2.0 2.0 0.0 Image.ROTATE_90)
458 25 42.0 1.7 0.0 tried_other_orientation = True
459 else:
460 2234 4025.0 1.8 0.0 font_size -= self.font_step
461 2234 3322.0 1.5 0.0 orientation = None
462
463 202 517.0 2.6 0.0 if font_size < self.min_font_size:
464 # we were unable to draw any more
465 break
466
467 202 7432.0 36.8 0.0 x, y = np.array(result) + self.margin // 2
468 # actually draw the text
469 202 57098.0 282.7 0.3 draw.text((y, x), word, fill="white", font=transposed_font)
470 202 581.0 2.9 0.0 positions.append((x, y))
471 202 355.0 1.8 0.0 orientations.append(orientation)
472 202 313.0 1.5 0.0 font_sizes.append(font_size)
473 202 387.0 1.9 0.0 colors.append(self.color_func(word, font_size=font_size,
474 202 283.0 1.4 0.0 position=(x, y),
475 202 260.0 1.3 0.0 orientation=orientation,
476 202 263.0 1.3 0.0 random_state=random_state,
477 202 28027.0 138.7 0.1 font_path=self.font_path))
478 # recompute integral image
479 202 434.0 2.1 0.0 if self.mask is None:
480 img_array = np.asarray(img_grey)
481 else:
482 202 368152.0 1822.5 1.8 img_array = np.asarray(img_grey) + boolean_mask
483 # recompute bottom right
484 # the order of the cumsum's is important for speed ?!
485 202 3279129.0 16233.3 16.2 occupancy.update(img_array, x, y)
486 202 694.0 3.4 0.0 last_freq = freq
487
488 2 3.0 1.5 0.0 self.layout_ = list(zip(frequencies, font_sizes, positions,
489 2 46.0 23.0 0.0 orientations, colors))
490 2 4.0 2.0 0.0 return self
The line font = ImageFont.truetype(self.font_path, font_size) in the while-loop above takes about 65% of the run time of generate_from_frequencies, which makes it the major time cost in this case.
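The percentages can be checked against the raw numbers in the report. The sketch below assumes the Time column is in microseconds (line_profiler's default timer unit; the unit line is not shown in the output above) and uses the reported total of 20.2322 s.

```python
# Values copied from the line_profiler report for generate_from_frequencies.
# Assumption: the Time column is in microseconds.
total_us = 20.2322 * 1e6

# report line -> (hits, time in microseconds)
lines = {
    "line 440, ImageFont.truetype": (2461, 13237443.0),
    "line 449, occupancy.sample_position": (2461, 1774825.0),
    "line 485, occupancy.update": (202, 3279129.0),
}

shares = {}
for name, (hits, time_us) in lines.items():
    shares[name] = round(100 * time_us / total_us, 1)
    per_hit_ms = time_us / hits / 1000
    print(f"{name}: {shares[name]}% of total, {per_hit_ms:.2f} ms per hit")
```

So each font construction costs roughly 5.4 ms, and it is the 2461 repetitions that add up to about 13.2 s. The next largest items are occupancy.update (16.2%) and the sample_position call (8.8%).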
It looks like most of this time is actually spent in getfont in https://github.com/python-pillow/Pillow/blob/4936b447f004446e309291901b0779528f7d94d6/src/_imagingft.c. That makes it harder to profile or refactor. I thought we might be able to avoid rereading the file and just change the size, but I don't see how to do that. We could try to implement the textsize operation ourselves in Cython, but that seems a bit painful. Feel free to try it, though.
@Paula15 Have you managed to get any improvement on this?
I don't think so, please feel free to jump in.
Sent from phone. Please excuse spelling and brevity.
On Tue, Sep 4, 2018, 06:17 Michael Smith notifications@github.com wrote: