fukuball / jieba-php

"結巴" (Jieba, Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.
http://jieba-php.fukuball.com
MIT License
1.32k stars, 260 forks

Add a cache to DAG processing to improve performance #28

Closed tianhe1986 closed 7 years ago

tianhe1986 commented 7 years ago

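This PR tries to improve performance in two ways:

1. For phrases that hit in self::$trie, the end-related comparison no longer has to be repeated.
2. For phrases that miss in self::$trie and occur more than once, from the second occurrence on there is no need to call MultiArray::get and then recurse through MultiArray::getValue.

Because the cache is derived from self::$trie, it has to be cleared whenever self::$trie is modified.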
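The idea above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the class `CachedTrie`, its methods, and the `walks` counter are hypothetical names, and the nested-array trie only stands in for self::$trie. It assumes PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical stand-in for a trie with a lookup cache; not jieba-php's actual code.
final class CachedTrie
{
    /** Nested array keyed by character; the key 'end' marks a complete word. */
    private array $trie = [];
    /** Memoized lookup results, for hits and misses alike. */
    private array $cache = [];
    /** Number of real trie walks, exposed only to make the effect visible. */
    public int $walks = 0;

    /** Split a UTF-8 string into characters without needing mbstring. */
    private static function chars(string $s): array
    {
        return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    }

    public function addWord(string $word): void
    {
        $node = &$this->trie;
        foreach (self::chars($word) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['end'] = true;
        unset($node);
        // The cache is derived from the trie, so any change invalidates it.
        $this->cache = [];
    }

    /** Returns ['exists' => is a prefix in the trie, 'end' => is a complete word]. */
    public function lookup(string $fragment): array
    {
        if (isset($this->cache[$fragment])) {
            return $this->cache[$fragment]; // repeated hit or miss: no walk needed
        }
        $this->walks++;
        $node = $this->trie;
        $exists = true;
        foreach (self::chars($fragment) as $ch) {
            if (!isset($node[$ch])) {
                $exists = false;
                break;
            }
            $node = $node[$ch];
        }
        $result = ['exists' => $exists, 'end' => $exists && isset($node['end'])];
        return $this->cache[$fragment] = $result;
    }
}

$trie = new CachedTrie();
$trie->addWord('是否');
$trie->lookup('是否和'); // walks the trie and caches the miss
$trie->lookup('是否和'); // answered from the cache
echo $trie->walks, "\n"; // prints 1
```

Caching misses as well as hits is what makes the second point pay off: a fragment that is not in the dictionary but recurs in the text costs one walk in total.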

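I tested this in a browser with the code below. Compared with the previous version, it is roughly 20% faster. For example, the phrase "是否和" appears 8 times in lyric.txt, so it is now processed faster than before.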

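// Benchmark: extract the top keywords from lyric.txt 100 times.
$top_k = 10;
// file_get_contents() takes no mode argument, so the stray "r" is dropped.
$content = file_get_contents(dirname(__FILE__)."/../src/dict/lyric.txt");
$t1 = microtime(true);
for ($i = 0; $i < 100; $i++) {
    // Clear the DAG cache so that every run starts cold.
    Jieba::$dag_cache = array();
    $tags = JiebaAnalyse::extractTags($content, $top_k);
}
$t2 = microtime(true);
// Total elapsed wall-clock time in seconds.
echo ($t2 - $t1)."<br>\n";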
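To turn two such timings into a percentage like the "roughly 20%" above, a small helper along these lines could be used. This is only a sketch: `timeRuns` and `percentImprovement` are hypothetical names, not part of jieba-php, and the numbers in the usage line are made up to show the arithmetic, not measurements.

```php
<?php
/** Wall-clock seconds taken by $runs calls of $fn. */
function timeRuns(callable $fn, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

/** Relative speed-up of $after over $before, in percent. */
function percentImprovement(float $before, float $after): float
{
    if ($before <= 0.0) {
        throw new InvalidArgumentException('baseline time must be positive');
    }
    return ($before - $after) / $before * 100.0;
}

// Illustrative numbers only: 10 s before, 8 s after.
printf("%.1f%%\n", percentImprovement(10.0, 8.0)); // prints 20.0%
```

In practice `$before` would come from `timeRuns()` on the master branch and `$after` from the same call on this branch, with the same input file and run count.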

I'm not sure whether this will help with #27.

coveralls commented 7 years ago


Coverage increased (+0.2%) to 64.029% when pulling 35956ce04b702de3a7787246db13a2383474b18b on tianhe1986:develop_dag_cache into 26f4b643301901ccada07014e21631364835ae01 on fukuball:master.

codecov-io commented 7 years ago

Codecov Report

Merging #28 into master will increase coverage by 0.18%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
+ Coverage   64.07%   64.26%   +0.18%     
==========================================
  Files           5        5              
  Lines        1119     1122       +3     
==========================================
+ Hits          717      721       +4     
+ Misses        402      401       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/class/Jieba.php | 93.55% <100%> (+0.41%) | :arrow_up: |
| src/class/Posseg.php | 83.84% <0%> (-0.09%) | :arrow_down: |
| src/class/Finalseg.php | 98.42% <0%> (-0.06%) | :arrow_down: |
| src/class/JiebaAnalyse.php | 100% <0%> (ø) | :arrow_up: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 26f4b64...35956ce.

fukuball commented 7 years ago

@tianhe1986 Thank you so much, this should help at least somewhat!