apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Why is Kuromoji tokenization throughput bimodal? [LUCENE-9457] #10497

Open asfimport opened 3 years ago

asfimport commented 3 years ago

With the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we added new nightly Lucene benchmarks to measure tokenization throughput for JapaneseTokenizer: https://home.apache.org/~mikemccand/lucenebench/analyzers.html

It has already been running for ~5-6 weeks now!  But for some reason, it looks bi-modal?  "Normally" it is ~.45 M tokens/sec, but for two data points it dropped down to ~.33 M tokens/sec, which is odd.  It could be hotspot noise maybe?  But it would be good to get to the root cause and fix it if possible.

Hotspot noise that randomly steals ~27% of your tokenization throughput is no good!!
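For reference, the ~27% figure follows from the two throughput levels quoted above: (0.45 - 0.33) / 0.45 ≈ 0.267. Below is a minimal Python sketch of that arithmetic, plus a naive way to split a series of nightly results into a "slow" and a "normal" group at the largest gap between sorted values. The `nightly` list is made-up illustrative data, not the real benchmark output, and the helper names `relative_drop` and `split_modes` are hypothetical.

```python
def relative_drop(normal, slow):
    """Fraction of throughput lost when going from `normal` to `slow`."""
    return (normal - slow) / normal


def split_modes(samples):
    """Split samples into (low, high) groups at the largest gap between sorted values."""
    ordered = sorted(samples)
    if len(ordered) < 2:
        raise ValueError("need at least two samples to split")
    # Pair each gap with the index of its left neighbor, then pick the widest gap.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    return ordered[:cut + 1], ordered[cut + 1:]


# Hypothetical nightly throughputs in M tokens/sec (NOT real benchmark data).
nightly = [0.45, 0.46, 0.33, 0.45, 0.44, 0.33, 0.45]

low, high = split_modes(nightly)
print(f"relative drop: {relative_drop(0.45, 0.33):.1%}")
print(f"slow runs: {low}")
print(f"normal runs: {high}")
```

A largest-gap split is only a rough heuristic; it always reports two groups, even when the data has a single mode, so the size of the gap still needs to be judged by eye.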

Or does anyone have any other ideas of what could be bi-modal in Kuromoji?  I don't think this performance test has any randomness in it...


Migrated from LUCENE-9457 by Michael McCandless (@mikemccand), updated Aug 15 2020

asfimport commented 3 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> It could be hotspot noise maybe?

Could be. Or it could be something else running in the background? It'd be good to somehow monitor background CPU activity while these benchmarks are being run. I'm not much of a sysop to help out here though.
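One way to approximate the monitoring idea is to sample the system load average on a background thread while the benchmark runs. The sketch below is only an illustration under stated assumptions: `run_with_load_samples` is a hypothetical helper, not part of the nightly benchmark scripts, and it relies on `os.getloadavg()`, which is only available on Unix-like systems.

```python
import os
import threading
import time


def run_with_load_samples(workload, interval_sec=0.5):
    """Run workload() while a background thread records the 1-minute load average.

    Returns (workload result, list of (timestamp, load) samples).
    """
    samples = []
    stop = threading.Event()

    def sampler():
        # Sample first, then wait, so at least one sample is always recorded.
        while True:
            samples.append((time.time(), os.getloadavg()[0]))
            if stop.wait(interval_sec):
                break

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = workload()
    finally:
        stop.set()
        thread.join()
    return result, samples


# Stand-in workload; a real run would invoke the tokenization benchmark instead.
result, samples = run_with_load_samples(lambda: sum(range(1_000_000)), interval_sec=0.1)
print(f"result={result}, samples recorded={len(samples)}")
```

The load average also counts the benchmark's own threads, so it is a coarse signal. It would mostly be useful for comparing a slow night against a normal one, not as an absolute measure of background activity.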

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah, that is one possible theory, but this machine (a dedicated physical box) is very idle and only runs Lucene's nightly benchmarks.  Also, the other benchmarks run at those same timestamps (e.g. the other analyzers) did not seem to show a performance drop.  So I think it is not likely a time-specific environmental issue ...

asfimport commented 3 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

It's one of those things that are exciting to debug, take days to complete and sometimes never reach any reasonable explanation. :)

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> It's one of those things that are exciting to debug, take days to complete and sometimes never reach any reasonable explanation. :)

LOL I fear you have already handled too many such cases!

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Maybe something like https://github.com/mikemccand/luceneutil/issues/77 would help