It's tricky -- the algorithm used here is not novel (it comes from a book chapter by Peter Norvig), and although I ported it to Rust, I don't have deep expertise in refining it.
However, your refinement does make some sense -- have you checked how it affects the benchmarks? Presumably non-words would be somewhat rare in most inputs.
@arnoldkalmbach do you have any thoughts on this?
The version proposed above hurts the benchmarks modestly, by about 10-15% on my computer.
But here's a variation that actually seems to improve performance, if anything:
```diff
--- a/instant-segment/src/lib.rs
+++ b/instant-segment/src/lib.rs
@@ -105,7 +105,11 @@ impl Segmenter {
             Some((uni, bi_scores)) => (uni, bi_scores),
             // Penalize words not found in the unigrams according
             // to their length, a crucial heuristic.
-            None => return 1.0 - self.uni_total_log10 - word.len() as f64,
+            None => {
+                let word_len = word.len() as f64;
+                let word_count = word_len / 5.0;
+                return (1.0 - self.uni_total_log10 - word_len) * word_count;
+            }
         };
```
That doesn't make much difference to the existing benchmarks, but if I add a new "extra long" benchmark with the text itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishnessitwastheepochofbeliefitwastheepochofincredulityitwastheseasonoflightitwastheseasonofdarknessitwasthespringofhopeitwasthewinterofdespair, performance on that improves by about 10%.
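For reference, the new benchmark is shaped roughly like this. This is only a sketch: I'm assuming a criterion harness, and `build_segmenter()` is a hypothetical helper standing in for however the existing benchmarks load their unigram/bigram data; `segment(input, &mut Search)` is the crate's usual entry point.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use instant_segment::Search;

// "Extra long" input: a familiar passage with the spaces removed.
const EXTRA_LONG: &str = "itwasthebestoftimesitwastheworstoftimes\
    itwastheageofwisdomitwastheageoffoolishness\
    itwastheepochofbeliefitwastheepochofincredulity\
    itwastheseasonoflightitwastheseasonofdarkness\
    itwasthespringofhopeitwasthewinterofdespair";

fn extra_long(c: &mut Criterion) {
    // Hypothetical helper: build a Segmenter the same way the
    // existing benchmarks do, from the crate's unigram/bigram data.
    let segmenter = build_segmenter();
    let mut search = Search::default();
    c.bench_function("segment extra long", |b| {
        b.iter(|| {
            // Count the resulting words so the output isn't optimized away.
            let words = segmenter
                .segment(black_box(EXTRA_LONG), &mut search)
                .unwrap();
            black_box(words.count())
        })
    });
}

criterion_group!(benches, extra_long);
criterion_main!(benches);
```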
I'm sure you are right that long non-words don't appear in most texts; what I am doing is surely a niche use case.
I'd review a PR along those lines. Please also add a bunch of context to the comment above.
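Something along these lines, perhaps -- just a sketch of the rationale from this thread, assuming (per Norvig's heuristic) that uni_total_log10 is the log10 of the total unigram count:

```rust
// Penalize words not found in the unigrams according to their length,
// a crucial heuristic. The base term (1.0 - uni_total_log10 - word_len)
// scores the unknown word as rarer than anything in the corpus, and
// longer unknown words score lower still. Multiplying by word_len / 5.0
// charges that penalty roughly once per ~5-letter word the unknown run
// displaces, so a long stretch of nonsense is penalized per missing
// word rather than once for the whole stretch.
None => {
    let word_len = word.len() as f64;
    let word_count = word_len / 5.0;
    return (1.0 - self.uni_total_log10 - word_len) * word_count;
}
```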
I probably have an unusual use case here...
I'm writing a solver for a puzzle that takes blocks of letters, e.g.
`["SIS", "FUN", "THI"]`
and rearranges them to form something sensible (song lyrics in the actual puzzle, but e.g. here "THIS IS FUN"). I'm using instant-segment as a way of scoring the proposed solutions. What I am seeing is that the scoring is a little too happy with solutions that contain long non-words. E.g. here are two "sentences" that score approximately the same:
... where my intuition says that these are not similar in quality at all: the run of nonsense
`ildimgamkeshakquprslip`
really ought to do much more damage in the first. I reckon the problem is that the penalty for a long non-word is paid only once, no matter how long that non-word is, and that in the context of a long-enough sentence the additional "word length" penalty is insufficient.
I suggest that if there's a long string of nonsense in the middle of your sentence then the total penalty should be roughly the per-word penalty multiplied by the number of words that ought to have been there - which - I think! - makes some sort of sense in the probabilistic framework, whereas I can't find an intuition to match the current heuristic.
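To make that concrete, here is a tiny standalone sketch of the arithmetic for the nonsense run above. The uni_total_log10 value is made up purely for illustration; scores are log10 values, so more negative means less likely.

```rust
fn main() {
    // Hypothetical value: uni_total_log10 = log10(total unigram count).
    let uni_total_log10 = 12.0_f64;

    // The 22-letter run of nonsense from the example above.
    let word = "ildimgamkeshakquprslip";
    let word_len = word.len() as f64; // 22.0

    // Current heuristic: one penalty, paid once, however long the run is.
    let old_score = 1.0 - uni_total_log10 - word_len; // -33.0

    // Proposed heuristic: charge that penalty once per ~5-letter word
    // that the run of nonsense displaces.
    let word_count = word_len / 5.0; // 4.4
    let new_score = (1.0 - uni_total_log10 - word_len) * word_count; // -145.2

    println!("old: {old_score:.1}, new: {new_score:.1}");
}
```

Spread over a sentence of otherwise sensible words, -33 is easily absorbed; -145 is not.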
And it generates better solutions, faster, for my silly puzzle.
What do you think?