djc / instant-segment

Fast English word segmentation in Rust
Apache License 2.0

non-words are under-penalized #53

Closed: dimbleby closed this issue 9 months ago

dimbleby commented 9 months ago

I've probably an unusual use case here...

I'm writing a solver for a puzzle that takes blocks of letters, e.g. ["SIS", "FUN", "THI"], and rearranges them to form something sensible (song lyrics in the actual puzzle, but here, say, "THIS IS FUN"). I'm using instant-segment as a way of scoring the proposed solutions.

What I am seeing is that the scoring is a little too happy with solutions that contain long non-words. For example, here are two "sentences" that score approximately the same:

but now he got begin licence beer in curly we get home pretty less she leaves c he
shakes by spatial their write your the captain same shall be shaken an illegal begin
times name winter scrunched ildimgamkeshakquprslip got out thanks because blood and
clean baby i my baby able a prince on the baby colette be give you just gonna in abyssal
only shakes in leadbelly
begin lot blood i mean price be the princely but names be given owning am engineer c he
prices and the baby less she will burn la lives because began ste your times shakes only
just got the lake counties pale a captain and dilly on the games lynn a baby a blank go
to mr shaky shake shake shall be we get the baby you create clip their ounces chilly i m
writes squad baby let

... where my intuition says that these are not similar in quality at all: ildimgamkeshakquprslip really ought to do much more damage in the first.

I reckon the problem is that the penalty for a long non-word is paid only once, no matter how long that non-word is, and that in the context of a long-enough sentence the additional "word length" penalty is insufficient.
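To put rough numbers on that, here's a back-of-the-envelope sketch (taking uni_total_log10 to be about 12, i.e. a unigram total somewhere around 10^12; the exact value doesn't change the shape of the argument):

    fn main() {
        // Assumed: log10 of the total unigram count, roughly 12 for a ~10^12-token corpus.
        let uni_total_log10 = 12.0_f64;

        // Current score for the 22-character non-word from the example above,
        // copying the formula in lib.rs.
        let non_word = "ildimgamkeshakquprslip";
        let non_word_score = 1.0 - uni_total_log10 - non_word.len() as f64;
        println!("non-word: {non_word_score:.1}"); // about -33

        // For comparison, four or five ordinary words at (say) -5 to -6 log10
        // apiece come to roughly -25 to -30, so the gibberish scores barely
        // worse than the real text it displaces.
        let ordinary_words = 5.0 * -5.5;
        println!("five ordinary words: {ordinary_words:.1}"); // -27.5
    }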

I suggest that if there's a long string of nonsense in the middle of your sentence, then the total penalty should be roughly the per-word penalty multiplied by the number of words that ought to have been there:

--- a/instant-segment/src/lib.rs
+++ b/instant-segment/src/lib.rs
@@ -105,7 +105,14 @@ impl Segmenter {
             Some((uni, bi_scores)) => (uni, bi_scores),
             // Penalize words not found in the unigrams according
             // to their length, a crucial heuristic.
-            None => return 1.0 - self.uni_total_log10 - word.len() as f64,
+            None => {
+                let mut word_count = word.len() as f64 / 5.0;
+                if word_count < 1.0 {
+                    word_count = 1.0
+                }
+
+                return (1.0 - self.uni_total_log10) * word_count
+            },
         };

which - I think! - makes some sort of sense in the probabilistic framework; whereas I can't find an intuition to match the current heuristic.
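Concretely, with the same assumed uni_total_log10 of about 12, the proposed arm charges the unknown-word penalty once per roughly five characters, so the 22-character non-word now costs about as much as four or five missing words:

    fn main() {
        // Same assumption as before: uni_total_log10 taken to be about 12.0.
        let uni_total_log10 = 12.0_f64;
        let word_len = 22.0_f64;

        // The proposed arm: one "missing word" penalty per ~5 characters,
        // floored at one word.
        let word_count = (word_len / 5.0).max(1.0);
        let proposed = (1.0 - uni_total_log10) * word_count;
        println!("proposed non-word score: {proposed:.1}"); // about -48, versus about -33 today
    }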

And it generates better solutions, faster, for my silly puzzle.

What do you think?

djc commented 9 months ago

It's tricky -- the algorithm we're using here is not novel (it comes from a book chapter by Peter Norvig), and although I ported it to Rust, I don't have deep expertise in refining it.

However, your refinement does make some sense -- have you checked how it affects the benchmarks? Presumably non-words would be somewhat rare in most inputs.

@arnoldkalmbach do you have any thoughts on this?

dimbleby commented 9 months ago

The version proposed above does modest damage to the benchmarks, slowing them by about 10-15% on my computer.

But here's a variation that actually seems to improve performance, if anything:

--- a/instant-segment/src/lib.rs
+++ b/instant-segment/src/lib.rs
@@ -105,7 +105,11 @@ impl Segmenter {
             Some((uni, bi_scores)) => (uni, bi_scores),
             // Penalize words not found in the unigrams according
             // to their length, a crucial heuristic.
-            None => return 1.0 - self.uni_total_log10 - word.len() as f64,
+            None => {
+                let word_len = word.len() as f64;
+                let word_count = word_len / 5.0;
+                return (1.0 - self.uni_total_log10 - word_len) * word_count
+            },
         };

That doesn't do anything much on the existing benchmarks, but if I add a new "extra long" benchmark with the text itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishnessitwastheepochofbeliefitwastheepochofincredulityitwastheseasonoflightitwastheseasonofdarknessitwasthespringofhopeitwasthewinterofdespair, then performance on that improves by about 10%.
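For reference, the new benchmark is just the usual criterion shape. I'm sketching it from memory here, with build_segmenter() standing in for however the existing benches construct their Segmenter from the test data (it is not a real helper in the crate):

    use criterion::{criterion_group, criterion_main, Criterion};
    use instant_segment::{Search, Segmenter};

    // Hypothetical stand-in for however the existing benches build a Segmenter
    // from the bundled test data; not a real function in the crate.
    fn build_segmenter() -> Segmenter {
        unimplemented!()
    }

    fn extra_long(c: &mut Criterion) {
        let segmenter = build_segmenter();
        let mut search = Search::default();
        // The concatenated Dickens opening quoted above (shortened here for brevity).
        let input = "itwasthebestoftimesitwastheworstoftimes";
        c.bench_function("segment extra long", |b| {
            b.iter(|| {
                // Dropping the result keeps the sketch simple; if segment()
                // hands back a lazy iterator you'd want to consume it here.
                let _ = segmenter.segment(input, &mut search);
            })
        });
    }

    criterion_group!(benches, extra_long);
    criterion_main!(benches);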

I'm sure you are right that long non-words don't appear in most texts; what I am doing is surely a niche use case.

djc commented 9 months ago

I'd review a PR along those lines. Please also add a bunch of context to the comment above.

dimbleby commented 9 months ago

#54