bnm3k / polars-fuzzy-match

Polars extension for fzf-style fuzzy matching
MIT License
20 stars 2 forks source link

explain scoring #5

Open abubelinha opened 8 months ago

abubelinha commented 8 months ago

Looking to the basic example: The string list is ['foo', 'foo quz BAR', 'baaarfoo', 'quz']

There you say:

The higher the score, the closer the value is to the pattern

What is the highest expected score, so I can decide the value is "close enough" to the pattern? It turns out that longer patterns get higher scores:

I would expect identical strings should have the same top score (i.e. 100, for example), no matter how long they are. Also, in the last example, I would expect 'quz' the same top score (i.e. 100) and 'foo quz BAR' to score less than the top (i.e., containing the string is not the same fully matching it).

Are they any config parameters so I can make the library behave like that? Thanks

@abubelinha

bnm3k commented 8 months ago

The score depends on both the pattern and the strings being matched such that the longer the pattern is, the higher the score will be for strings that match. For example, with an pattern that's got 2500 ascii chars, strings that match against it will have a score of roughly 65,010:

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

long_str = "".join(["f"] * 5000)  # len 5000
df = pl.DataFrame(
    {
        "strs": [long_str, "alice", "bob"],
    }
)
long_pattern = "".join(["f"] * 2500)  # len 2500
out = (
    df.with_columns(
        length=pl.col("strs").str.len_bytes(),
        score=fuzzy_match_score(
            pl.col("strs"),
            long_pattern,
        ),
    )
    .filter(pl.col("score").is_not_null())
    .sort(by="score", descending=True)
)
print(out)

This outputs:

┌───────────────────────────────────┬────────┬───────┐
│ strs                              ┆ length ┆ score │
│ ---                               ┆ ---    ┆ ---   │
│ str                               ┆ u32    ┆ u32   │
╞═══════════════════════════════════╪════════╪═══════╡
│ ffffffffffffffffffffffffffffffff… ┆ 5000   ┆ 65010 │
└───────────────────────────────────┴────────┴───────┘

However, the score will never exceed 65,355 (max of an unsigned 16 bit integer). The best way to determine closeness in a manner that's independent of the length/complexity of the pattern is to filter out the null scores (as those don't match at all) and to order by score. There are some additional configurations that I haven't documented yet such as how to configure handling of uppercase characters and so on.