Open abubelinha opened 8 months ago
The score depends on both the pattern and the strings being matched: the longer the pattern, the higher the score for strings that match it. For example, with a pattern of 2,500 ASCII characters, a matching string scores roughly 65,010:
import polars as pl
from polars_fuzzy_match import fuzzy_match_score

long_str = "".join(["f"] * 5000)  # len 5000
df = pl.DataFrame(
    {
        "strs": [long_str, "alice", "bob"],
    }
)
long_pattern = "".join(["f"] * 2500)  # len 2500
out = (
    df.with_columns(
        length=pl.col("strs").str.len_bytes(),
        score=fuzzy_match_score(
            pl.col("strs"),
            long_pattern,
        ),
    )
    .filter(pl.col("score").is_not_null())
    .sort(by="score", descending=True)
)
print(out)
This outputs:
┌───────────────────────────────────┬────────┬───────┐
│ strs                              ┆ length ┆ score │
│ ---                               ┆ ---    ┆ ---   │
│ str                               ┆ u32    ┆ u32   │
╞═══════════════════════════════════╪════════╪═══════╡
│ ffffffffffffffffffffffffffffffff… ┆ 5000   ┆ 65010 │
└───────────────────────────────────┴────────┴───────┘
However, the score will never exceed 65,535 (the maximum of an unsigned 16-bit integer). The best way to judge closeness independently of the pattern's length or complexity is to filter out the null scores (those strings don't match at all) and sort by score. There are some additional settings that I haven't documented yet, such as how uppercase characters are handled.
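If a bounded number is still wanted, one possible workaround is to rescale raw scores against the score the pattern gets when matched against itself. The sketch below is only an illustration: `normalize_scores` is a hypothetical helper (not part of the library), the raw scores are made-up numbers, and it assumes the self-match score is the highest score that pattern can produce, which I have not verified. For very long patterns the raw score saturates near the 16-bit cap, so the ratio becomes less meaningful there.

```python
U16_MAX = 2**16 - 1  # 65,535: largest value an unsigned 16-bit integer can hold


def normalize_scores(scores, self_score):
    """Rescale raw scores to a 0-100 range relative to self_score.

    scores     -- raw scores; None means "no match" and is passed through
    self_score -- assumed to be the score of the pattern matched against
                  itself (hypothetical reference value, see the note above)
    """
    if not 0 < self_score <= U16_MAX:
        raise ValueError("self_score must be in the range 1..65535")
    return [
        None if s is None else min(100.0, 100.0 * s / self_score)
        for s in scores
    ]


# Made-up raw scores for illustration only; None marks a non-matching string.
raw_scores = [88, 44, None]
print(normalize_scores(raw_scores, self_score=88))
```

Non-matching strings stay `None`, so filtering out nulls and sorting by score works the same way on the rescaled values.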
Looking at the basic example, the string list is:
['foo', 'foo quz BAR', 'baaarfoo', 'quz']
There you say:
What is the highest expected score, so I can decide whether a value is "close enough" to the pattern? It turns out that longer patterns get higher scores:
pattern = 'baaarfoo'
outputs:
pattern = 'foo quz BAR'
outputs:
pattern = 'quz'
outputs:
I would expect identical strings to get the same top score (100, for example), no matter how long they are. Also, in the last example, I would expect 'quz' to get that same top score (100) and 'foo quz BAR' to score less than the top (i.e., containing the string is not the same as fully matching it).
Are there any config parameters so I can make the library behave like that? Thanks
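To make the expected behavior concrete, here is a minimal sketch of a length-independent 0-100 score. It uses `difflib.SequenceMatcher` from the Python standard library, which is a different technique from the one this library uses and is shown only to illustrate the kind of result I mean. `similarity_percent` is a name made up for this example.

```python
from difflib import SequenceMatcher


def similarity_percent(pattern, candidate):
    """Return a 0-100 similarity; identical strings score 100 at any length."""
    return 100.0 * SequenceMatcher(None, pattern, candidate).ratio()


strings = ['foo', 'foo quz BAR', 'baaarfoo', 'quz']
pattern = 'quz'

# Rank the candidates from most to least similar to the pattern.
ranked = sorted(strings, key=lambda s: similarity_percent(pattern, s), reverse=True)
for s in ranked:
    print(f"{s!r}: {similarity_percent(pattern, s):.1f}")
```

With this measure 'quz' gets 100 for the pattern 'quz', while 'foo quz BAR' scores lower because it only contains the pattern.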
@abubelinha