dcramer / peated

https://peated.com
Apache License 2.0

Improve accuracy of matching #55

Closed dcramer closed 2 months ago

dcramer commented 1 year ago

Glenfiddich 12 Year Old Amontillado Sherry Cask Finish Scotch Whisky https://www.astorwines.com/item/48699

Glenfiddich 12yr Special Edition Amontillado Sherry Cask Finish Single Malt Scotch Whisky https://woodencork.com/collections/whiskey/products/glenfiddich-12-year-old-amontillado-sherry-cask-finish-scotch-whisky

We didn't have "Glenfiddich 12-year-old Amontillado Sherry Cask" (the preferred name) in the db, so it matched on "Glenfiddich 12-year-old" instead.

Here are some quick thoughts on a tokenization strategy that might work:

  1. Do a prefix search on the distillery with a max tokens approach:

Tokenize the name, then use up to the first N tokens to search for the distillery, e.g.:

name is 'Macallan 12-year-old Cool bottle'

select * from entity where name = 'Macallan' OR name = 'Macallan 12-year-old' OR name = 'Macallan 12-year-old Cool';
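A minimal sketch of the prefix step, assuming whitespace tokenization and N = 3. The helper names `prefix_candidates` and `build_entity_query` are hypothetical, and the `?` placeholder style is an assumption that depends on the database driver:

```python
def prefix_candidates(name: str, max_tokens: int = 3) -> list[str]:
    """Return the leading 1..max_tokens tokens of `name`, each joined back into a string."""
    tokens = name.split()
    limit = min(max_tokens, len(tokens))
    return [" ".join(tokens[:n]) for n in range(1, limit + 1)]


def build_entity_query(candidates: list[str]) -> str:
    """Build a parameterized lookup against the `entity` table from the example above."""
    placeholders = ", ".join("?" for _ in candidates)
    return f"SELECT * FROM entity WHERE name IN ({placeholders})"


candidates = prefix_candidates("Macallan 12-year-old Cool bottle")
query = build_entity_query(candidates)
```

Passing the candidates as query parameters rather than interpolating them keeps store-supplied names out of the SQL string.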

  2. Do token matching on the prefix and on consecutive tokens:

The name is 'Macallan 12-year-old Cool bottle' in the db, but the store lists it as 'Macallan 12-year-old Special Edition Cool bottle'.

select * from bottle where brand = 'Macallan';

Split every bottle name, and the store's name, into tokens:

[Macallan, 12-year-old, Special, Edition, Cool, Bottle]

we find consecutive matches:

[Macallan, 12-year-old], [Cool, Bottle]
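One way to find those consecutive runs, sketched with the standard library's `difflib.SequenceMatcher` over token lists. Lowercasing and whitespace splitting are assumptions, and `consecutive_matches` is a hypothetical helper name:

```python
from difflib import SequenceMatcher


def consecutive_matches(db_name: str, listing_name: str) -> list[list[str]]:
    """Return runs of tokens that appear consecutively, in order, in both names."""
    a = db_name.lower().split()
    b = listing_name.lower().split()
    # autojunk=False disables the popularity heuristic, which is meant for long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # The final block returned is always a zero-size sentinel, so skip empty blocks.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


runs = consecutive_matches(
    "Macallan 12-year-old Cool bottle",
    "Macallan 12-year-old Special Edition Cool bottle",
)
```

For the example names this yields two runs, one for the prefix and one for the trailing tokens.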

In this case it's the only match, so we're probably fine, but what if that's a different bottle? There's probably nothing we can do (like the original problem identified here). We might be able to apply a threshold, e.g. reject the match if <40% of tokens match, or if the token prefix match is <60% of the length.
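A rough sketch of that threshold check. The comment above leaves "the length" open, so this assumes it means the character length of the stored name, and it assumes the token ratio is measured against the store listing's tokens. `match_ratios` and `passes_threshold` are hypothetical names:

```python
def match_ratios(db_name: str, listing_name: str) -> tuple[float, float]:
    """Return (share of listing tokens found in the db name, share of db name covered by the common prefix)."""
    db = db_name.lower().split()
    listing = listing_name.lower().split()
    db_set = set(db)
    shared = sum(1 for token in listing if token in db_set)
    token_ratio = shared / len(listing) if listing else 0.0

    # Collect the leading tokens that both names have in common.
    prefix = []
    for x, y in zip(db, listing):
        if x != y:
            break
        prefix.append(x)
    # Assumption: "length" is the character length of the stored name.
    prefix_ratio = len(" ".join(prefix)) / len(" ".join(db)) if db else 0.0
    return token_ratio, prefix_ratio


def passes_threshold(db_name: str, listing_name: str,
                     min_tokens: float = 0.4, min_prefix: float = 0.6) -> bool:
    """Reject when either ratio falls below its threshold."""
    token_ratio, prefix_ratio = match_ratios(db_name, listing_name)
    return token_ratio >= min_tokens and prefix_ratio >= min_prefix


ok = passes_threshold(
    "Macallan 12-year-old Cool bottle",
    "Macallan 12-year-old Special Edition Cool bottle",
)
```

Under these assumptions the Macallan example passes, while a short stored name matched against a much longer listing fails on the token ratio. The 40% and 60% values are just the ones floated above and would need tuning.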

Another example, what do we do if the db has:

Macallan 12-year-old Cool Bottle

Macallan 12-year-old Special Edition Cool Bottle

It needs to match the latter, which would be a full token match. Or should it be the longest token match?

Counting the number of tokens likely won't work because a name could have lots of additional words (like Single Malt). A full match is fine in this case, but what if a listing adds e.g. "Single Malt" to the end? The longest token match should probably still work.
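A sketch of one reading of "longest token match": score each candidate by the total character length of its tokens that line up with the listing, and break ties by preferring the candidate with fewer unmatched tokens. The tie-break and the names `match_score` and `best_candidate` are assumptions, not something settled above:

```python
from difflib import SequenceMatcher


def match_score(candidate: str, listing: str) -> tuple[int, int]:
    """Return (characters of candidate tokens matched in order, negative count of unmatched candidate tokens)."""
    a = candidate.lower().split()
    b = listing.lower().split()
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = [token for m in blocks for token in a[m.a:m.a + m.size]]
    return sum(len(token) for token in matched), -(len(a) - len(matched))


def best_candidate(candidates: list[str], listing: str) -> str:
    """Pick the candidate with the longest match; ties go to the one with fewer leftover tokens."""
    return max(candidates, key=lambda c: match_score(c, listing))


names = [
    "Macallan 12-year-old Cool Bottle",
    "Macallan 12-year-old Special Edition Cool Bottle",
]
best = best_candidate(names, "Macallan 12-year-old Special Edition Cool Bottle Single Malt")
```

Because only matched tokens count, a trailing "Single Malt" on the listing does not change which candidate wins.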

We will need to make it possible to override matches, and when we do, we should simply store those overrides as alternative names on a bottle. We may also use those cases to improve the rulesets (e.g. it's possible we choose to ignore certain words, like the bottle category name, in tokenization).
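A small sketch of how overrides and ignored words could fit together, using an in-memory dict in place of the database. The `AliasStore` class, the `IGNORED_WORDS` set, and the bottle id `42` are all hypothetical:

```python
from typing import Optional

# Hypothetical set of category words to drop before comparing names.
IGNORED_WORDS = {"single", "malt", "scotch", "whisky"}


def normalize(name: str) -> str:
    """Lowercase the name and drop ignored category words."""
    return " ".join(t for t in name.lower().split() if t not in IGNORED_WORDS)


class AliasStore:
    """Keeps manual overrides as alternative names that point at a bottle id."""

    def __init__(self) -> None:
        self._aliases: dict[str, int] = {}

    def add_override(self, listing_name: str, bottle_id: int) -> None:
        self._aliases[normalize(listing_name)] = bottle_id

    def lookup(self, listing_name: str) -> Optional[int]:
        return self._aliases.get(normalize(listing_name))


store = AliasStore()
store.add_override(
    "Glenfiddich 12yr Special Edition Amontillado Sherry Cask Finish Single Malt Scotch Whisky",
    42,
)
found = store.lookup("Glenfiddich 12yr Special Edition Amontillado Sherry Cask Finish Scotch Whisky")
```

Checking the alias store before any fuzzy matching means a corrected match stays corrected, even when another store words the category slightly differently.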

dcramer commented 1 year ago

See also #51

dcramer commented 11 months ago

Another example problem:

[two images]

This is a wildly wrong match. Really need to mock up the MVP of this idea.

dcramer commented 10 months ago

One problem I didn't answer last time: what happens when there's just a "best" match? I think we have to disqualify it, store it somewhere, and maybe match it up later.

We'll also want to store all the name mappings so the lookups are free after the first time.
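A sketch of that flow, assuming a matcher that returns a bottle id plus a confidence score. The `resolve` function, the `0.8` cutoff, and the in-memory `mappings` and `pending` containers are hypothetical stand-ins for whatever storage ends up being used:

```python
from typing import Callable, Optional

Matcher = Callable[[str], tuple[Optional[int], float]]


def resolve(listing_name: str, mappings: dict[str, int], pending: list[tuple],
            matcher: Matcher, min_confidence: float = 0.8) -> Optional[int]:
    """Return a bottle id for a listing, reusing stored mappings and parking weak matches."""
    # A stored mapping makes every lookup after the first one free.
    if listing_name in mappings:
        return mappings[listing_name]

    bottle_id, confidence = matcher(listing_name)
    if bottle_id is not None and confidence >= min_confidence:
        mappings[listing_name] = bottle_id
        return bottle_id

    # Only a "best" guess: disqualify it and keep it around for later review.
    pending.append((listing_name, bottle_id, confidence))
    return None


calls = []


def fake_matcher(name: str) -> tuple[Optional[int], float]:
    """Stand-in matcher for the example: confident only about one exact name."""
    calls.append(name)
    return (7, 0.95) if name == "Macallan 12-year-old" else (7, 0.3)


mappings: dict[str, int] = {}
pending: list[tuple] = []
first = resolve("Macallan 12-year-old", mappings, pending, fake_matcher)
second = resolve("Macallan 12-year-old", mappings, pending, fake_matcher)
weak = resolve("Macallan Cool bottle", mappings, pending, fake_matcher)
```

The second lookup never reaches the matcher, and the weak match lands in `pending` instead of being written as a mapping.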

May rely on ChatGPT to help with low-confidence matches too.

dcramer commented 2 months ago

Need to dig more into things like BM25 and see if they're a more efficient and effective alternative to embeddings for finding similar matches.

https://en.wikipedia.org/wiki/Okapi_BM25#:~:text=BM25%20is%20a%20bag%2Dof,their%20proximity%20within%20the%20document.
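For a feel of what BM25 would do with bottle names, here is a self-contained sketch of the scoring formula from the linked article, treating each bottle name as a document. The `k1` and `b` defaults are commonly used values, and the whitespace tokenization and the `bm25_scores` name are assumptions; a real setup would more likely lean on the search engine's own implementation:

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (bag of words, order ignored)."""
    docs = [d.lower().split() for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many names does each token appear?
    df = Counter(token for d in docs for token in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


bottles = [
    "Glenfiddich 12-year-old",
    "Glenfiddich 12-year-old Amontillado Sherry Cask",
    "Macallan 12-year-old",
]
scores = bm25_scores("Glenfiddich 12-year-old Amontillado Sherry Cask Finish", bottles)
best_index = scores.index(max(scores))
```

Rare tokens such as "Amontillado" carry more weight than tokens shared by every name, which is the behavior the original mismatch was missing. BM25 ignores token order, though, so it would complement rather than replace the prefix and consecutive-token checks.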

The next step at this point, though, is building a fast, accessible moderation queue and removing the non-precise upserts.