dcramer / peated

https://peated.com
Apache License 2.0

Bottle Naming Master Ticket #191

Closed dcramer closed 1 month ago

dcramer commented 1 month ago

Dumping this all in one place.

The way we want bottles named in the system is a little subjective, so we're trying to define rules that can be applied easily both by an individual and by code.

This is a rough set of rules that I'd say were the goal, but keep reading, as I think it's broken by design.

  1. Must not include the brand. We have a separate attribute to track the brand name.
  2. Should not include the vintage (1989) or category (Single Malt), except when no other descriptors are available.
  3. Should not include generalized information that is otherwise not key to bottle identification. An example of something that should be excluded is "Limited Edition".
  4. Must include the age statement when it's a key component. The age statement should be written as AGE-year-old. (TODO: this needs to be better defined.)
  5. Must not include the regional information. For example, "Single Malt Scotch", the "Scotch" term should never be included.
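As a rough illustration, rules 1, 3, 4, and 5 could be applied mechanically along these lines. This is only a sketch: the term lists, the age pattern, and the function name `normalize_bottle_name` are assumptions made up for the example, and rule 2 (vintage and category) is deliberately left out because it depends on what other descriptors are present. The age format follows the "18-year-old" form used later in this thread.

```python
import re

# Hypothetical term lists; real ones would have to be curated from the data.
GENERIC_TERMS = ["Limited Edition"]
REGIONAL_TERMS = ["Scotch"]

# Matches loose age statements such as "18 Year Old", "18yo", "18 yrs", "18-year-old".
AGE_PATTERN = re.compile(
    r"\b(\d{1,2})[\s-]*(?:years?[\s-]*old|yrs?|yo|years?)\b",
    re.IGNORECASE,
)


def normalize_bottle_name(raw: str, brand: str) -> str:
    """Apply rules 1, 3, 4 and 5 to a raw label (sketch only)."""
    name = raw.strip()

    # Rule 1: the brand is tracked separately, so drop it from the name.
    if name.lower().startswith(brand.lower()):
        name = name[len(brand):]

    # Rule 4: rewrite any age statement as AGE-year-old.
    name = AGE_PATTERN.sub(lambda m: f"{int(m.group(1))}-year-old", name)

    # Rules 3 and 5: remove generic and regional terms.
    for term in GENERIC_TERMS + REGIONAL_TERMS:
        name = re.sub(rf"\b{re.escape(term)}\b", "", name, flags=re.IGNORECASE)

    # Collapse whitespace left behind by the removals.
    return " ".join(name.split())


print(normalize_bottle_name("Aberfeldy 18 Year Old Limited Edition", "Aberfeldy"))
print(normalize_bottle_name("Lagavulin 12yo Single Malt Scotch", "Lagavulin"))
```

Even this simple version shows where the rules get subjective: the second label still carries "Single Malt", and whether that should stay depends on rule 2.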

Wrote a bit about this problem here: https://cra.mr/cta-structuring-unstructured-data

Now, the more I look at the data, the more I don't think this is going to work. I figured I'd build a manual moderation queue, match them up by hand (it's not that much data), and the problem would be solved enough. I'm finding that, while going through the queue, there's even more complexity.

Here are a few annoying examples:

Balcones Peated Texas Single Malt - note we've already removed "Whiskey" from it, as it's noise

One thought I had that would at least allow us to mostly dedupe, is the following:

Take all of the "components" of the bottle name, and create a branch ranked set of heuristics.

Balcones Peated Texas, Single Malt Whiskey

Brand is special cased, and always present as the first token. This is a solved problem for the most part.

My thinking is that we build a set of token matchers based on a priority of these attributes. That is:

1) The name, by default, excludes any attribute (age, category, generic descriptors).
2) If there is no name, we prioritize attributes (statedAge, category) - need to test this to see if that's enough attributes, or if e.g. we also have to consider something like vintageYear.
3) Generic descriptors are never present.
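A minimal sketch of that priority, assuming the label has already been split into its parts. The `BottleParts` type and `canonical_name` function are hypothetical names for illustration; only `statedAge` and `category` come from the description above, and generic descriptors are assumed to have been dropped before this step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BottleParts:
    # Words left over once brand, attributes and generic descriptors are removed.
    name_words: list = field(default_factory=list)
    stated_age: Optional[int] = None
    category: Optional[str] = None


def canonical_name(parts: BottleParts) -> str:
    # 1) If there is a name, it wins and attributes are excluded.
    if parts.name_words:
        return " ".join(parts.name_words)

    # 2) No name: fall back to attributes, age first, then category.
    fallback = []
    if parts.stated_age is not None:
        fallback.append(f"{parts.stated_age}-year-old")
    if parts.category:
        fallback.append(parts.category)

    # 3) Generic descriptors never reach this point, so they are never emitted.
    return " ".join(fallback)


print(canonical_name(BottleParts(["Peated", "Texas"], None, "Single Malt")))
print(canonical_name(BottleParts([], 12, "Single Malt")))
```

Whether `vintageYear` needs to join the fallback list is the open question noted in point 2; the sketch does not try to answer it.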

We'd have to build a list of these generic descriptors but I think we'd cover 95% pretty easily. Things like "Whiskey", "Scotch Whisky", or "Limited Edition" are easy to grab.

Now there are still some challenges here, particularly around tokenization. We would likely have to build up our known tokens (brand, category, descriptors) and extract those first. Then we'd simply tokenize the rest using a standard word-token approach (at least that's my initial thought).
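One way the two-pass idea could look, as a sketch rather than a settled design: known phrases are tried first (longest first, so "Scotch Whisky" beats a shorter overlapping phrase), and anything left falls through to a plain word token. The phrase lists, the `Token` type, and the `tokenize` function are all assumptions invented for this example.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # "brand", "category", "descriptor" or "word"
    text: str


# Hypothetical vocabularies; the real ones would be built up from the data.
KNOWN_PHRASES = {
    "category": ["Single Malt", "Bourbon"],
    "descriptor": ["Scotch Whisky", "Limited Edition", "Whiskey"],
}


def tokenize(label: str, brand: str) -> list:
    known = [("brand", brand)]
    for kind, phrases in KNOWN_PHRASES.items():
        known.extend((kind, phrase) for phrase in phrases)
    # Longest phrases first so multi-word phrases win over shorter overlaps.
    known.sort(key=lambda kp: len(kp[1]), reverse=True)

    # One alternation: every known phrase, then a generic word as the fallback.
    alternatives = [
        rf"(?P<k{i}>\b{re.escape(phrase)}\b)" for i, (_, phrase) in enumerate(known)
    ]
    alternatives.append(r"(?P<word>[\w'-]+)")
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)

    tokens = []
    for match in pattern.finditer(label):
        if match.lastgroup == "word":
            tokens.append(Token("word", match.group()))
        else:
            kind, phrase = known[int(match.lastgroup[1:])]
            tokens.append(Token(kind, phrase))  # store the canonical spelling
    return tokens


for token in tokenize("Balcones Peated Texas, Single Malt Whiskey", "Balcones"):
    print(token)
```

With these lists the example label comes out as brand, two plain words ("Peated", "Texas"), a category, and a descriptor, which is exactly the split that causes the "Balcones Peated Texas" problem described below.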

So in the above, we'd need to make sure we can match to the following bottle labels, in both directions, no matter the input:

Even with the above, there's still a problem. Using the heuristics I described, the system would think this bottle is "Balcones Peated Texas". We obviously don't want that, which means we'd have to add a manual heuristic to pull out "Texas Single Malt". That seems a little tedious, but I doubt there are many of those.

Let's say we do that; now we have "Balcones Peated". Is that OK? It seems like it might be, but I'm not sure it's desirable. The alternative is we'd want it to be "Balcones Peated Texas Single Malt", which creates a world of other problems.

That's where I am right now.

dcramer commented 1 month ago

Closing #55 in favor of this

ericshively commented 1 month ago

“What is a blended Bourbon? why do we have Blends for whiskey but not bourbon?”

All bourbons are blended. They’re >51% corn by definition. All have 5-10% barley for a certain flavor it gives. And they exchange the rest between more corn and rye/wheat whether they want it sweeter or spicier/smoother.

Whiskeys are distinguished as blended because they could also be only one grain.

Your heuristics seem good but you’d want “Balcones Peated Single Malt.” It seems like you should have a tier ranking of attributes. If no name, use age (high priority attributes), if no age, use flavor and malt (medium priority), etc.

dcramer commented 1 month ago

> “What is a blended Bourbon? why do we have Blends for whiskey but not bourbon?”
>
> All bourbons are blended. They’re >51% corn by definition. All have 5-10% barley for a certain flavor it gives. And they exchange the rest between more corn and rye/wheat whether they want it sweeter or spicier/smoother.
>
> Whiskeys are distinguished as blended because they could also be only one grain.

Actually, in hindsight I think I naively took blending to always mean multi-cask, and generalized that in my head to focus on multi-distillery, so when I see "this bourbon is blended from 3 distilleries with variable ages" I jump to "well, that's totally different than a single distillery or same-aged casks".

> Your heuristics seem good but you’d want “Balcones Peated Single Malt.” It seems like you should have a tier ranking of attributes. If no name, use age (high priority attributes), if no age, use flavor and malt (medium priority), etc.

This is where I keep landing. In fact, I've already put one foot forward by saying that if the age statement is present in the name, it's always the first attribute. E.g. "Aberfeldy 18-year-old Good Cask" will always be in that order as long as the two components are [age Statement] and [Good Cask].
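The age-first ordering can be sketched as a stable sort over the name components. The `order_components` helper and the age pattern are assumptions for illustration only; the sketch assumes age statements have already been normalized to the "18-year-old" form.

```python
import re

# Assumes the age statement was already normalized to e.g. "18-year-old".
AGE_RE = re.compile(r"^\d+-year-old$")


def order_components(components: list) -> list:
    # sorted() is stable, so non-age components keep their relative order
    # while any age statement moves to the front.
    return sorted(components, key=lambda c: 0 if AGE_RE.match(c) else 1)


print(" ".join(["Aberfeldy"] + order_components(["Good Cask", "18-year-old"])))
```

A tier ranking like the one suggested above would extend the sort key with more levels (age, then flavor or malt, and so on) rather than the two used here.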

dcramer commented 1 month ago

Some more thinking based on some changes I made recently:

dcramer commented 1 month ago

Once again, closing this in favor of #210 as we're taking a new approach and this whole goal will shift slightly.