fuzzy: Match substrings fuzzily

Phyks commented 8 years ago

Hi,

For a project I have, I need to match Youtube video titles against my own music collection managed by beets. Problem is Youtube videos have very different titles, and often have extra noise like "(official video !!!)" which prevents from using beet ls directly.

I came up with some heuristics to sanitize them, but still, this is not really reliable.

I am not sure if anyone already did it (but I could not find it) or if it might be interesting either for beet or anyone here to have a full text search in beet?

In case it might be interesting, either to be merged or as a plugin, I am open to any feedback or advice. I was planning on using whoosh for my particular prototyping case.

Thanks

jackwilsdon commented 8 years ago

What do you mean by "full text search"? It sounds like you might be looking for the fuzzy plugin.

Phyks commented 8 years ago

I mean something as what ElasticSearch does, or SQL MATCH or even Google. Just typing some search and getting back results with an associated pertinence score.

For my particular use case, it would allow me to search for "Michael Jackson - Black or white (original version) 1999" directly in beet and still get a result, whereas the search now does not return any result because it cannot match everything.

It would also be resilient to typos I think.

jackwilsdon commented 8 years ago

That certainly sounds like what the fuzzy plugin:

The fuzzy plugin provides a prefixed query that searches your library using fuzzy pattern matching. This can be useful if you want to find a track with complicated characters in the title.

You can adjust the threshold (i.e. sensitivity) in your configuration as well.

Phyks commented 8 years ago

Indeed, I missed it, my bad.

Still, it seems it is only looking at track title and not performing a query on every available field as regular ls do. Or (more likely) I missed something…

sampsyo commented 8 years ago

I believe the plugin should query the standard set of fields, unless you tell it not to (as with ordinary queries).

Phyks commented 8 years ago

Hmm, I have in my config:

plugins: … fuzzy
fuzzy:
    prefix: "~"
    threshold: 0.8

(I set a higher threshold to get less false positives)

Then, beet -l good.blb -c ~/.beets/base.conf.yaml ls "~michael" returns anything but musics from "Michael Jackson". When looking at the results, some are definitely coming from a song title match (exact match with michael, or songs named "camel" or stuff like this), some are coming from an album name match ("Miracles" for instance) but I do not see anyone obviously coming from a match in artist name.

Note that the standard search beet -l good.blb -c ~/.beets/base.conf.yaml ls "michael" does return my Michael Jackson discography.

Lowering the threshold increases a lot the number of false positives, but does not seem to give any Michael Jackson song either.

sampsyo commented 8 years ago

I think the similarity is proportional to the entire field, not just a substring. Have you tried something like ~'michael jackso'?

Phyks commented 8 years ago

Indeed, it is the case. And using ~"michael jackso" works. Though, it no longer works if I use more sophisticated string like ~"mickael jackson - black or white".

What I am looking for is the same thing as SQL MATCH syntax:

> SELECT artist.name AS artist, album.name AS album, title, MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) AS score FROM song LEFT JOIN artist ON song.artist = artist.id LEFT JOIN album ON song.album = album.id WHERE MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) ORDER BY score DESC LIMIT 1;
+-----------------+---------------------------+----------------+-------+
| artist          | album                     | title          | score |
+-----------------+---------------------------+----------------+-------+
| Michael Jackson | Essential Michael Jackson | Black or White |     4 |
+-----------------+---------------------------+----------------+-------+

I am not sure whether it exists or not already in beets, or if it could be of any use to anyone else than me?

sampsyo commented 8 years ago

Sure, it might make sense to extend the fuzzy plugin for this purpose. Substring fuzzy matching is probably more intuitive anyway. Would that make sense for you?

Phyks commented 8 years ago

Yes, I think that would do it.

sampsyo commented 8 years ago

Cool! I've updated the title to reflect that idea.

carreter commented 6 months ago

Sorry for the necro, but I just wanted to say this would be immensely helpful. I don't think it would be too difficult to implement, either.

Off the top of my head, I think the way to do this would be to change how the ratio threshold is calculated. The current implementation uses difflib to get an upper bound on the similarity between a query and a database entry:

https://github.com/beetbox/beets/blob/dae525741bc43f658916b03250d95a4b8e6aeef0/beetsplug/fuzzy.py#L26-L34

Looking at difflib this ratio is defined as 2.0 * matched_characters / (len(pattern) + len(val)). Notably, matched_characters <= min(len(pattern) + len(val)), so if pattern is 1/10th the size of val the highest match you're going to get is 1/ (10 + 1) = 0.09.

What I propose is that the threshold calculation be changed to:

threshold = config["fuzzy"]["threshold"].as_number()
if len(pattern) < len(val):
    max_possible_ratio = len(pattern) / (len(pattern) + len(val))
    threshold *= max_possible_ratio

This should not impact performance at all and should solve this issue. Happy to put up a PR!

carreter commented 6 months ago

Code snippet I provided was slightly wrong, should be 2 * len(pattern) / (len(pattern) + len(val)).

I also noticed that using quick_ratio alone can result in too much matching (it's an upper bound, so it can result in the algo being too lenient), so I changed the method to calculate the exact ratio if quick_ratio meets the threshold. This should improve accuracy with minimal performance cost.

beetbox / beets

fuzzy: Match substrings fuzzily #2043