Open Phyks opened 8 years ago
What do you mean by "full text search"? It sounds like you might be looking for the fuzzy plugin.
I mean something like what ElasticSearch does, or SQL MATCH, or even Google: just typing a search and getting back results with an associated relevance score.
For my particular use case, it would allow me to search for "Michael Jackson - Black or white (original version) 1999" directly in beet and still get a result, whereas the search now does not return any result because it cannot match everything.
It would also be resilient to typos I think.
That certainly sounds like what the fuzzy plugin does:
The fuzzy plugin provides a prefixed query that searches your library using fuzzy pattern matching. This can be useful if you want to find a track with complicated characters in the title.
You can adjust the threshold (i.e. sensitivity) in your configuration as well.
Indeed, I missed it, my bad.
Still, it seems to look only at the track title rather than querying every available field as a regular ls does. Or (more likely) I missed something…
I believe the plugin should query the standard set of fields, unless you tell it not to (as with ordinary queries).
Hmm, I have in my config:
plugins: … fuzzy
fuzzy:
    prefix: "~"
    threshold: 0.8
(I set a higher threshold to get fewer false positives.)
Then, beet -l good.blb -c ~/.beets/base.conf.yaml ls "~michael" returns anything but music by "Michael Jackson". Looking at the results, some clearly come from a song title match (an exact match with michael, or songs named "camel" and the like), some come from an album name match ("Miracles" for instance), but I do not see any that obviously comes from a match on the artist name.
Note that the standard search beet -l good.blb -c ~/.beets/base.conf.yaml ls "michael" does return my Michael Jackson discography.
Lowering the threshold greatly increases the number of false positives, but does not seem to return any Michael Jackson song either.
I think the similarity is measured against the entire field, not just a substring. Have you tried something like ~'michael jackso'?
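To see why ~michael misses "Michael Jackson" at a threshold of 0.8, here is a small sketch using difflib directly. It assumes the plugin lowercases the field and compares the whole value with SequenceMatcher.quick_ratio(); that assumption is consistent with the results reported above but is not copied from the plugin's source, and the helper name field_similarity is made up for illustration.

```python
from difflib import SequenceMatcher

def field_similarity(pattern, val):
    """Upper-bound similarity between a query and a whole field value.

    Assumption: mirrors what the fuzzy plugin is believed to do
    (lowercase the field, then call quick_ratio on the full string).
    """
    return SequenceMatcher(None, pattern, val.lower()).quick_ratio()

THRESHOLD = 0.8  # same value as in the config quoted in the thread

for val in ["Michael Jackson", "Camel", "Miracles"]:
    score = field_similarity("michael", val)
    print(f"{val!r}: {score:.3f} match={score >= THRESHOLD}")

# A query that covers almost the whole field clears the threshold.
print(field_similarity("michael jackso", "Michael Jackson") >= THRESHOLD)
```

The short query scores 14/22 (about 0.64) against the full artist name because the denominator counts every character of the field, while short values such as "Camel" share most of their letters with the query and score higher.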
Indeed, that is the case, and using ~"michael jackso" works. However, it no longer works if I use a more elaborate string like ~"mickael jackson - black or white".
What I am looking for is the same thing as SQL MATCH syntax:
> SELECT artist.name AS artist, album.name AS album, title,
      MATCH (artist.name, album.name, song.title)
          AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) AS score
  FROM song
  LEFT JOIN artist ON song.artist = artist.id
  LEFT JOIN album ON song.album = album.id
  WHERE MATCH (artist.name, album.name, song.title)
      AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE)
  ORDER BY score DESC
  LIMIT 1;
+-----------------+---------------------------+----------------+-------+
| artist          | album                     | title          | score |
+-----------------+---------------------------+----------------+-------+
| Michael Jackson | Essential Michael Jackson | Black or White |     4 |
+-----------------+---------------------------+----------------+-------+
I am not sure whether this already exists in beets, or whether it could be of use to anyone other than me.
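As a rough illustration of the kind of scored, multi-field search described here, the sketch below counts how many query tokens appear in any of the artist, album, or title fields. It is a toy token-overlap score, not MySQL's full-text algorithm and not anything beets provides; the function names, the minimum token length of 3, and the sample records other than the first are all assumptions made for the example.

```python
import re

def tokenize(text, min_len=3):
    """Lowercase, split on non-alphanumerics, drop very short tokens.

    min_len=3 is an arbitrary choice for this sketch.
    """
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= min_len}

def relevance(query, record):
    """Count the query tokens that appear in any field of the record."""
    field_tokens = set()
    for value in record.values():
        field_tokens |= tokenize(value)
    return len(tokenize(query) & field_tokens)

# Sample records invented for illustration (only the first mirrors the thread).
library = [
    {"artist": "Michael Jackson", "album": "Essential Michael Jackson", "title": "Black or White"},
    {"artist": "Michael Jackson", "album": "Thriller", "title": "Beat It"},
    {"artist": "Camel", "album": "Mirage", "title": "Freefall"},
]

query = "michael jackson - black or white (original version) 1999"
ranked = sorted(library, key=lambda r: relevance(query, r), reverse=True)
best = ranked[0]
print(best["title"], relevance(query, best))
```

Extra query words such as "original", "version" and "1999" simply fail to add to the score instead of excluding the track. Unlike fuzzy matching, exact token overlap of this kind is not resilient to typos.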
Sure, it might make sense to extend the fuzzy plugin for this purpose. Substring fuzzy matching is probably more intuitive anyway. Would that make sense for you?
Yes, I think that would do it.
Cool! I've updated the title to reflect that idea.
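One possible reading of the substring fuzzy matching discussed above is to slide a window of the query's length over the field and keep the best similarity. This is only a sketch of that idea under stated assumptions, not the plugin's implementation; best_window_ratio is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def best_window_ratio(pattern, val):
    """Best similarity between pattern and any same-length slice of val."""
    pattern, val = pattern.lower(), val.lower()
    n = len(pattern)
    # Nothing to slide when the pattern is empty or as long as the field.
    if n == 0 or n >= len(val):
        return SequenceMatcher(None, pattern, val).ratio()
    return max(
        SequenceMatcher(None, pattern, val[i:i + n]).ratio()
        for i in range(len(val) - n + 1)
    )

print(best_window_ratio("michael", "Michael Jackson"))  # exact substring
print(best_window_ratio("jakson", "Michael Jackson"))   # substring with a typo
print(best_window_ratio("michael", "Camel"))            # unrelated short field
```

The trade-off is cost: each field needs one SequenceMatcher comparison per window position rather than a single comparison.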
Sorry for the necro, but I just wanted to say this would be immensely helpful. I don't think it would be too difficult to implement, either.
Off the top of my head, I think the way to do this would be to change how the ratio threshold is calculated. The current implementation uses difflib to get an upper bound on the similarity between a query and a database entry.
Looking at difflib, this ratio is defined as 2.0 * matched_characters / (len(pattern) + len(val)). Notably, matched_characters <= min(len(pattern), len(val)), so if pattern is 1/10th the size of val, the highest match you're going to get is 2 / (10 + 1) ≈ 0.18.
What I propose is that the threshold calculation be changed to:
threshold = config["fuzzy"]["threshold"].as_number()
if len(pattern) < len(val):
    max_possible_ratio = len(pattern) / (len(pattern) + len(val))
    threshold *= max_possible_ratio
This should not impact performance at all and should solve this issue. Happy to put up a PR!
The code snippet I provided was slightly wrong: max_possible_ratio should be 2 * len(pattern) / (len(pattern) + len(val)).
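Putting the corrected formula together, the proposal might look roughly like the sketch below. It is a standalone illustration, not the actual PR: the threshold is passed as a plain argument instead of being read from the beets config, and the names scaled_threshold and fuzzy_match are made up for the example.

```python
from difflib import SequenceMatcher

def scaled_threshold(threshold, pattern, val):
    """Scale the threshold by the best ratio a shorter pattern could reach."""
    if len(pattern) < len(val):
        max_possible_ratio = 2 * len(pattern) / (len(pattern) + len(val))
        return threshold * max_possible_ratio
    return threshold

def fuzzy_match(pattern, val, threshold=0.8):
    """Compare quick_ratio against the length-adjusted threshold."""
    ratio = SequenceMatcher(None, pattern, val).quick_ratio()
    return ratio >= scaled_threshold(threshold, pattern, val)

print(scaled_threshold(0.8, "michael", "michael jackson"))
print(fuzzy_match("michael", "michael jackson"))
print(fuzzy_match("michael", "the beatles"))
```

With this scaling, a query that is fully contained in a longer field reaches the maximum possible ratio and therefore passes for any threshold up to 1.0, while a field that merely shares a few letters still falls short.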
I also noticed that using quick_ratio alone can result in too much matching (it's an upper bound, so the algorithm can end up too lenient), so I changed the method to calculate the exact ratio if quick_ratio meets the threshold. This should improve accuracy with minimal performance cost.
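The two-stage check described above can be sketched as follows. This is an illustration with a fixed threshold and a hypothetical function name, not the code from the PR: quick_ratio acts as a cheap filter, and the more expensive ratio is only computed for candidates that pass it.

```python
from difflib import SequenceMatcher

def two_stage_match(pattern, val, threshold):
    """Use quick_ratio as a cheap pre-filter, then confirm with ratio."""
    matcher = SequenceMatcher(None, pattern, val)
    # quick_ratio is an upper bound on ratio, so failing it is conclusive.
    if matcher.quick_ratio() < threshold:
        return False
    return matcher.ratio() >= threshold

m = SequenceMatcher(None, "michael", "camel")
# quick_ratio ignores character order, so it can pass where ratio does not.
print(m.quick_ratio() >= 0.8, m.ratio() >= 0.8)
print(two_stage_match("michael", "camel", 0.8))
print(two_stage_match("michael jackso", "michael jackson", 0.8))
```

Because quick_ratio only compares character counts, "camel" looks close to "michael" on the first pass; the exact ratio, which respects character order, rejects it.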
Hi,
For a project of mine, I need to match YouTube video titles against my own music collection managed by beets. The problem is that YouTube videos have widely varying titles, and often carry extra noise like "(official video !!!)", which prevents me from using beet ls directly.
I came up with some heuristics to sanitize them, but this is still not really reliable.
I am not sure whether anyone has already done this (I could not find anything), or whether a full text search in beet might be interesting either for beet or for anyone here?
In case it is, either to be merged or as a plugin, I am open to any feedback or advice. I was planning on using whoosh for my particular prototyping case.
Thanks