beetbox / beets

music library manager and MusicBrainz tagger
http://beets.io/
MIT License
12.73k stars 1.82k forks source link

String queries should be case-insensitive even for non-ASCII characters #3160

Closed netvl closed 5 years ago

netvl commented 5 years ago

Problem

When searching for metadata written in English or, in other words, in latin script, beet ls performs case-insensitive search, which is very nice. With non-latin scripts (I've checked with cyrillic, since it is the only one in which I have tracks and which has the case concept), however, case-insensitive search does not work.

Running this command:

$ beet ls сплин

outputs nothing, while this one:

$ beet ls Сплин

shows all the tagged files correctly.

I would expect this to work for any kind of scripts, probably based on unicode case transformations.

Setup

My configuration (output of beet config) is:

lyrics:
    bing_lang_from: []
    auto: yes
    bing_client_secret: REDACTED
    bing_lang_to:
    google_API_key: REDACTED
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    force: no
    local: no
    sources:
    - google
    - lyricwiki
    - musixmatch
    - genius
directory: ~/Music
library: ~/.local/share/beets/music.db
sort_item: artist+ year+ album+ disc+ track+
original_date: yes

ui:
    color: yes

import:
    move: no
    log: ~/.local/share/beets/import.log
    incremental: yes

paths:
    default: $albumartist/$year - $album%aunique{albumartist year album}/%if{$multidisc,Disc $disc/}$track - $title
    comp: Compilations/$album%aunique{}/$track - $title
    singleton: Singletons/$artist - $title
item_fields:
    multidisc: 'True if disctotal > 1 else False

        '

plugins: chroma fromfilename edit embedart fetchart lyrics zero bpd convert info export alternatives types inline web
chroma:
    auto: no
embedart:
    compare_threshold: 50
    maxwidth: 0
    auto: yes
    ifempty: no
    remove_art_file: no
fetchart:
    store_source: yes
    auto: yes
    minwidth: 0
    maxwidth: 0
    enforce_ratio: no
    cautious: no
    cover_names:
    - cover
    - front
    - art
    - album
    - folder
    sources:
    - filesystem
    - coverart
    - itunes
    - amazon
    - albumart
    google_key: REDACTED
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
zero:
    fields: month day
    auto: yes
    keep_fields: []
    update_database: no
bpd:
    host: 127.0.0.1
    port: 6600
    password: REDACTED
    volume: 100
convert:
    never_convert_lossy_files: yes
    format: opus
    dest:
    pretend: no
    threads: 12
    formats:
        aac:
            command: ffmpeg -i $source -y -vn -acodec aac -aq 1 $dest
            extension: m4a
        alac:
            command: ffmpeg -i $source -y -vn -acodec alac $dest
            extension: m4a
        flac: ffmpeg -i $source -y -vn -acodec flac $dest
        mp3: ffmpeg -i $source -y -vn -aq 2 $dest
        opus: ffmpeg -i $source -y -vn -acodec libopus -ab 96k $dest
        ogg: ffmpeg -i $source -y -vn -acodec libvorbis -aq 3 $dest
        wma: ffmpeg -i $source -y -vn -acodec wmav2 -vn $dest
    max_bitrate: 500
    auto: no
    tmpdir:
    quiet: no
    embed: yes

    paths: {}
    no_convert: ''
    copy_album_art: no
    album_art_maxwidth: 0
types:
    phone_export: bool
alternatives:
    phone:
        directory: ~/tmp/phone-music
        formats: opus mp3 ogg
        query: phone_export:true
export:
    default_format: json
    json:
        formatting:
            ensure_ascii: no
            indent: 4
            separators: [',', ': ']
            sort_keys: yes
edit:
    albumfields: album albumartist
    itemfields: track title artist album
    ignore_fields: id path
pathfields: {}
album_fields: {}
web:
    host: 127.0.0.1
    port: 8337
    cors: ''
    cors_supports_credentials: no
    reverse_proxy: no
    include_paths: no
sampsyo commented 5 years ago

Sounds good! See also #2626 for some prior discussion.

netvl commented 5 years ago

I think the the solutions has two parts: first, the user of beets should have the ICU extension for SQLite on their machine (at least for Linux systems, where system SQLite libraries are usually used). This seems to be quite obscure, unfortunately - it seems that there is no ICU extension anywhere in Archlinux packages, including AUR, which suggests than it is even worse in other distros (maybe sans Gentoo and similar distributions). All instructions on the net suggest building the extension manually.

Second, beets should load this extension when it queries the database. From what I understand, this should be an explicit action on the client side of the database. After it is done, the LIKE queries should handle character case automatically.

I wonder, could this (loading an SQLite extension) be done through a plugin? If yes, then I think adding a plugin with an optional explicit path to the extension shared object (possibly using some system directories, if it is not specified), which loads this extension, could be a nice solution.

netvl commented 5 years ago

Building sqlite with ICU module enabled does indeed help, and makes it work automatically without any changes necessary from the beets side. This kind of solves this issue specifically for me, since I can build sqlite with necessary options. But I wonder - across all platforms, how often maintainers build sqlite with ICU extension built-in?

jackwilsdon commented 5 years ago

That's great to hear @netvl, thanks for doing some digging! It's awesome to hear that beets doesn't need any changes to support this, and I've been playing around with using a dynamic library for it too.

It does seem that we can load the ICU extension into beets' SQLite database at startup using a plugin, although we do need to reach into the beets internals a bit;

from __future__ import division, absolute_import, print_function

from beets.plugins import BeetsPlugin

class LoadIcuPlugin(BeetsPlugin):
    def __init__(self):
        super(LoadIcuPlugin, self).__init__()
        self.register_listener('library_opened', self.library_opened)

    def library_opened(self, lib):
        lib._connection().load_extension('libicu.so')

It seems that we could also call db.enable_load_extension(True) inside dbcore and then just use the Transaction#query method to run SELECT load_extension("libicu.so"), which doesn't require calling any private methods (like _connection()).

I think this could be made into a simple LoadExtPlugin which takes a list of plugins to load from your beets configuration.

sampsyo commented 5 years ago

That's cool! We could even fold this into beets core if we could make it optional. Is it easy to load the plugin if it's available, and silently do nothing if it's not?

jackwilsdon commented 5 years ago

The main issue is that you need a path to the plugin, I guess we could make it a core configuration option to provide the path? I'm not sure if there's a "standardised" path for SQLite plugins.

sampsyo commented 5 years ago

Ah, of course, that makes sense! In that case, a beets plugin for loading these could be the right path.

Moonbase59 commented 4 years ago

Even with ICU enabled, there are still edge cases that won’t work, I find many with French accented letters: The song "À quoi ça sert l'amour (avec Théo Sarapo)" by "Édith Piaf" cannot be found using these:

beet ls edith piaf
beet ls Edith piaf
beet ls a quoi ca

but will be found using these:

beet ls édith piaf
beet ls Édith piaf
beet ls a quoi ça
beet ls A QUOI ÇA

It appears that an accented À is handled differently from an accented É, ditto for ç. This is not a beets problem but a problem with SQLite’s handling of ICU, LIKE and level 1 collation.