MusicPlayerDaemon / MPD

Music Player Daemon
https://www.musicpd.org/
GNU General Public License v2.0

Make suffixes functions ignore case #2021

Closed: squeaktoy closed this 1 month ago

squeaktoy commented 2 months ago

Previously, when a decoder plugin provided a suffixes function, the suffixes were compared case-sensitively, so a file ending in e.g. .mp3 was not treated the same as one ending in .MP3, and then no decoder plugin might be found for the .MP3 file.
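
For illustration only, here is a minimal standalone sketch (not MPD's actual code) of that behaviour in std::set terms: with the default case-sensitive comparison, the upper-case key is simply never found.

#include <set>
#include <string>

int main() {
    // roughly how suffixes were effectively stored before this change
    std::set<std::string, std::less<>> suffixes{"mp3", "flac"};

    // a file named "song.MP3" yields the suffix "MP3"; the case-sensitive
    // lookup misses, so no decoder plugin would be found for the file
    return suffixes.find("MP3") == suffixes.end() ? 0 : 1;
}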

This commit introduces a comparator that makes comparisons ignore case by always lowering the case on both inputs.

The long-winded std::set<std::string, std::less<>> is replaced with DecoderPlugin::SuffixSet, so suffixes functions no longer have to depend on an arbitrary type that may change in the future and can use this abstraction from now on.

Credit for the comparator goes to: Kevin "Alipha" Spinar, from #C++ on libera.chat

MaxKellermann commented 2 months ago

I told you on IRC that I prefer using only lower-case strings and prefer not to have this comparison operator. So ... what's up?

squeaktoy commented 2 months ago

I thought you said that either method could be done, but that you preferred the lowercase strings. The problem with lower-casing the key right before the comparison is that it has more gotchas than this method: a decoder plugin would have to make sure that the std::set contains only lowercase suffixes, or the whole thing won't work. With this commit, the comparison itself is changed, so it has no gotchas like that. It also doesn't look like a hack, whereas lower-casing a string before comparing (which requires copying a const char array into a new buffer) does.

MaxKellermann commented 2 months ago

A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work

That's already true, isn't it?

Are there any more "gotchas"? Because that's the only one you named.

while making a string lowercase before comparing (which requires putting a const char array into a new buffer) does look like a hack

Not to me. You need only a small fixed-size buffer for this; we're talking about strings that are probably all 4 characters or less, and if a string is longer than that, you can skip the search completely. It is so simple that this is the reason I prefer using just lower-case everywhere.
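
To make that concrete, here is a rough sketch of the suggested lookup (the ContainsSuffix name, the 4-byte buffer and the standalone main are illustrative assumptions, not actual MPD code): lower-case the key once into a small stack buffer, bail out early if it is too long, and keep the stored suffixes lower-case.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>

// suffixes are stored lower-case, so a plain case-sensitive set is enough
using SuffixSet = std::set<std::string, std::less<>>;

static bool ContainsSuffix(const SuffixSet &set, std::string_view suffix) noexcept
{
    char buffer[4]; // small fixed-size buffer on the stack
    if (suffix.size() > sizeof(buffer))
        return false; // longer than any known suffix: skip the search completely

    // lower-case the looked-up key exactly once per lookup
    char *end = std::transform(suffix.begin(), suffix.end(), buffer,
        [](unsigned char ch){ return char(std::tolower(ch)); });
    return set.find(std::string_view{buffer, std::size_t(end - buffer)}) != set.end();
}

int main()
{
    const SuffixSet suffixes{"mp3", "flac", "s3m"};
    return ContainsSuffix(suffixes, "MP3") ? 0 : 1; // found, because the key was lower-cased first
}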

But what does look like a hack is this ugly mess of a struct IgnoreCaseComparator, which is much more complicated than it needs to be. I mean this:

template <typename T>
concept char_ptr = std::same_as<std::decay_t<T>, const char *> ||
           std::same_as<std::decay_t<T>, char *>;

template <typename T>
concept not_char_ptr = !char_ptr<T>;

struct IgnoreCaseComparator {
    using is_transparent = void;

    template<not_char_ptr T, not_char_ptr U>
    bool operator()(const T &a, const U &b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
            b.begin(), b.end(),
            [](char left, char right) {
                return std::tolower(left) < std::tolower(right);
            }
        );
    }

    template<char_ptr T, not_char_ptr U>
    bool operator()(const T &left, const U &right) const {
        return (*this)(std::string_view(left), right);
    }

    template<not_char_ptr T, char_ptr U>
    bool operator()(const T &left, const U &right) const {
        return (*this)(left, std::string_view(right));
    }
};

is equivalent to:

struct IgnoreCaseComparator {
    using is_transparent = void;

    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
                        b.begin(), b.end(),
            [](char left, char right) {
                return std::tolower(left) < std::tolower(right);
            }
        );
    }
};

... because instead of this juggling with C++20 concepts and overloads, you can just let the compiler convert everything implicitly to std::string_view. But yeah, C++ lets you turn the complexity slider up to eleven if you want to.
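
A self-contained illustration of that point (this repeats the simplified comparator above so the snippet compiles on its own): the single std::string_view overload serves const char *, std::string and std::string_view keys alike, because each of them converts implicitly.

#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <string_view>

struct IgnoreCaseComparator {
    using is_transparent = void;

    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
            [](char left, char right) {
                return std::tolower(left) < std::tolower(right);
            });
    }
};

int main() {
    std::set<std::string, IgnoreCaseComparator> suffixes{"mp3", "flac"};

    const char *p = "MP3";       // const char * -> std::string_view
    std::string s = "FLAC";      // std::string  -> std::string_view
    std::string_view v = "ogg";  // already a std::string_view

    return (suffixes.contains(p) && suffixes.contains(s) && !suffixes.contains(v)) ? 0 : 1;
}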

squeaktoy commented 2 months ago

A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work

That's already true, isn't it?

When using a const char array with suffixes, the comparisons are already case-insensitive, regardless of whether the suffixes in the char array are uppercase or lowercase. I just tested this by changing s3m to S3M in the openmpt decoder plugin's suffix array and updating the database: files ending in s3m, s3M, and S3M were all visible. So no, that's not true. If I went with your idea of modifying the key before the comparison, I'd introduce a gotcha which currently doesn't exist.

Another gotcha with your proposal is that a fixed buffer suddenly introduces a limit on how long a file extension may be before some kind of bug is triggered. I don't really like introducing arbitrary buffer limits, especially when C++ can abstract them away. While it's possible to copy a const char array into a std::string, iirc that puts it on the heap. As far as I understand the code in my commit, it uses string_view, so I would expect it to be more efficient than copying char arrays or strings around, without the downside of an arbitrary buffer size limit like in your proposal.

I just modified IgnoreCaseComparator to be simpler and it works on my end.

MaxKellermann commented 2 months ago

A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work

That's already true, isn't it?

When using a const char array with suffixes, the comparisons are already case-insensitive, no matter if the suffixes in the char array are uppercase or lowercase. I just tested this by changing s3m to S3M in the openmpt decoder plugin's suffix array and updating the database. I was able to see files ending in s3m, s3M, and S3M. So no, that's not true.

What? You changed a plugin to emit upper-case suffixes to disprove my assertion that the suffixes are always lower-case? I don't get it. This makes no sense at all.

If I were to go with your idea of modifying the key before the comparison, then I'd introduce this gotcha which currently doesn't exist.

You created an artificial gotcha and then complain that the gotcha you created exists. I don't get it.

Another gotcha with your proposal is that with a fixed buffer, suddenly we introduce some kind of limit to how long a file extension is allowed to be before some kind of bug gets triggered.

What bug gets triggered? I don't get it.

I don't really like the idea of introducing arbitrary limits in buffers, especially when C++ can abstract those away.

Which comes at a cost.

While it's possible to copy a const char array into a std::string, iirc that puts it on the heap.

This is the cost. Making things dynamic is a non-zero-cost abstraction. But this argument is a straw man. Nobody suggested using a std::string.

As far as I can understand the code in my commit, it uses string_view and so I would expect it to be more efficient than copying char arrays or strings around, without the downside that an arbitrary buffer size limit would have like in your proposal.

Your expectation is wrong. I benchmarked your code against my idea: inserting a million strings into a SuffixSet, then doing ten million lookups. This is your version:

          8,643.26 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            12,509      page-faults:u                    #    1.447 K/sec                     
    22,282,246,448      cycles:u                         #    2.578 GHz                       
    29,041,683,748      instructions:u                   #    1.30  insn per cycle            
     6,464,829,174      branches:u                       #  747.962 M/sec                     
        81,039,935      branch-misses:u                  #    1.25% of all branches           

and this is mine:

          3,919.86 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            19,635      page-faults:u                    #    5.009 K/sec                     
    10,049,712,992      cycles:u                         #    2.564 GHz                       
    10,584,204,270      instructions:u                   #    1.05  insn per cycle            
     3,097,294,132      branches:u                       #  790.154 M/sec                     
        52,383,153      branch-misses:u                  #    1.69% of all branches           

Wow, yours is slower by a factor of two!!

I expected your version to be slower, but I didn't expect it to be that slow. Of course it's slower: my code calls std::tolower() 40 million times (10 million tests, 4 characters per looked-up key, and the transformation is done only once per lookup); yours calls std::tolower() 895 million times (again and again for both the looked-up key and the current tree node, at every level of the tree being traversed). To me, it was obvious that it had to be slower.

squeaktoy commented 2 months ago

What? You changed a plugin to emit upper-case suffixes to disprove my assertion that the suffixes are always lower-case? I don't get it. This makes no sense at all.

You created an artificial gotcha and then complain that the gotcha you created exists. I don't get it.

What I was trying to say is that right now, there's no code that assumes suffixes are always lowercase (outside the scope of suffixes_function), so I was trying to stay consistent with that and not introduce any such assumption into my code.

Another gotcha with your proposal is that with a fixed buffer, suddenly we introduce some kind of limit to how long a file extension is allowed to be before some kind of bug gets triggered.

What bug gets triggered? I don't get it.

I was thinking that if you stop searching once a certain buffer size limit is reached, two files with long but different file extensions might end up being treated the same, or a long file extension might not be recognized at all just because of the fixed buffer. This is pretty theoretical though.

I don't really like the idea of introducing arbitrary limits in buffers, especially when C++ can abstract those away.

Which comes at a cost.

Well yeah, but I was actually thinking that a comparator could prevent copying entire buffers by only lowering the case during the comparisons. But this was a rough guess, as I hadn't benchmarked it yet.

While it's possible to copy a const char array into a std::string, iirc that puts it on the heap.

This is the cost. Making things dynamic is a non-zero-cost abstraction. But this argument is a straw man. Nobody suggested using a std::string.

Thanks for pointing out the logical fallacy. I'll explain my thought process. While you didn't suggest using std::string, I personally found it to be a very simple solution to the problem. Here's the code I had written before, which might be similar to your proposal:

diff --git a/src/db/update/Walk.cxx b/src/db/update/Walk.cxx
index cafda30e0..d958dc23c 100644
--- a/src/db/update/Walk.cxx
+++ b/src/db/update/Walk.cxx
@@ -23,6 +23,7 @@
 #include "Log.hxx"

 #include <cassert>
+#include <cctype>
 #include <cerrno>
 #include <exception>
 #include <memory>
@@ -187,9 +188,13 @@ UpdateWalk::UpdateRegularFile(Directory &directory,
    if (suffix == nullptr)
        return false;

-   return UpdateSongFile(directory, name, suffix, info) ||
-       UpdateArchiveFile(directory, name, suffix, info) ||
-       UpdatePlaylistFile(directory, name, suffix, info);
+   std::string lower_suffix(suffix);
+   std::transform(lower_suffix.begin(), lower_suffix.end(), lower_suffix.begin(),
+       [](unsigned char c){ return std::tolower(c); });
+
+   return UpdateSongFile(directory, name, lower_suffix.c_str(), info) ||
+       UpdateArchiveFile(directory, name, lower_suffix.c_str(), info) ||
+       UpdatePlaylistFile(directory, name, lower_suffix.c_str(), info);
 }

 void

It's a three-line fix to the problem, but I personally disliked it because it potentially copies a char array into a std::string on the heap, and I believed that a comparator could be faster and cleaner, and could also avoid the assumption that suffixes are lowercase, without resorting to a fixed buffer.

Your expectation is wrong. I tried benchmarking your code against my idea; inserting a million strings into a SuffixSet, and then doing ten million lookups. This is your version:

Which commit? Is it the latest?

and this is mine:

Which commit or what code?

Also how did you write that benchmark? I appreciate the effort but it would be nice to know so I can do that to test my own code next time.

MaxKellermann commented 2 months ago

I was thinking that if you were to stop searching after a certain buffer size limit were reached, it might result in two files which have long file extensions that are different to be treated the same, potentially.

Remember when I wrote "if a string is longer than that, you can skip the search completely"? I already covered both cases you mentioned in my initial suggestion.

Well yeah, but I was actually thinking that a comparator could prevent copying entire buffers by only lowering the case during the comparisons. But this was a rough guess, as I hadn't benchmarked it yet.

In this case, copying the buffer isn't more expensive than lowering the case during a comparison, and you do a lot of comparisons per lookup, but only one copy per lookup.

std::string lower_suffix(suffix);
std::transform(lower_suffix.begin(), lower_suffix.end(), lower_suffix.begin(),
    [](unsigned char c){ return std::tolower(c); });

And this is what I explicitly said I wouldn't do, because it allocates memory. My suggestion was to use a fixed-size buffer (on the stack), which is cheap.

Also how did you write that benchmark? I appreciate the effort but it would be nice to know so I can do that to test my own code next time.

It's a useless micro-benchmark; I only did it because I wanted to test whether you were right, and it turns out my gut feeling was correct.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <set>
#include <string>
#include <string_view>
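
// Build with -DFOO to benchmark the IgnoreCaseComparator version; without it,
// the lower-case-into-a-fixed-buffer version is used instead.
// -DCOUNT additionally counts how often the case-folding lambdas run.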

#ifdef COUNT
static unsigned long long y;
#endif

#ifdef FOO

struct IgnoreCaseComparator {
    using is_transparent = void;

    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
            b.begin(), b.end(),
            [](char left, char right) {
#ifdef COUNT
                ++y;
#endif
                return std::tolower(left) < std::tolower(right);
            }
        );
    }
};

using SuffixSet = std::set<std::string, IgnoreCaseComparator>;
#else
using SuffixSet = std::set<std::string, std::less<>>;
#endif

static std::string_view ToString(char *buffer, unsigned i) noexcept
{
    buffer[0] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[1] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[2] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[3] = ' ' + (i & 0x7f);
    return std::string_view{buffer, 4};
}

static bool Contains(const SuffixSet &set, std::string_view x) noexcept
{
#ifndef FOO
    char buffer[4];
    if (x.size() > sizeof(buffer))
        return false;

    char *end = std::transform(x.begin(), x.end(), buffer, [](char ch){
#ifdef COUNT
        ++y;
#endif
        return tolower(ch);
    });
    x = {buffer, (std::size_t)(end - buffer)};
#endif

    return set.find(x) != set.end();
}

int main() {
    SuffixSet set;

    for (std::size_t i = 0; i < 1000000; ++i) {
        char buffer[4];
        set.emplace(ToString(buffer, i));
    }

    int x = 0;
    for (std::size_t i = 0; i < 10000000; ++i) {
        char buffer[4];
        if (Contains(set, ToString(buffer, i % 1200000)))
            ++x;
    }

#ifdef COUNT
    fprintf(stderr, "y=%llu\n", y);
#endif
    return x;
}
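
(The numbers above are plain perf stat output; compile this file once with -DFOO and once without it to get the two variants being compared.)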