Closed squeaktoy closed 1 month ago
I told you on IRC that I prefer using only lower-case strings and prefer not to have this comparison operator. So ... what's up?
I thought you said that either method could be done, but that you preferred the lowercase strings. The problem with making the key lowercase right before the comparison is that it has more gotchas than this method. A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work. With this commit, the way the comparison is made is changed altogether, and as such it won't have gotchas like that. It also doesn't look like a hack, while making a string lowercase before comparing (which requires putting a const char array into a new buffer) does look like a hack.
A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work
That's already true, isn't it?
Are there any more "gotchas"? Because that's the only one you named.
while making a string lowercase before comparing (which requires putting a const char array into a new buffer) does look like a hack
Not to me. You need a small fixed-size buffer for this - we're talking about strings that are probably all 4 characters or less - if a string is longer than that, you can skip the search completely. This is so simple, that's the reason I prefer using just lower-case everywhere.
But what does look like a hack is this ugly mess of a struct, IgnoreCaseComparator - much more complicated than it needs to be. I mean this:
template <typename T>
concept char_ptr = std::same_as<std::decay_t<T>, const char *> ||
                   std::same_as<std::decay_t<T>, char *>;

template <typename T>
concept not_char_ptr = !char_ptr<T>;

struct IgnoreCaseComparator {
    using is_transparent = void;

    template<not_char_ptr T, not_char_ptr U>
    bool operator()(const T &a, const U &b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end(),
                                            [](char left, char right) {
                                                return std::tolower(left) < std::tolower(right);
                                            });
    }

    template<char_ptr T, not_char_ptr U>
    bool operator()(const T &left, const U &right) const {
        return (*this)(std::string_view(left), right);
    }

    template<not_char_ptr T, char_ptr U>
    bool operator()(const T &left, const U &right) const {
        return (*this)(left, std::string_view(right));
    }
};
is equivalent to:
struct IgnoreCaseComparator {
    using is_transparent = void;

    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end(),
                                            [](char left, char right) {
                                                return std::tolower(left) < std::tolower(right);
                                            });
    }
};
... because instead of this juggling with C++20 concepts and method specializations, you can just let the compiler convert things implicitly to std::string_view. But yeah, C++ lets you turn the complexity slider up to eleven if you want to.
A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work
That's already true, isn't it?
When using a const char array with suffixes, the comparisons are already case-insensitive, no matter if the suffixes in the char array are uppercase or lowercase. I just tested this by changing s3m to S3M in the openmpt decoder plugin's suffix array and updating the database. I was able to see files ending in s3m, s3M, and S3M. So no, that's not true. If I were to go with your idea of modifying the key before the comparison, then I'd introduce this gotcha which currently doesn't exist.
Another gotcha with your proposal is that with a fixed buffer, we suddenly introduce a limit on how long a file extension is allowed to be before some kind of bug gets triggered. I don't really like the idea of introducing arbitrary limits in buffers, especially when C++ can abstract those away. While it's possible to copy a const char array into a std::string, iirc that puts it on the heap. As far as I can understand the code in my commit, it uses string_view, so I would expect it to be more efficient than copying char arrays or strings around, without the downside of an arbitrary buffer-size limit as in your proposal.
I just modified IgnoreCaseComparator to be simpler and it works on my end.
A decoder plugin would have to make sure that the std::set contains only lowercase suffixes or the whole thing won't work
That's already true, isn't it?
When using a const char array with suffixes, the comparisons are already case-insensitive, no matter if the suffixes in the char array are uppercase or lowercase. I just tested this by changing s3m to S3M in the openmpt decoder plugin's suffix array and updating the database. I was able to see files ending in s3m, s3M, and S3M. So no, that's not true.
What? You changed a plugin to emit upper-case suffixes to disprove my assertion that the suffixes are always lower-case? I don't get it. This makes no sense at all.
If I were to go with your idea of modifying the key before the comparison, then I'd introduce this gotcha which currently doesn't exist.
You created an artificial gotcha and then complain that the gotcha you created exists. I don't get it.
Another gotcha with your proposal is that with a fixed buffer, suddenly we introduce some kind of limit to how long a file extension is allowed to be before some kind of bug gets triggered.
What bug gets triggered? I don't get it.
I don't really like the idea of introducing arbitrary limits in buffers, especially when C++ can abstract those away.
Which comes at a cost.
While it's possible to copy a const char array into a std::string, iirc that puts it on the heap.
This is the cost. Making things dynamic is a non-zero-cost abstraction. But this argument is a straw man. Nobody suggested using a std::string.
As far as I can understand the code in my commit, it uses string_view and so I would expect it to be more efficient than copying char arrays or strings around, without the downside that an arbitrary buffer size limit would have like in your proposal.
Your expectation is wrong. I tried benchmarking your code against my idea; inserting a million strings into a SuffixSet, and then doing ten million lookups. This is your version:
8,643.26 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
12,509 page-faults:u # 1.447 K/sec
22,282,246,448 cycles:u # 2.578 GHz
29,041,683,748 instructions:u # 1.30 insn per cycle
6,464,829,174 branches:u # 747.962 M/sec
81,039,935 branch-misses:u # 1.25% of all branches
and this is mine:
3,919.86 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
19,635 page-faults:u # 5.009 K/sec
10,049,712,992 cycles:u # 2.564 GHz
10,584,204,270 instructions:u # 1.05 insn per cycle
3,097,294,132 branches:u # 790.154 M/sec
52,383,153 branch-misses:u # 1.69% of all branches
Wow, yours is slower by a factor of two!!
I expected your version to be slower, but I didn't expect it to be that slow. Of course, it's slower. My code calls std::tolower()
40 million times (10 million tests, 4 characters per looked-up key, transformation is only done once per lookup); yours calls std::tolower()
895 million times (calling it again and again for both the looked-up key and the current tree node, for each level of the tree being traversed). To me, it was obvious that it must be slower.
What? You changed a plugin to emit upper-case suffixes to disprove my assertion that the suffixes are always lower-case? I don't get it. This makes no sense at all.
You created an artificial gotcha and then complain that the gotcha you created exists. I don't get it.
What I was trying to say is that right now, there's no code that assumes that suffixes are always lowercase (outside the scope of suffixes_function), so I was trying to be compliant with that and not introduce any assumptions into my code.
Another gotcha with your proposal is that with a fixed buffer, suddenly we introduce some kind of limit to how long a file extension is allowed to be before some kind of bug gets triggered.
What bug gets triggered? I don't get it.
I was thinking that if you were to stop searching after a certain buffer size limit were reached, it might potentially result in two files with different long extensions being treated as the same. Or a long file extension might not be recognized at all just because there is a fixed buffer. This is pretty theoretical, though.
I don't really like the idea of introducing arbitrary limits in buffers, especially when C++ can abstract those away.
Which comes at a cost.
Well yeah, but I was actually thinking that a comparator could prevent copying entire buffers by only lowering the case during the comparisons. But this was a rough guess, as I hadn't benchmarked it yet.
While it's possible to copy a const char array into a std::string, iirc that puts it on the heap.
This is the cost. Making things dynamic is a non-zero-cost abstraction. But this argument is a straw man. Nobody suggested using a std::string.
Thanks for pointing out the logical fallacy. I'll explain my thought process. While you didn't come up with the idea of using std::string, I personally found it to be a very simple solution to the problem. Here's the code I had written before, which might be similar to your proposal:
diff --git a/src/db/update/Walk.cxx b/src/db/update/Walk.cxx
index cafda30e0..d958dc23c 100644
--- a/src/db/update/Walk.cxx
+++ b/src/db/update/Walk.cxx
@@ -23,6 +23,7 @@
#include "Log.hxx"
#include <cassert>
+#include <cctype>
#include <cerrno>
#include <exception>
#include <memory>
@@ -187,9 +188,13 @@ UpdateWalk::UpdateRegularFile(Directory &directory,
if (suffix == nullptr)
return false;
- return UpdateSongFile(directory, name, suffix, info) ||
- UpdateArchiveFile(directory, name, suffix, info) ||
- UpdatePlaylistFile(directory, name, suffix, info);
+ std::string lower_suffix(suffix);
+ std::transform(lower_suffix.begin(), lower_suffix.end(), lower_suffix.begin(),
+ [](unsigned char c){ return std::tolower(c); });
+
+ return UpdateSongFile(directory, name, lower_suffix.c_str(), info) ||
+ UpdateArchiveFile(directory, name, lower_suffix.c_str(), info) ||
+ UpdatePlaylistFile(directory, name, lower_suffix.c_str(), info);
}
void
It's a three-line fix to the problem, but I personally disliked it because it potentially copies a char array into a std::string on the heap, and I believed that a comparator could be faster and cleaner, while also avoiding the assumption that suffixes are lowercase, without resorting to a fixed buffer.
Your expectation is wrong. I tried benchmarking your code against my idea; inserting a million strings into a SuffixSet, and then doing ten million lookups. This is your version:
Which commit? Is it the latest?
and this is mine:
Which commit or what code?
Also how did you write that benchmark? I appreciate the effort but it would be nice to know so I can do that to test my own code next time.
I was thinking that if you were to stop searching after a certain buffer size limit were reached, it might result in two files which have long file extensions that are different to be treated the same, potentially.
Remember when I wrote "if a string is longer than that, you can skip the search completely"? I already covered both cases you mentioned in my initial suggestion.
Well yeah, but I was actually thinking that a comparator could prevent copying entire buffers by only lowering the case during the comparisons. But this was a rough guess, as I hadn't benchmarked it yet.
In this case, copying the buffer isn't more expensive than lowering the case during comparison - and you're doing a lot of comparisons per lookup, but only one copy per lookup.
std::string lower_suffix(suffix); std::transform(lower_suffix.begin(), lower_suffix.end(), lower_suffix.begin(), [](unsigned char c){ return std::tolower(c); });
And this is what I explicitly said I wouldn't do. Because it allocates memory. My suggestion was to use a fixed-size buffer (on the stack) which is cheap.
Also how did you write that benchmark? I appreciate the effort but it would be nice to know so I can do that to test my own code next time.
It's a useless micro-benchmark; I only did it because I wanted to test whether you were right, and it turns out my gut feeling was correct.
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <set>
#include <string>
#include <string_view>

#ifdef COUNT
static unsigned long long y;
#endif

#ifdef FOO

struct IgnoreCaseComparator {
    using is_transparent = void;

    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end(),
                                            [](char left, char right) {
#ifdef COUNT
                                                ++y;
#endif
                                                return std::tolower(left) < std::tolower(right);
                                            });
    }
};

using SuffixSet = std::set<std::string, IgnoreCaseComparator>;

#else

using SuffixSet = std::set<std::string, std::less<>>;

#endif

static std::string_view ToString(char *buffer, unsigned i) noexcept
{
    buffer[0] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[1] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[2] = ' ' + (i & 0x7f);
    i >>= 7;
    buffer[3] = ' ' + (i & 0x7f);
    return std::string_view{buffer, 4};
}

static bool Contains(const SuffixSet &set, std::string_view x) noexcept
{
#ifndef FOO
    char buffer[4];
    if (x.size() > sizeof(buffer))
        return false;

    char *end = std::transform(x.begin(), x.end(), buffer, [](char ch){
#ifdef COUNT
        ++y;
#endif
        return tolower(ch);
    });

    x = {buffer, (size_t)(end - buffer)};
#endif

    return set.find(x) != set.end();
}

int main() {
    SuffixSet set;
    for (std::size_t i = 0; i < 1000000; ++i) {
        char buffer[4];
        set.emplace(ToString(buffer, i));
    }

    int x = 0;
    for (std::size_t i = 0; i < 10000000; ++i) {
        char buffer[4];
        if (Contains(set, ToString(buffer, i % 1200000)))
            ++x;
    }

#ifdef COUNT
    fprintf(stderr, "y=%llu\n", y);
#endif

    return x;
}
Previously, when using a suffixes function in a decoder plugin, suffixes were directly compared in a case-sensitive manner, so a file ending in e.g. .mp3 would not be treated the same as .MP3, and then no decoder plugin might be found for the .MP3 file.
This commit introduces a comparator that makes comparisons ignore case by always lowering the case on both inputs.
The long and unwieldy std::set<std::string, std::less<>> is replaced with DecoderPlugin::SuffixSet, so suffixes functions no longer have to depend on an arbitrary type that may change in the future and can from now on use this abstraction.
Credit for the comparator goes to: Kevin "Alipha" Spinar, from #C++ on libera.chat