Closed nalgeon closed 6 months ago
After that, I created a table with 1M random strings using the following script:
create table strings as
select cast(random() as text) as val
from generate_series(1, 1000000);
Then I re-run the benchmark with the following query:
select count(*) from strings where val regexp "1[3-4]";
And got the following results:
Benchmark 1: ./sqlite-regex.sh
Time (mean ± σ): 881.3 ms ± 6.5 ms [User: 869.6 ms, System: 10.2 ms]
Range (min … max): 875.0 ms … 892.8 ms 10 runs
Benchmark 2: ./sqlean-re.sh
Time (mean ± σ): 428.4 ms ± 6.0 ms [User: 418.6 ms, System: 6.3 ms]
Range (min … max): 422.0 ms … 438.0 ms 10 runs
Summary
'./sqlean-re.sh' ran
2.06 ± 0.03 times faster than './sqlite-regex.sh'
Does this give me enough reason to claim that sqlite-regex
is 2x slower than sqlean-re
? I don't think so.
Hey @nalgeon - first off thank you for taking the time to verify the benchmarks, I really do appreciate it. They only get better with the more people actually test it.
Second, I do apologize for the wrong regex when benchmarking regexp/sqlite-re - I'm not the best with regular expression languages, so I just went with the first thing that compiled correctly for me. That's obviously wrong, so I started to a PR to update them with the more correct [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
pattern: https://github.com/asg017/sqlite-regex/pull/4
However, I'm having trouble reproducing your results. On the benchmarks-improve
branch under benchmarks/dates/
I get:
Benchmark #1: ./sqlite-regex.sh
Time (mean ± σ): 54.4 ms ± 1.1 ms [User: 43.5 ms, System: 5.8 ms]
Range (min … max): 52.2 ms … 57.5 ms 50 runs
Benchmark #2: ./regexp.sh
Time (mean ± σ): 124.7 ms ± 1.3 ms [User: 113.7 ms, System: 5.8 ms]
Range (min … max): 123.3 ms … 127.2 ms 22 runs
Benchmark #3: ./sqlean-re.sh
Time (mean ± σ): 219.8 ms ± 2.0 ms [User: 208.0 ms, System: 6.1 ms]
Range (min … max): 215.5 ms … 222.2 ms 13 runs
Summary
'./sqlite-regex.sh' ran
2.29 ± 0.05 times faster than './regexp.sh'
4.04 ± 0.09 times faster than './sqlean-re.sh'
The 219ms
average for sqlean-re.sh
is much better than the 257ms
I was getting before, but still 4x slower than sqlite-regex.sh
(compared to 6x).
I also added your random string tests to benchmarks/random
, and I get:
Benchmark #1: ./sqlite-regex.sh
Time (mean ± σ): 171.4 ms ± 2.4 ms [User: 155.3 ms, System: 11.0 ms]
Range (min … max): 168.7 ms … 177.9 ms 17 runs
Benchmark #2: ./regexp.sh
Time (mean ± σ): 222.5 ms ± 2.4 ms [User: 206.5 ms, System: 10.9 ms]
Range (min … max): 219.0 ms … 227.5 ms 13 runs
Benchmark #3: ./sqlean-re.sh
Time (mean ± σ): 287.2 ms ± 3.1 ms [User: 270.7 ms, System: 11.3 ms]
Range (min … max): 283.3 ms … 292.0 ms 10 runs
Summary
'./sqlite-regex.sh' ran
1.30 ± 0.02 times faster than './regexp.sh'
1.68 ± 0.03 times faster than './sqlean-re.sh'
I ran this on my 2019 intel macbook, and I get similar results on a digitalocean droplet running unbuntu.
A few questions about your setup:
regex0.dylib
are you running? If you're compiling it locally, make sure to run make loadable-release
instead of make loadable
to get a release buildre.dylib
? I'm using the -O3
flag on gcc
to get a more optimized build, but perhaps there's something bettertest.db
? My instructions aren't the best, but when built the corpus
table has 206215
rows, and all the scripts return 66
:$ ./sqlite-regex.sh
66
$ ./sqlean-re.sh
66
$ ./regexp.sh
66
Any more info you can provide would be greatly appreciated. I'll also see about adding the new sqlean-regexp
extension you just published. Thank you again for your work!
The reason may be the different CPU architecture.
Which version of
regex0.dylib
are you running?
The one downloaded from v0.1.0
How are you compiling
re.dylib
? I'm using the-O3
flag ongcc
to get a more optimized build, but perhaps there's something better
I wasn't even using optimize flags:
gcc -fPIC -dynamiclib -I src src/sqlite3-re.c src/re.c -o dist/re.dylib
With -O3
, I've got even better results for sqlean-re
.
How are you building
test.db
? My instructions aren't the best, but when built thecorpus
table has206215
rows, and all the scripts return66
Same here.
My point is, such narrow benchmarks are not enough to make strong claims as "sqlean-re is X times slower than sqlite-regex" (or vice versa).
Look at your results — even on your machine, they differ 2x depending on the use case:
benchmarks/dates
4.04 ± 0.09 times faster than './sqlean-re.sh'
benchmarks/random
1.68 ± 0.03 times faster than './sqlean-re.sh'
I've prepared a (simplified) reproducible build in the actionist repo.
-- sqlite-regex --
312490
Run Time: real 0.177 user 0.177318 sys 0.000000
312490
Run Time: real 0.175 user 0.174948 sys 0.000000
312490
Run Time: real 0.174 user 0.166388 sys 0.008010
-- sqlean-re --
312490
Run Time: real 0.232 user 0.228178 sys 0.004029
312490
Run Time: real 0.232 user 0.231506 sys 0.000000
312490
Run Time: real 0.231 user 0.227128 sys 0.003974
-- sqlite-regex --
312185
Run Time: real 0.282 user 0.265625 sys 0.031250
312185
Run Time: real 0.282 user 0.250000 sys 0.015625
312185
Run Time: real 0.297 user 0.281250 sys 0.031250
-- sqlean-re --
312185
Run Time: real 0.469 user 0.468750 sys 0.000000
312185
Run Time: real 0.484 user 0.468750 sys 0.015625
312185
Run Time: real 0.470 user 0.437500 sys 0.031250
-- sqlite-regex --
312737
Run Time: real 0.268 user 0.254500 sys 0.014246
312737
Run Time: real 0.274 user 0.259429 sys 0.014321
312737
Run Time: real 0.317 user 0.270525 sys 0.019288
-- sqlean-re --
312737
Run Time: real 0.536 user 0.468684 sys 0.022052
312737
Run Time: real 0.495 user 0.463322 sys 0.015018
312737
Run Time: real 0.562 user 0.470800 sys 0.017376
So according to these results, a more accurate estimation is "sqlean-re is x1.3 to x1.8 slower on this specific dataset and query". Still not enough for universal claims like the one in your docs/article, IMO.
I guess that Rust is not optimized for ARM processors because sqlite-regex
results on Apple M1 are significantly worse than on any other platform:
./sqlite3 dist/bench.db < sqlite-regex.sql
-- sqlite-regex --
312501
Run Time: real 0.967 user 0.917926 sys 0.016366
312501
Run Time: real 0.868 user 0.858967 sys 0.009061
312501
Run Time: real 0.870 user 0.860130 sys 0.009177
./sqlite3 dist/bench.db < sqlean-re.sql
-- sqlean-re --
312501
Run Time: real 0.273 user 0.264191 sys 0.008496
312501
Run Time: real 0.253 user 0.230817 sys 0.005361
312501
Run Time: real 0.231 user 0.227151 sys 0.004345
P.S. I'm sorry if my comments are offensive in any way. I'm genuinely impressed by your approach with sqlite-loadable-rs
and writing extensions in Rust — very smart!
It's a shame I did all this work for nothing, but I don't care anymore. Closing this.
@nalgeon you didn't do it for nothing! you gave us some thoughts and reminder to validate benchmarks for libraries we decided to use as we designed our SQLite for email infrastructure
@nalgeon Those are interesting, we are building regex support for our product and I did a quick PoC with sqlite-regex since it was the easiest to load into our program. Which version of Rust Regex did you use? They did a whole rewrite recently so it would be interesting to see. Here we are essentially comparing Rust Regex to other engines since this crate is mostly a small shim.
Hello Alex! I checked the claims on the Benchmarks page regarding
sqlean-re
being 6.57x slower thansqlite-regex
and found the results unreproducible. Also, the test itself is very unfair.The unfair test
I'll start with the latter, as it is independent of the hardware. You are benchmarking
sqlite-regex
with this pattern:While for
sqlean-re
, you use this pattern:These are very different patterns. You introduced eight capturing groups into the second pattern, which makes this regexp significantly slower. The first pattern, on the other hand, has zero capturing groups. I don't think you can compare these two patterns in the same benchmark.
To make the patterns roughly equivalent, the second one should be:
The unreproducible results
After changing the pattern for
sqlean-re
to be equivalent tosqlite-regex
, I ran your benchmark on MacBook Air (M1, 2020), and got these results:So much for "6.57x slower".
I don't think that the disclaimer "Benchmarks are hard and easy to game" justifies your claims about the relative performance of different regexp implementations.