digital-fabric / extralite

Ruby on SQLite
http://www.rubydoc.info/gems/extralite
MIT License
253 stars 8 forks source link

Degraded performance with multiple threads #38

Closed dbackeus closed 10 months ago

dbackeus commented 11 months ago

On the Naming Things Discord, @oldmoe mentioned that while extralite has superior performance in single threaded benchmark, it may not perform well in multi threaded scenarios. Seeing this I decided to try some benchmarking. My results appear to confirm this suggestion. But it also wouldn't surprise me if my benchmarking method could be improved.

Script

require "sqlite3"
require "extralite"
require "benchmark"

# Setup

File.delete("benchmark.sqlite3") if File.exist?("benchmark.sqlite3")

POOL_SIZE = 10

EXTRALITE_CONNECTIONS = POOL_SIZE.times.map do
  db = Extralite::Database.new("benchmark.sqlite3")
  db.execute("PRAGMA journal_mode = WAL")
  db.execute("PRAGMA synchronous = NORMAL")
  db.execute("PRAGMA journal_size_limit = 64000000")
  db.execute("PRAGMA mmap_size = 128000000")
  db.execute("PRAGMA cache_size = 2000")
  db.execute("PRAGMA busy_timeout = 5000")
  db
end

SQLITE3_CONNECTIONS = POOL_SIZE.times.map do
  db = SQLite3::Database.new("benchmark.sqlite3")
  db.execute("PRAGMA journal_mode = WAL")
  db.execute("PRAGMA synchronous = NORMAL")
  db.execute("PRAGMA journal_size_limit = 64000000")
  db.execute("PRAGMA mmap_size = 128000000")
  db.execute("PRAGMA cache_size = 2000")
  db.execute("PRAGMA busy_timeout = 5000")
  db
end

EXTRALITE_CONNECTIONS[0].execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT, updated_at TEXT) STRICT")
insert_statement = EXTRALITE_CONNECTIONS[0].prepare("INSERT INTO users (name, created_at, updated_at) VALUES (?, ?, ?)")
100.times do
  insert_statement.execute("John Doe", Time.now.iso8601, Time.now.iso8601)
end

# Benchmark variations

THREAD_COUNTS = [1, 2, 4, 8]
LIMITS = [1, 10, 100]
CLIENTS = %w[extralite sqlite3]

# Benchmark

GC.disable
Benchmark.bm do |x|
  LIMITS.each do |limit|
    THREAD_COUNTS.each do |thread_count|
      CLIENTS.each do |client|
        GC.start

        x.report("#{client.rjust('extralite'.length)} - limit: #{limit}, threads: #{thread_count}") do
          threads = thread_count.times.map do |thread_number|
            Thread.new do
              start = Time.now
              if client == "extralite"
                10_000.times do
                  records = EXTRALITE_CONNECTIONS[thread_number].query_ary("SELECT * FROM users LIMIT #{limit}")
                  raise "Expected #{limit} but got #{length}" unless records.length == limit
                end
              else
                10_000.times do
                  records = SQLITE3_CONNECTIONS[thread_number].query("SELECT * FROM users LIMIT #{limit}").entries
                  raise "Expected #{limit} but got #{length}" unless records.length == limit
                end
              end
            end
          end
          threads.each(&:join)
        end
      end
      puts
    end
  end
end
GC.enable

Results

These results are from a Hetzner server equipped with a Ryzen 5950x CPU running Ubuntu 22.04 / Ruby 3.3.0-rc1. I'm using extralite 2.3 and sqlite3 1.6.9.

                        user     system      total        real
extralite - limit: 1, threads: 1  0.140148   0.000000   0.140148 (  0.143628)
  sqlite3 - limit: 1, threads: 1  0.128365   0.017156   0.145521 (  0.146815)

extralite - limit: 1, threads: 2  0.389131   0.143870   0.533001 (  0.353893)
  sqlite3 - limit: 1, threads: 2  0.290792   0.005425   0.296217 (  0.295629)

extralite - limit: 1, threads: 4  1.295837   0.741622   2.037459 (  1.010532)
  sqlite3 - limit: 1, threads: 4  0.568105   0.041114   0.609219 (  0.608620)

extralite - limit: 1, threads: 8  2.167029   1.389204   3.556233 (  1.739894)
  sqlite3 - limit: 1, threads: 8  1.103070   0.026563   1.129633 (  1.128541)

extralite - limit: 10, threads: 1  0.115242   0.012818   0.128060 (  0.129256)
  sqlite3 - limit: 10, threads: 1  0.294360   0.009548   0.303908 (  0.303867)

extralite - limit: 10, threads: 2  0.706740   0.278774   0.985514 (  0.730320)
  sqlite3 - limit: 10, threads: 2  0.575911   0.019264   0.595175 (  0.595926)

extralite - limit: 10, threads: 4  1.931356   1.165848   3.097204 (  2.578283)
  sqlite3 - limit: 10, threads: 4  1.124513   0.035563   1.160076 (  1.159298)

extralite - limit: 10, threads: 8  4.063315   2.494903   6.558218 (  5.356409)
  sqlite3 - limit: 10, threads: 8  2.298358   0.022259   2.320617 (  2.319177)

extralite - limit: 100, threads: 1  0.515307   0.000000   0.515307 (  0.518008)
  sqlite3 - limit: 100, threads: 1  1.762169   0.038139   1.800308 (  1.803624)

extralite - limit: 100, threads: 2  5.331649   4.482695   9.814344 (  9.385591)
  sqlite3 - limit: 100, threads: 2  3.631310   0.000000   3.631310 (  3.627896)

extralite - limit: 100, threads: 4 12.327495   9.248137  21.575632 ( 20.498224)
  sqlite3 - limit: 100, threads: 4  7.212354   0.000000   7.212354 (  7.209987)

extralite - limit: 100, threads: 8 26.625104  19.910282  46.535386 ( 44.283113)
  sqlite3 - limit: 100, threads: 8 14.774608   0.000000  14.774608 ( 14.769324)

I've noted that run to run variance can be quite high. But the pattern seems clear in that latency of sqlite3 increases linearly when adding threads (as one would expect from a GVL locking implemnentation), while extralite latency increase is a bit all over the place, but overall higher than linear, and probably higher than one would like. The jump from 1 to 2 threads is especially gruesome, eg. the 100 limit test being 10x slower when using 2 threads instead of 1.

Oldmoe's hypothesis is that excessive GVL releasing may be hurting performance. But it may require a deeper dive to reason about all the factors. Or maybe I'm just doing the benchmarking wrong 😅

noteflakes commented 11 months ago

Oldmoe's hypothesis is that excessive GVL releasing may be hurting performance. But it may require a deeper dive to reason about all the factors. Or maybe I'm just doing the benchmarking wrong 😅

I can confirm the results. I guess the cause for the slowness and high variance is the GVL releasing. This is certainly disappointing. I'll guess the only solution is to add a mode where the GVL is not released :smile: .

dbackeus commented 11 months ago

Maybe there's a sweet spot middle ground to figure out? I feel like, on the whole, some GVL release is necessary to prevent applications from completely freezing up on long running queries. Even a small performance penalty, if necessary, would be worth that flexibility.

Here's what moe suggested:

I think the idea of releasing the GVL for every statement is unnecessary overhead, moreover, releasing it for every row fetched is just not sensible. For multi threaded environments, I think the best approach would be to have a version of execute that runs without the GVL, and only for the first call of step, this way, application writers can use these for specific purposes, like for example the AR load_async interface, for SQLite this could run in a background thread with the GVL turned off during the first and only the first call to step,

Even in a fiber based environment, a background thread to run these long queries should be OK in most cases.

I don't have any insight into the implementation details so I'm not sure what it means, but it seems to suggest some middle ground might be possible.

noteflakes commented 11 months ago

I think the idea of releasing the GVL for every statement is unnecessary overhead, moreover, releasing it for every row fetched is just not sensible. For multi threaded environments, I think the best approach would be to have a version of execute that runs without the GVL, and only for the first call of step, this way, application writers can use these for specific purposes, like for example the AR load_async interface, for SQLite this could run in a background thread with the GVL turned off during the first and only the first call to step,

This sounds like a good idea. Let me see what I can do.

noteflakes commented 11 months ago

After implementing a gvl_mode setting and adding db.gvl_mode = :hold to the benchmark script, here are the results:

       user     system      total        real
extralite - limit: 1, threads: 1  0.011072   0.003621   0.014693 (  0.014709)
  sqlite3 - limit: 1, threads: 1  0.026727   0.004549   0.031276 (  0.031354)

extralite - limit: 1, threads: 2  0.029291   0.003167   0.032458 (  0.032408)
  sqlite3 - limit: 1, threads: 2  0.052903   0.012266   0.065169 (  0.065166)

extralite - limit: 1, threads: 4  0.057473   0.011788   0.069261 (  0.069064)
  sqlite3 - limit: 1, threads: 4  0.116377   0.028080   0.144457 (  0.144495)

extralite - limit: 1, threads: 8  0.105049   0.016368   0.121417 (  0.121112)
  sqlite3 - limit: 1, threads: 8  0.216640   0.023925   0.240565 (  0.240585)

extralite - limit: 10, threads: 1  0.024705   0.000084   0.024789 (  0.025352)
  sqlite3 - limit: 10, threads: 1  0.048553   0.008088   0.056641 (  0.056730)

extralite - limit: 10, threads: 2  0.048144   0.000000   0.048144 (  0.048180)
  sqlite3 - limit: 10, threads: 2  0.095901   0.015965   0.111866 (  0.111927)

extralite - limit: 10, threads: 4  0.083829   0.007898   0.091727 (  0.091426)
  sqlite3 - limit: 10, threads: 4  0.189711   0.024126   0.213837 (  0.214113)

extralite - limit: 10, threads: 8  0.182138   0.011527   0.193665 (  0.193720)
  sqlite3 - limit: 10, threads: 8  0.388128   0.028402   0.416530 (  0.416619)

extralite - limit: 100, threads: 1  0.104511   0.000000   0.104511 (  0.104554)
  sqlite3 - limit: 100, threads: 1  0.260976   0.016016   0.276992 (  0.277149)

extralite - limit: 100, threads: 2  0.172389   0.011990   0.184379 (  0.184530)
  sqlite3 - limit: 100, threads: 2  0.512232   0.031894   0.544126 (  0.544166)

extralite - limit: 100, threads: 4  0.313612   0.036185   0.349797 (  0.349906)
  sqlite3 - limit: 100, threads: 4  1.028803   0.039815   1.068618 (  1.068644)

extralite - limit: 100, threads: 8  0.615005   0.063960   0.678965 (  0.677928)
  sqlite3 - limit: 100, threads: 8  2.079794   0.091753   2.171547 (  2.171997)
noteflakes commented 11 months ago

I'll continue work on adding a hybrid GVL mode as per @oldmoe's suggestion, where the GVL is released on fetching the first row, then held on the rest and we'll see how that performs.

oldmoe commented 10 months ago

A note on David's supplied benchmark, it lacks a single threaded baseline, since the threads = 1 state has actually another thread spawned next to the main thread, I believe this is where Extralite works best, and will probably be slightly better without the GVL release.

noteflakes commented 10 months ago

Fixed in #46.