crystal-lang / crystal-sqlite3

SQLite3 bindings for Crystal
MIT License
139 stars 30 forks source link

Invalid memory access when using `#scalar` or `#exec` directly in multi-thread mode (`-Dpreview_mt`) #87

Closed luislavena closed 1 year ago

luislavena commented 1 year ago

Hello!

Been testing out preview_mt mode with SQLite and encountered some issues. It has been hard to reproduce this, but I was able to get a small test program:

require "sqlite3"

COUNT = 10

puts "preview_mt: ", {{ flag?(:preview_mt) }}

db = DB.open("sqlite3:data.db?journal_mode=wal&busy_timeout=5000")

COUNT.times do |i|
  spawn do
    rand_id = rand(1..500)

    puts db.scalar("SELECT age FROM contacts WHERE id = ? LIMIT 1;", rand_id).as(Int64)
  end
end

Fiber.yield

db.close

data.db contains a simple schema with 500 records in it:

require "sqlite3"

CREATE_SQL = <<-SQL
  CREATE TABLE IF NOT EXISTS contacts (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL,
    age INTEGER NOT NULL
  );
SQL

INSERT_SQL = <<-SQL
  INSERT INTO contacts (id, name, age) VALUES (?, ?, ?);
SQL

DB.open("sqlite3:data.db?journal_mode=wal&busy_timeout=5000") do |db|
  db.exec CREATE_SQL

  db.transaction do
    500.times do |t|
      id = t + 1
      db.exec INSERT_SQL, id, "Name #{id}", id + 20
    end
  end
end

When compiled and executed with -Dpreview_mt, I randomly receive something like the following:

preview_mt: 
true
202
500
393
337
Invalid memory access (signal 11) at address 0x0
Unhandled exception in spawn: bad parameter or other API misuse (SQLite3::Exception)
[0x557449e450f6] *Exception::CallStack::print_backtrace:Nil +118 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
Invalid memory access (signal 11) at address 0x0
426
[0x557449e30646] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e450f6] *Exception::CallStack::print_backtrace:Nil +118 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x7f20a5a42520] ?? +139778194679072 in /lib/x86_64-linux-gnu/libc.so.6
[0x557449e30646] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x7f20a5ffbc76] sqlite3_reset +22 in /lib/x86_64-linux-gnu/libsqlite3.so.0
[0x7f20a5a42520] ?? +139778194679072 in /lib/x86_64-linux-gnu/libc.so.6
[0x557449f2c42b] *SQLite3::ResultSet#do_close:Int32 +27 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x7f20a5ffbc76] sqlite3_reset +22 in /lib/x86_64-linux-gnu/libsqlite3.so.0
[0x557449f2c3dc] *SQLite3::ResultSet +60 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f2c42b] *SQLite3::ResultSet#do_close:Int32 +27 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f29f63] *DB::Statement+ +291 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f2c3dc] *SQLite3::ResultSet +60 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f2a7b9] *DB::PoolStatement+ +633 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f29f63] *DB::Statement+ +291 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f1195b] *DB::Database +59 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f2a7b9] *DB::PoolStatement+ +633 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e31b9b] ~procProc(Nil) +155 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449f1195b] *DB::Database +59 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e8bc52] *Fiber#run:(IO::FileDescriptor | Nil) +114 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e31b9b] ~procProc(Nil) +155 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e30166] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +6 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x557449e8bc52] *Fiber#run:(IO::FileDescriptor | Nil) +114 in /home/luis/repos/github.com/luislavena/test-sqlite3-mt/bin/test
[0x0] ???

Sometimes receiving bad parameter or other API misuse, others unable to close due to unfinalized statements or unfinished backups

I was able to workaround this by using using_connection block instead:

COUNT.times do |i|
  spawn do
    rand_id = rand(1..500)

    db.using_connection do |conn|
      puts conn.scalar("SELECT age FROM contacts WHERE id = ? LIMIT 1;", rand_id).as(Int64)
    end
  end
end

Which points there might be an issue on how #exec, #scalar and possible other calls are handled by DB::Database and passed to one of the connections in the pool.

But even so, I will still randomly get errors (small adaptation to my example on my site).

Since I'm blowing things up, decided to put more dynamite into the issue and tried ysbaddaden/pool shard:

db = ConnectionPool(DB::Connection).new(capacity: 20) do
  DB.connect(DATABASE_URL)
end

And then:

    db.connection do |conn|
      conn.scalar("SELECT name FROM contacts WHERE id = ? LIMIT 1;", id).as(String)
    end

To my surprise, these random exceptions are gone and performance is even higher:

$ plow -c 20 -d 10s http://localhost:8080
Benchmarking http://localhost:8080 for 10s using 20 connection(s).
@ Real-time charts is listening on http://[::]:18888

Summary:
  Elapsed        10s
  Count       746816
    2xx       746816
  RPS      74681.412
  Reads    7.392MB/s
  Writes   4.131MB/s

Statistics    Min       Mean    StdDev     Max   
  Latency     29µs     266µs     804µs   35.593ms
  RPS       71105.09  74675.39  1390.64  75844.23

Latency Percentile:
  P50     P75    P90    P95     P99     P99.9   P99.99 
  146µs  169µs  201µs  257µs  6.482ms  6.658ms  6.901ms

Latency Histogram:
  149µs    632328  84.67%
  268µs     99222  13.29%
  3.697ms    7679   1.03%
  6.503ms    5546   0.74%
  6.592ms    1752   0.23%
  6.656ms     190   0.03%
  6.786ms      87   0.01%
  7.004ms      12   0.00%

Tried running this project own specs with -Dpreview_mt and fails with the same issue, so not sure where to start looking at.

Apologies for the rambling, but wanted to make sure I included all the information I think is connected and relevant, I hope is not too much.

Any ideas where I should start looking at first?

Thank you in advance. ❤️ ❤️ ❤️

bcardiff commented 1 year ago

This on aarch64, right? Have you tried on x86_64? Which crystal version are you using?

luislavena commented 1 year ago

@bcardiff this is on x86_64, using 1.7.3 under Alpine 3.17.

Also used static binaries from the release under Ubuntu 22.04

luislavena commented 1 year ago

In response to running crystal spec -Dpreview_mt, I had to use db.using_connection too:

diff --git spec/pool_spec.cr spec/pool_spec.cr
index 4971b94..77f34c4 100644
--- spec/pool_spec.cr
+++ spec/pool_spec.cr
@@ -5,13 +5,15 @@ describe DB::Pool do
     channel = Channel(Nil).new
     fibers = 5
     max_n = 50
-    with_db "#{DB_FILENAME}?max_pool_size=#{fibers}" do |db|
+    with_db "#{DB_FILENAME}?#{"journal_mode=wal&busy_timeout=5000&" if {{ flag?(:preview_mt) }}}max_pool_size=#{fibers}" do |db|
       db.exec "create table numbers (n integer, fiber integer)"

       fibers.times do |f|
         spawn do
           (1..max_n).each do |n|
-            db.exec "insert into numbers (n, fiber) values (?, ?)", n, f
+            db.using_connection do |conn|
+              conn.exec "insert into numbers (n, fiber) values (?, ?)", n, f
+            end
             sleep 0.01
           end
           channel.send nil

SQLite3 databases are blocked by default for insert when accessed from multiple threads/processes, so I had to change journal_mode and busy_timeout too, without them, was hitting the following error:

Unhandled exception in spawn: database is locked (SQLite3::Exception)

Without the #using_connection change was randomly hitting the previous errors.

Simplified example:

require "./src/sqlite3"

COUNT = 10

puts "preview_mt: ", {{ flag?(:preview_mt) }}

db = DB.open("sqlite3:data.db?journal_mode=wal&busy_timeout=5000")

COUNT.times do |i|
  spawn do
    rand_id = rand(1..500)

    {% if flag?(:use_conn) %}
      db.using_connection do |conn|
        puts conn.scalar("SELECT age FROM contacts WHERE id = ? LIMIT 1;", rand_id).as(Int64)
      end
    {% else %}
      puts db.scalar("SELECT age FROM contacts WHERE id = ? LIMIT 1;", rand_id).as(Int64)
    {% end %}
  end
end

sleep 5
Fiber.yield

db.close

When run with -Dpreview_mt only:

$ crystal run -Dpreview_mt test.cr 
preview_mt: 
true
145
34
263
127
399
Invalid memory access (signal 11) at address 0x7f5514436878
307
404
[0x55fe5e2804f6] *Exception::CallStack::print_backtrace:Nil +118 in /home/user/.cache/crystal/crystal-run-test.tmp
[0x55fe5e26bab6] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /home/user/.cache/crystal/crystal-run-test.tmp
[0x7f551688eacd] ?? +140003427019469 in /lib/ld-musl-x86_64.so.1

When run with -Duse_conn:

$ crystal run -Dpreview_mt -Duse_conn test.cr 
preview_mt: 
true
313
254
81
403
214
264
313
229
229
386

And here the backtrace when crashed in Ubuntu 22.04:

preview_mt: 
true
142
368
129
66
244
115Invalid memory access (signal 11) at address 0x0

271
505
[0x556f0bcb9066] *Exception::CallStack::print_backtrace:Nil +118 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bca4656] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +310 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x7f6dac642520] ?? +140109020407072 in /lib/x86_64-linux-gnu/libc.so.6
[0x7f6dacbd6c76] sqlite3_reset +22 in /lib/x86_64-linux-gnu/libsqlite3.so.0
[0x556f0bda035b] *SQLite3::ResultSet#do_close:Int32 +27 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bda030c] *SQLite3::ResultSet +60 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bd9de93] *DB::Statement+ +291 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bd9e6e9] *DB::PoolStatement+ +633 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bd8588b] *DB::Database +59 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bca5b83] ~procProc(Nil) +115 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bcffbc2] *Fiber#run:(IO::FileDescriptor | Nil) +114 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x556f0bca4176] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +6 in /home/luis/.cache/crystal/crystal-run-test.tmp
[0x0] ???
luislavena commented 1 year ago

Hello, doing a bit of digging on the weird MISUSE error and the Invalid memory access and seems to be related to the usage of sqlite3_reset when cleaning up the usage of a Prepared Statement (SQLite3::ResultSet#do_close).

Inspecting what happens in do_close by DB itself, it releases the connection back to the pool, and then we proceed to clear the statement (using sqlite3_reset).

Noticed that inverting LibSQLite3.reset(self) call and do do_close after removed the invalid memory access error, but still crashed (increased COUNT to 20 and were only getting 10-12 responses and silently ending).

Is possible that statement object is getting GC'ed before anything?

(Apologies for the noise, but trying to do dump as much details I collect while investigating this here in case someone else is looking at it).

bcardiff commented 1 year ago

Is possible that statement object is getting GC'ed before anything?

I think so, in MT. The statements are cached in a query hash. But maybe there are some race condition in MT which might cause issues.

luislavena commented 1 year ago

I think so, in MT. The statements are cached in a query hash.

Ah! I will take a look to that this weekend and report back!

Thank you for your promptly response! ❤️ ❤️ ❤️

bcardiff commented 1 year ago

I tried putting a Mutex around DB::StringKeyCache methods but I still got the invalid memory so that is still not it (or not only) ...

diff --git a/src/db/string_key_cache.cr b/src/db/string_key_cache.cr
index f2cae6292..ef77d67d5 100644
--- a/src/db/string_key_cache.cr
+++ b/src/db/string_key_cache.cr
@@ -1,21 +1,28 @@
 module DB
   class StringKeyCache(T)
     @cache = {} of String => T
+    @mutex = Mutex.new

     def fetch(key : String) : T
-      value = @cache.fetch(key, nil)
-      value = @cache[key] = yield unless value
-      value
+      @mutex.synchronize do
+        value = @cache.fetch(key, nil)
+        value = @cache[key] = yield unless value
+        value
+      end
     end

     def each_value
-      @cache.each do |_, value|
-        yield value
+      @mutex.synchronize do
+        @cache.each do |_, value|
+          yield value
+        end
       end
     end

     def clear
-      @cache.clear
+      @mutex.synchronize do
+        @cache.clear
+      end
     end
   end
 end
bcardiff commented 1 year ago

Applying the swap of the LibSQLite.reset you mention here helps reducing the errors, but they are still there.

Next potential culprit is in the DB::PoolPreparedStatement, the @connections = Set(WeakRef(Connection)).new is not thread safe.

bcardiff commented 1 year ago

I think I got it. So far it hasn't crashed here. Can you try these crystal-db and crystal-sqlite3 changes?

Above the non thread safe structures I found an issue in the pool.

luislavena commented 1 year ago

Hello @bcardiff, switched to fix-mt branches for testing and the test script used before appears to be working, even increasing COUNT=100.

However, when attempt to run this using the HTTP server from my example, I'm still getting API misuse from SQLite:

2023-03-25T11:21:01.320160Z  ERROR - http.server: Unhandled exception on HTTP::Handler
bad parameter or other API misuse (SQLite3::Exception)
  from lib/sqlite3/src/sqlite3/statement.cr:81:5 in 'check'
  from lib/sqlite3/src/sqlite3/statement.cr:4:5 in 'initialize'
  from lib/sqlite3/src/sqlite3/statement.cr:2:3 in 'new'
  from lib/sqlite3/src/sqlite3/connection.cr:19:5 in 'build_prepared_statement'
  from lib/db/src/db/connection.cr:40:40 in 'fetch_or_build_prepared_statement'
  from lib/db/src/db/session_methods.cr:23:9 in 'build'
  from lib/db/src/db/query_methods.cr:284:7 in 'scalar'
  from src/server.cr:21:5 in 'fetch_name'
  from src/server.cr:14:12 in 'call'
  from /usr/local/share/crystal/src/http/server/request_processor.cr:51:11 in 'process'
  from /usr/local/share/crystal/src/http/server.cr:515:5 in 'handle_client'
  from /usr/local/share/crystal/src/http/server.cr:468:13 in '->'
  from /usr/local/share/crystal/src/fiber.cr:146:11 in 'run'
  from /usr/local/share/crystal/src/fiber.cr:98:34 in '->'
  from ???

Repo used to reproduce this: https://github.com/luislavena/test-crystal-sqlite3-db


I'm wondering the following (bear with me on my ignorance of the internals of crystal-db): statements are created in the context of a connection, but the connection can be checked out from the pool independently of the statement we are going to execute, right?

https://github.com/crystal-lang/crystal-sqlite3/blob/master/src/sqlite3/statement.cr#L4

When we do db.scalar(query) it will use one of the connections from the pool that matches that query (fetch_or_build_prepared_statement) and according to PoolPreparedStatement, the statement has been build/prepared for all the available connections.

This means that all the time the connection of a statement is the same as the connection pooled from the pool, so when I do conn.scalar(query) it uses the prepared statement for conn connection while db.scalar(query) will retrieve the an available connection and use the prepared statement for it.

Is possible that when retrieved the prepared statement from the pool, the connection is not longer available?

When compiled and run in Ubuntu 22.04, I managed to get a different backtrace:

Invalid memory access (signal 7) at address 0x0
[0x56525a29c716] *Exception::CallStack::print_backtrace:Nil +118 in ./bin/server
[0x56525a27ef0e] ~procProc(Int32, Pointer(LibC::SiginfoT), Pointer(Void), Nil) +366 in ./bin/server
[0x7f83a3c42520] ?? +140203364984096 in /lib/x86_64-linux-gnu/libc.so.6
[0x7f83a4683381] sqlite3VdbeReset +113 in /lib/x86_64-linux-gnu/libsqlite3.so.0
[0x7f83a4671c91] sqlite3_reset +49 in /lib/x86_64-linux-gnu/libsqlite3.so.0
[0x56525a43b6ab] *SQLite3::Statement#perform_query<Tuple(Int32)>:SQLite3::ResultSet +59 in ./bin/server
[0x56525a43baff] *DB::Statement+ +159 in ./bin/server
[0x56525a43ba55] *DB::Statement+ +69 in ./bin/server
[0x56525a43b8ea] *DB::Statement+ +122 in ./bin/server
[0x56525a43c31f] *DB::PoolStatement+ +655 in ./bin/server
[0x56525a3f59f4] *DB::Database +84 in ./bin/server
[0x56525a40b11f] *RandomContactApp#fetch_age<Int32>:Int64 +63 in ./bin/server
[0x56525a40b07d] *RandomContactApp#call<HTTP::Server::Context>:Nil +109 in ./bin/server
[0x56525a411d9d] *HTTP::Server::RequestProcessor#process<IO+, IO+>:Nil +1053 in ./bin/server
[0x56525a40ccdc] *HTTP::Server#handle_client<IO+>:Nil +2092 in ./bin/server
[0x56525a2855af] ~procProc(Nil) +31 in ./bin/server
[0x56525a30669d] *Fiber#run:(IO::FileDescriptor | Nil) +125 in ./bin/server
[0x56525a27e9ad] ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) +29 in ./bin/server
[0x0] ???

Attempt to debug this further with gdb but no luck so far on what could be the culprit 😿

Will continue to investigate after a walk around the park 😀

bcardiff commented 1 year ago

The PoolPreparedStatement should handle checking out connections, despite having or not prepared the statement already. Actually the ones that are known to have the statement prepared already will be given priority. Pool#checkout receives candidates list but is not a hard restriction.

In that process closed connections should be discarded.

Knowing if a connection is closed requires trying to use it. I assume sqlite connections can't be lost from sqlite side, there are no sockets involved. Maybe I'm wrong here. If I'm wrong maybe there is a missing map between a sqlite error and ResourceConnectionLost error. This last kind of error are retried when doing db#query directly. This is the main difference when doing query over conn vs db.

bcardiff commented 1 year ago

Another thing we could try is to disable the PoolPreparedStatement cache as a way to reduce the amount of things involved. There is no current way to disable it but it might be worth of having that in other scenarios.

bcardiff commented 1 year ago

I disabled the statement cache (in a dirty way) and was able to get a different backtrace. This is using the http example with wrk -c20 -d10s http://localhost:8080. Since ends with a sigtrap it might be useful.

[0x1a120b4a4] _sigtramp +56 in /usr/lib/system/libsystem_platform.dylib
[0x1a120b4a4] _sigtramp +56 in /usr/lib/system/libsystem_platform.dylib
[0x1a77007d8] yy_reduce +31220 in /usr/lib/libsqlite3.dylib
[0x1a771b69c] sqlite3SrcListAppend +96 in /usr/lib/libsqlite3.dylib
[0x1a771b3b0] sqlite3SrcListAppendFromTerm +240 in /usr/lib/libsqlite3.dylib
[0x1a76f7bd4] sqlite3RunParser +1016 in /usr/lib/libsqlite3.dylib
[0x1a76f9b94] yy_reduce +3504 in /usr/lib/libsqlite3.dylib
[0x1a76f7bd4] sqlite3RunParser +1016 in /usr/lib/libsqlite3.dylib
[0x1a76f6f78] sqlite3Prepare +432 in /usr/lib/libsqlite3.dylib
[0x1a76f6c08] sqlite3LockAndPrepare +220 in /usr/lib/libsqlite3.dylib
[0x1a76f6f78] sqlite3Prepare +432 in /usr/lib/libsqlite3.dylib
[0x1a76f6c08] sqlite3LockAndPrepare +220 in /usr/lib/libsqlite3.dylib
[0x102234a8c] *SQLite3::Statement#initialize<SQLite3::Connection, String>:Nil +116 in /Users/bcardiff/.cache/crystal/crystal-run-server.tmp
bcardiff commented 1 year ago

So, sqlite3_config seems to be needed to be called on multi-thread. ref

fun config = sqlite3_config(Int32) : Int32
# One of these line should be added to the webserver
LibSQLite3.config(2) # MULTITHREAD
LibSQLite3.config(3) # SERIALIZED

With MULTITHREAD I am still seeing the same error yy_reduce (I'm using the native sqlite3 of osx /usr/lib/libsqlite3.dylib (compatibility version 9.0.0, current version 331.5.0), if I read things around it seems that MT support should be available there)

With SERIALIZED there error changed again. Among PoolRetryAttemptsExceeded and accept: Too many open files (Socket::Error) I got:

Unhandled exception: unable to close due to unfinalized statements or unfinished backups (SQLite3::Exception)
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'Exception::CallStack::unwind:Array(Pointer(Void))'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'Exception::CallStack#initialize:Array(Pointer(Void))'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'Exception::CallStack::new:Exception::CallStack'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'raise<SQLite3::Exception>:NoReturn'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'SQLite3::Connection#check<Int32>:Nil'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'SQLite3::Connection#do_close:Nil'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'DB::Connection+@DB::Disposable#close:(Bool | Nil)'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'DB::Pool(DB::Connection+)@DB::Pool(T)#close:Nil'
  from /Users/bcardiff/.cache/crystal/crystal-run-server.tmp in 'DB::Database#close:Nil

What hints me in this direction was https://github.com/xamarin/xamarin-macios/issues/5791#issuecomment-475443164

Could it be that after fixing the MT issues in crystal-db that we already discovered we are now missing some proper way of using sqlite3 in MT?

luislavena commented 1 year ago

Hello @bcardiff, thanks for your patience.

So, sqlite3_config seems to be needed to be called on multi-thread.

That is actually not necessary if the SQLite version you're using was already compiled with threadsafe mode on, see sqlite3_threadsafe() and from the original link you provided:

The default mode is serialized.

I check that compile time thread safety was on, and left the default. Then tested against the temp/disable-stmt-cache branch, the invalid memory issue is gone, without having to change the thread safety config.

$ shards install
Resolving dependencies
Fetching https://github.com/crystal-lang/crystal-sqlite3.git
Fetching https://github.com/crystal-lang/crystal-db.git
Using db (0.11.0 at 0897477)
Using sqlite3 (0.19.0 at d759ae0)

But when attempting to terminate the server, received a new exception:

Use Ctrl-C to stop
^CShutting down.
Shutdown completed.
Unhandled exception: unable to close due to unfinalized statements or unfinished backups (SQLite3::Exception)
  from bin/server in 'Exception::CallStack::unwind:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack#initialize:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack::new:Exception::CallStack'
  from bin/server in 'raise<SQLite3::Exception>:NoReturn'
  from bin/server in 'SQLite3::Connection#check<Int32>:Nil'
  from bin/server in 'SQLite3::Connection#do_close:Nil'
  from bin/server in 'DB::Connection+'
  from bin/server in 'DB::Pool(DB::Connection+)'
  from bin/server in 'DB::Database#close:Nil'
  from bin/server in '__crystal_main'
  from bin/server in 'Crystal::main_user_code<Int32, Pointer(Pointer(UInt8))>:Nil'
  from bin/server in 'Crystal::main<Int32, Pointer(Pointer(UInt8))>:Int32'
  from bin/server in 'main'
  from /lib/ld-musl-x86_64.so.1 in '??'

Around this line:

  def do_close
    super
    check LibSQLite3.close(self)
  end

Which might indicate that @db reference is no longer a valid reference. Same happened when doing LibSQLite3.config(3) to use the serialized mode (as the xamarin comment).

I also got the too many files open when trying to read the DWARF information to decode the backtrace in Ubuntu:

Unhandled exception in spawn: Error opening file with mode 'r': '/etc/localtime': Too many open files (File::Error)
Shutdown completed.
Unable to load dwarf information: Error opening file with mode 'r': '/home/luis/repos/github.com/luislavena/test-crystal-sqlite3-db/bin/server': Too many open files (File::Error)
Unhandled exception: unable to close due to unfinalized statements or unfinished backups (SQLite3::Exception)
  from bin/server in 'Exception::CallStack::unwind:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack#initialize:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack::new:Exception::CallStack'
  from bin/server in 'raise<File::Error+>:NoReturn'
  from bin/server in 'Crystal::System::File::open<String, String, File::Permissions>:Int32'
  from bin/server in 'File::new<String, String, File::Permissions, Nil, Nil>:File'
  from bin/server in 'Exception::CallStack::read_dwarf_sections<String, UInt64>:(Array(Tuple(UInt64, UInt64, String)) | Nil)'
  from bin/server in 'Exception::CallStack::load_debug_info_impl:Nil'
  from bin/server in 'Exception::CallStack::load_debug_info:Nil'
  from bin/server in 'Exception::CallStack::decode_line_number<UInt64>:Tuple(String, Int32, Int32)'
  from bin/server in 'Exception::CallStack#decode_backtrace:Array(String)'
  from bin/server in 'Exception::CallStack#printable_backtrace:Array(String)'
  from bin/server in 'Exception+'
  from bin/server in 'Exception+'
  from bin/server in 'Fiber#run:(IO::FileDescriptor | Nil)'
  from bin/server in '~proc2Proc(Fiber, (IO::FileDescriptor | Nil))'
  from ???
  from bin/server in 'Exception::CallStack::unwind:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack#initialize:Array(Pointer(Void))'
  from bin/server in 'Exception::CallStack::new:Exception::CallStack'
  from bin/server in 'raise<SQLite3::Exception>:NoReturn'
  from bin/server in 'SQLite3::Connection#check<Int32>:Nil'
  from bin/server in 'SQLite3::Connection#do_close:Nil'
  from bin/server in 'DB::Connection+'
  from bin/server in 'DB::Pool(DB::Connection+)'
  from bin/server in 'DB::Database#close:Nil'
  from bin/server in '__crystal_main'
  from bin/server in 'Crystal::main_user_code<Int32, Pointer(Pointer(UInt8))>:Nil'
  from bin/server in 'Crystal::main<Int32, Pointer(Pointer(UInt8))>:Int32'
  from bin/server in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from bin/server in '_start'
  from ???

But similar to previous case, all this is coming from SQLite3::Connection#do_close and the fail to pass the check.

Since I was already playing, decided to try something different: use existing fix-mt branches as we had before but no longer depend on DB's Pool and instead attempt an experiment using Syn::Pool:

$ shards install
Resolving dependencies
Fetching https://github.com/crystal-lang/crystal-sqlite3.git
Fetching https://github.com/ysbaddaden/syn.git
Fetching https://github.com/crystal-lang/crystal-db.git
Using db (0.11.0 at 76eba76)
Using sqlite3 (0.19.0 at d759ae0)
Using syn (0.1.0 at d87e275)

And it didn't crash with 5, 10 or even 50 concurrent connections 😁

Added an atomic counter to determine how many connections were used in those cases and managed to get a maximum of 12 connections (with nproc of 8 and CRYSTAL_WORKERS=7).

I'm wondering if it will be possible to test this against the dummy driver and see if the pool/cache is the culprit.

bcardiff commented 1 year ago

But when attempting to terminate the server, received a new exception: (..)

It seems we need to finalize statements before closing the connection. I thought that was optional since their docs says

The application must finalize every prepared statement in order to avoid resource leaks.

I'm a bit surprised this didn't came up before.

But if disabling the statement cache solved the memory issues then it seems we need to keep digging in that direction (and maybe allow disabling the statement cache 😅 which some might want to to reduce the memory footprint)

bcardiff commented 1 year ago

Update: we are closing all the statements when closing the connection.

https://github.com/crystal-lang/crystal-db/blob/master/src/db/connection.cr#L60

https://github.com/crystal-lang/crystal-sqlite3/blob/master/src/sqlite3/connection.cr#L33

Could it be that due to MT we are trying to use a connection that is being discarded? 🤔 💭

luislavena commented 1 year ago

I'm thinking that somehow the connections are getting GC'ed after the prepared methods that were for that connection were removed.

My gut check is that DB Prepared Pool (PoolPreparedStatement) is also keeping a reference to DB::Connection and for reason the db connection is GC'ed at some point before (perhaps PoolPreparedStatement is helping on that) and then when the DB pool is shutting down the connections, that connection is already gone.

Design question, the idea of having the pool of prepared of statements is to pick one from an available connection? Do we have an idea how much we are winning by this?

Thank you for your patience and effort on troubleshooting this ❤️ ❤️ ❤️

bcardiff commented 1 year ago

Hey @luislavena, @beta-ziliani found a missing synchronize call in the crystal-db patch. Would you mind checking again with

I think I got it. So far it hasn't crashed here. Can you try these crystal-db and crystal-sqlite3 changes:

🤞 for me it didn't crash

luislavena commented 1 year ago

@bcardiff @beta-ziliani I'm happy to inform that with these changes the explosions are gone! 👏🏽 🎉 🥳

I was able to perform a long running bench (30 mins at 4 and 8 workers) without leaks or exceptions!

Thank you for investigating on this! ❤️ ❤️ ❤️

luislavena commented 1 year ago

@bcardiff opened a separate issue to investigate some performance bottleneck (#89) but I consider this issue solved.

Thank you again you both! (@beta-ziliani too) 🙏🏽 ❤️ ❤️ ❤️

luislavena commented 1 year ago

@straight-shoota didn't want to close this until the changes mentioned (fix-mt branch) were merged, only https://github.com/crystal-lang/crystal-db/pull/178 was done at this time, but remains the other: https://github.com/crystal-lang/crystal-sqlite3/compare/fix-mt

bcardiff commented 1 year ago

The fix-mt branch was merged in #90. Closing again.

Thanks @luislavena for all the efforts through this ☺️