SQLite is probably not the best choice for this kind of system due to the high number of writes.
In my experience and benchmarks, if you don't have a need for multiple nodes or long-running (write) transactions, SQLite3 will generally be faster than Postgres and the like, due to lack of client/server overhead. Most of the bad reputation is due to SQLite3 not having WAL on by default due to backward compat concerns.
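For reference, turning WAL on explicitly is a one-line change. A minimal sketch of the repo config (the journal_mode option name is an assumption about how the adapter exposes it; the underlying PRAGMA journal_mode=WAL is plain SQLite):

```elixir
# config/config.exs -- sketch only; MyApp.Repo is a placeholder and
# `journal_mode: :wal` is assumed to be forwarded to the connection,
# which has the same effect as `PRAGMA journal_mode=WAL`.
import Config

config :my_app, MyApp.Repo,
  database: "priv/my_app.db",
  journal_mode: :wal
```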
Thanks for this data. I have a feeling we are not cleaning up statement objects correctly in the exqlite driver, as I can't think of anything that could be causing a memory leak on the ecto adapter side.
It could be the prepared statements not having their destructors called when the reference count goes to zero. Very odd. I'll take a look later in the day but that's what I have a feeling is happening.
@kevinlang it could also be that we aren't freeing some binary data for SQL statements / results. I haven't noticed any crazy memory climb in my apps, but if @rupurt is issuing a ton of updates and selects, that could be the culprit.
EDIT: specifically the make_cell stuff, where we take the results from SQLite and turn them into data that Elixir / Erlang can understand.
That's a good point. Most of the anecdotes I read about GC issues in BEAM usually come down to large binaries causing an issue. If I have time I'll try to look into that.
@rupurt can you comment more on the "lock up" issues? I think that may be a separate issue. Is the VM itself locking up (due to memory issues) or just the database? Typically for the latter you should get an error message - what sort are you getting?
The most common cause of lock up is usually due to upgrading a READ transaction to a WRITE transaction. https://sqlite.org/isolation.html
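A rough illustration of that pattern with made-up module and schema names: the transaction below starts out holding only a read lock, and the update at the end forces an upgrade to a write lock, which is where "database is locked" / busy errors tend to surface when another writer is active.

```elixir
# Sketch only: MyApp.Repo, MyApp.Orders and MyApp.Order are placeholders.
defmodule MyApp.Orders do
  def cancel_order(order_id) do
    MyApp.Repo.transaction(fn ->
      # The transaction begins as a reader here...
      order = MyApp.Repo.get!(MyApp.Order, order_id)

      # ...and upgrades to a writer here; if another connection already holds
      # the write lock, this is where the busy/locked errors show up.
      order
      |> Ecto.Changeset.change(status: "canceled")
      |> MyApp.Repo.update!()
    end)
  end
end
```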
In my experience and benchmarks, if you don't have a need for multiple nodes or long-running (write) transactions, SQLite3 will generally be faster than Postgres and the like, due to lack of client/server overhead. Most of the bad reputation is due to SQLite3 not having WAL on by default due to backward compat concerns.
Cool. Thanks gents for looking into it so quickly. Interesting to know about the performance. I'm only basing that on what I hear from other folks, so that encourages me to keep pushing forward with the current strategy :)
@rupurt can you comment more on the "lock up" issues? I think that may be a separate issue. Is the VM itself locking up (due to memory issues) or just the database? Typically for the latter you should get an error message - what sort are you getting?
The "lock up" problem isn't a deadlock. I've run into deadlocks in the past but they were problems with my code.
The "lock up" behavior I'm seeing now is not a total VM lock. It's a connection lock. I can remote attach to the instance and issue commands in IEx, but I can't query anything from the SQLite DB. Everything is working fine for ~30 mins or so, memory climbs linearly and then eventually the connection just locks up.
I don't have any logs handy with the error message. But I think it eventually timed out after 15 seconds or so.
FWIW I'm running this on a GCP n1-standard-1 instance.
@rupurt do you have a chart of the number of queries executed along with the memory usage? I'm just wanting to get a sense of the volume.
@warmwaffles I don't. But that does sound handy so let me figure out if I can add one. I'll also post the logs for the timeout error on this next run.
This is the error I get:
22:39:22.294 [error] Exqlite.Connection (#PID<0.2507.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2515.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.2515.0> was at location:

    (stdlib 3.14) gen.erl:208: :gen.do_call/4
    (stdlib 3.14) gen_event.erl:282: :gen_event.rpc/2
    (logger 1.11.3) lib/logger/handler.ex:105: Logger.Handler.log/2
    (kernel 7.2.1) logger_backend.erl:51: :logger_backend.call_handlers/3
    (ecto_sql 3.6.1) lib/ecto/adapters/sql.ex:926: Ecto.Adapters.SQL.log/4
    (db_connection 2.4.0) lib/db_connection.ex:1460: DBConnection.log/5
    (ecto_sqlite3 0.5.5) lib/ecto/adapters/sqlite3/connection.ex:91: Ecto.Adapters.SQLite3.Connection.query/4
    (ecto_sql 3.6.1) lib/ecto/adapters/sql.ex:786: Ecto.Adapters.SQL.struct/10
I've been walking through the NIF code and can't spot any egregious leaks.
I wonder if the statements are never being freed because something is still holding a reference to them.
@rupurt would you mind checking out https://github.com/elixir-sqlite/exqlite/pull/155 and running that with your stack to see if the issue is still present?
I don't have a reliable way to reproduce the issue.
Shoot, sorry @warmwaffles. I ended up resolving this on my side. It was a bug in my code... :/
Haha, well @rupurt it actually sent me down a rabbit hole and I think this is more friendly now. Give the latest version a shot.
@warmwaffles @kevinlang Hi, I'm a little bit out of ideas, maybe you could help. In our project we have a process that reads 1000 entries from the SQLite DB, validates them, and then reads the next 1000. The DB can be really big, so the process may keep doing this for a long time.
Now K8s shows that our pod uses all of its memory (5 GB); I've attached a screenshot (pink line). But the Erlang node (recon_alloc) says it has allocated only 1.2 GB (usage 0.8). Is it possible that we have a memory leak in SQLite?
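In case it helps, those numbers come from recon_alloc, roughly like this (assuming the :recon dependency is in the release):

```elixir
# recon_alloc only covers the BEAM's own allocators; memory that a NIF such as
# SQLite requests directly with malloc is invisible here, which is one reason
# K8s and the Erlang node can disagree.
:recon_alloc.memory(:allocated)  # bytes the allocators hold from the OS (~1.2 GB above)
:recon_alloc.memory(:used)       # bytes actively in use
:recon_alloc.memory(:usage)      # used/allocated ratio (the "0.8" above)
```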
We’re using:
{:ecto_sqlite3, "~> 0.7.2"},
{:exqlite, "~> 0.9.3",
override: true,
system_env: [{"CFLAGS", "-c -O2 -DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=104857600"}]},
Repo Config:
database: "audit.db",
busy_timeout: 30_000,
timeout: 60_000,
after_connect_timeout: 60_000,
cache_size: -2000,
queue_target: 60_000,
queue_interval: 500,
log: false
When the pod runs out of memory (the limit is 5 GB), I see errors like this in the log:
"metadata": "module=DBConnection.Connection function=disconnect/2 line=148 pid=<0.3754.0> ", "message": "Exqlite.Connection (#PID<0.3754.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.8217.170> timed out because it queued and checked out the connection for longer than 60000ms
#PID<0.8217.170> was at location:
(exqlite 0.9.3) lib/exqlite/sqlite3.ex:89: Exqlite.Sqlite3.multi_step/3
(exqlite 0.9.3) lib/exqlite/sqlite3.ex:136: Exqlite.Sqlite3.fetch_all/4
(exqlite 0.9.3) lib/exqlite/connection.ex:566: Exqlite.Connection.get_rows/2
(exqlite 0.9.3) lib/exqlite/connection.ex:512: Exqlite.Connection.execute/4
(db_connection 2.4.1) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.4.1) lib/db_connection.ex:1333: DBConnection.run_execute/5
(db_connection 2.4.1) lib/db_connection.ex:1428: DBConnection.run/6
Thank you for your time.
The code that does the reading:
def select_page(query, page, limit) do
  page =
    query
    |> paginate(page, limit)
    |> all()

  {:ok, page}
end

defp paginate(query, page, limit) do
  Ecto.Query.from(query,
    limit: ^limit,
    offset: ^((page - 1) * limit)
  )
end
I don't know if kevin is going to be helping anymore, he's been pretty silent for the last few months. I am however going to take a look at this again.
Side note, this is news to me.
system_env: [{"CFLAGS", "-c -O2 -DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=104857600"}]
Does that actually set the environment values during compilation?
Thanks a lot. Could I help somehow to debug it?
About SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT: it seems so. We set it in mix.exs and in the Dockerfile like this: ENV CFLAGS="-c -O2 -DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=104857600", which caps the journal at 100 MB.
@rupurt do you recall how you resolved the memory leak issue on your side of things?
@lauragrechenko Are you doing any other large-ish queries to sqlite3? I wonder if this is tied to the timeout issues that @LostKobrakai was experiencing a few weeks back.
Also, would you be able to capture what is happening at this point?
Each node has its own "audit.db". Writes happen on all nodes: ~100 entries every 2 seconds. On the node you pointed to (in the picture above) I started a process that reads and validates 1000 entries at a time, and memory started growing immediately. So on the "pink" node, writes happen every 2 seconds and a read of 1000 entries happens every few seconds.
Just to clear up some potential confusion here: NIF memory usage is not tracked by the Erlang VM unless it is specifically hooked into it, so it's expected that Erlang doesn't report SQLite's memory usage. That does not, however, mean there's no memory leak.
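A quick way to see that gap, sketched for a Linux-like environment where ps is available:

```elixir
# Compare what the BEAM thinks it is using with what the OS sees for the same
# process. Memory allocated inside a NIF via plain malloc (SQLite's default)
# only shows up in the second number. Rough numbers; rss is reported in kB.
beam_mb = div(:erlang.memory(:total), 1_000_000)

{rss_kb, 0} = System.cmd("ps", ["-o", "rss=", "-p", System.pid()])
os_mb = rss_kb |> String.trim() |> String.to_integer() |> div(1_000)

IO.puts("BEAM-tracked: #{beam_mb} MB, OS RSS: #{os_mb} MB")
```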
I've seen some strange behaviour this week around a large delete query causing reboots in our Nerves system, though I'm not yet sure about the root cause.
Are those 100 entries being written in batches or sequentially? If they are happening sequentially, it may not be fulfilling all of the writes in time and slowly backing up.
Do you know how big the database is on each node?
For writes we use Ecto.Repo.insert_all/3. The limit for one batch is 1000 entries, but in the normal case we only have ~100 entries to write per batch.
But I can stop all writing so that only this one process is reading, and it's the same: memory grows. I also turned off the data validation, so it just reads 1000 entries, sleeps for 1 second, and reads the next batch.
Now the DB is ~5.5 GB
I need to look into having SQLite use a custom memory allocator that goes through Erlang's machinery so we can get some better telemetry on it, and add telemetry to exqlite in general.
It's reading 1,000,000 entries, 1000 at a time, with a 1-second pause between pages:
Enum.each(1..1000, fn page ->
  Process.sleep(1000)
  Repo.select_page(Audit.Schema, page, 1000)
end)
The Erlang node again shows "allocated memory" of ~525 MB, while K8s shows 1100 MB, and it doesn't go down.
@warmwaffles the memory leak from my initial post was 110% my own crappy application code :)
I was cancelling and creating many orders. The orders that were cancelled were in their final resting state. So I just cleared out the callbacks for those orders as they should never get executed.
I don't know if kevin is going to be helping anymore, he's been pretty silent for the last few months. I am however going to take a look at this again.
Side note, this is news to me.
system_env: [{"CFLAGS", "-c -O2 -DSQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=104857600"}]
Does that actually set the environment values during compilation?
Yes, it does. :) It's kind of hack-ish, but couldn't do it any different way.
Yes, it does. :) It's kind of hack-ish, but couldn't do it any different way.
This is a decent option that I'll need to add to the documentation so others can utilize it if they want to enable / disable features when compiling sqlite.
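Something along these lines, sketched from the snippet above (the FTS5 define is just an example of a compile-time feature toggle, not a recommendation):

```elixir
# mix.exs -- sketch of passing compile-time defines to the bundled SQLite via
# the dependency's system_env. Flags shown here are examples only.
defp deps do
  [
    {:ecto_sqlite3, "~> 0.7"},
    {:exqlite, "~> 0.9",
     override: true,
     system_env: [{"CFLAGS", "-O2 -DSQLITE_ENABLE_FTS5"}]}
  ]
end
```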
@lauragrechenko I haven't forgotten about this, I just did not have time this weekend to dig into it more. I'm going to build a benchmarking / load-testing suite soon to try and pinpoint the issue.
@warmwaffles I was about to write to you. Sorry for wasting your time; I think we can close the issue. I'm still testing, but I think we found an issue with how we were using SQLite's limit/offset: "by doing a query with an offset of 95000, all previous 95000 records are processed".
Maybe it'll help someone: https://stackoverflow.com/questions/12266025/sqlite-query-optimization-using-limit-and-offset
So now instead of
Ecto.Query.from(query, limit: ^limit, offset: ^((page - 1) * limit))
I tried
Ecto.Query.from(q in query, where: q.id >= ^from and q.id < ^till)
and it seems to be working just fine :)
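For completeness, a keyset-pagination sketch of that change (repo, module, and schema names are placeholders; it assumes a monotonically increasing integer id column):

```elixir
defmodule MyApp.Pagination do
  import Ecto.Query

  # Seek to the first row of the page via the id index instead of making
  # SQLite walk and discard `offset` rows on every page.
  def select_page_after(queryable, last_seen_id, page_size) do
    rows =
      queryable
      |> where([q], q.id > ^last_seen_id)
      |> order_by([q], asc: q.id)
      |> limit(^page_size)
      |> MyApp.Repo.all()

    next_cursor =
      case List.last(rows) do
        nil -> nil
        last -> last.id
      end

    {rows, next_cursor}
  end
end
```

Feeding next_cursor back into the next call keeps every page an index seek rather than an O(offset) scan.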
Heh, I still need to build a good benchmark and test suite along with adding better telemetry for memory usage and what not.
@lauragrechenko I've opened a PR here to utilize a custom memory allocator for sqlite. If you could give it a run in your environment for tests or something, feedback would be extremely welcome.
@warmwaffles Thanks, I'll try it today. We can still see memory growing, not as significantly as in the screenshot above, but still growing on all nodes, and the Erlang node still says it isn't allocating that much memory. We have other libraries with NIFs though, so I'm not sure yet what's going on.
@lauragrechenko I released v0.10.0, which has the custom allocator in place; Erlang VM memory usage will now include SQLite's usage.
@warmwaffles Thanks a lot. Yeah, we couldn't start it earlier, but today I took your PR and a fix from @laszlohegedus. I'm already running tests; it's only been 2 hours, so in a few more hours I can say something about the current memory usage.
@warmwaffles Hi, the memory looks just fine now. Thanks a lot
Howdy :wave:
I'm in the process of converting a market making trading system that creates many orders from ETS to Ecto. I plan on supporting any database that Ecto currently has support for, but I've started with SQLite because I want to be able to distribute it without any external dependencies.
SQLite is probably not the best choice for this kind of system due to the high number of writes, but I've pushed forward with the philosophy that if I can make it reasonably performant in SQLite it should be great with other DBs.
Everything has gone pretty well so far; however, now that I'm running it in production for long periods of time, I'm noticing severe performance degradation over time that causes a total lock in the SQLite DB. I've attached a graph of my telemetry output below, along with a graph of query times, in the hope that they're helpful.
The following SQL statements are executed regularly on the hot path:
They're issued from this Elixir module https://github.com/fremantle-industries/tai/blob/orders-ecto-repo/apps/tai/lib/tai/new_orders/services/apply_order_transition.ex