This appears to have been caused by me accidentally sending a very large number of SQL queries to the Electric migration proxy rather than to my Postgres database (where I intended to send them). Easy mistake to make... might be worth handling this case more gracefully.
@djbutler How did you get things back to normal?
I also encountered the same error while using the createMany function.
I ran the SQL given on the “Roadmap” page for deleting the database structures created by Electric. When you start the Electric service again, it re-creates the needed structures. The errors stopped. But unfortunately, when I re-electrified my tables, the data already in the tables did not sync, so that’s the next thing to debug. New data added to the tables syncs fine.
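For reference, the cleanup is roughly along these lines. This is only a sketch: the exact statements are on the “Roadmap” page, and the publication and schema names below are assumptions (the slot name pattern is the one mentioned later in this thread):

```sql
-- Sketch only: the authoritative statements are on the Roadmap page.
-- Drop Electric's replication slot (replace <db name> with your database):
SELECT pg_drop_replication_slot('electric_replication_out_<db name>');
-- Drop the publication Electric created (name is an assumption):
DROP PUBLICATION IF EXISTS electric_publication;
-- Drop Electric's bookkeeping schema (name is an assumption) and let the
-- service re-create everything on its next startup:
DROP SCHEMA IF EXISTS electric CASCADE;
```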
Also, if I didn’t mention it earlier, the root cause seems to be that I accidentally issued a ton of data-intensive queries to the migration proxy that were in fact intended for the underlying Postgres database. That messed up Electric.
Hey @djbutler. Thanks for reporting this!
By default, Postgres does not put a limit on replication slot sizes. Are you using a hosted database, or did you set the relevant configuration option in your Postgres config?
On the issue of disk usage growth caused by the replication slot Electric creates: this is, in fact, a hairy problem. You'll find more relevant info in https://github.com/electric-sql/electric/issues/1083. TL;DR: any write to Postgres will cause disk usage to grow, even if the write itself does not touch electrified tables. The next release of Electric will include a new configuration option to cap the amount of disk space Electric's replication slot can retain.
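If it helps while you debug, this standard query shows how much WAL each slot is currently holding back (pg_replication_slots and the LSN functions are stock Postgres):

```sql
-- How much WAL each replication slot is retaining, i.e. how much disk
-- Postgres cannot reclaim until the slot's consumer catches up.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```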
Thanks for the explanation. Yeah this would be great to fix, since it's come up a few times for me.
This db is self-hosted on EC2 - it's a Supabase docker image. I don't know if any of these look relevant:
| name | setting | unit | category | short_desc |
|---|---|---|---|---|
| auto_explain.log_wal | off | | Customized Options | Log WAL usage. |
| max_slot_wal_keep_size | 1024 | MB | Replication / Sending Servers | Sets the maximum WAL size that can be reserved by replication slots. |
| max_wal_senders | 10 | | Replication / Sending Servers | Sets the maximum number of simultaneously running WAL sender processes. |
| max_wal_size | 1024 | MB | Write-Ahead Log / Checkpoints | Sets the WAL size that triggers a checkpoint. |
| min_wal_size | 80 | MB | Write-Ahead Log / Checkpoints | Sets the minimum size to shrink the WAL to. |
| track_wal_io_timing | off | | Statistics / Cumulative Query and Index Statistics | Collects timing statistics for WAL I/O activity. |
| wal_block_size | 8192 | | Preset Options | Shows the block size in the write ahead log. |
| wal_buffers | 512 | 8kB | Write-Ahead Log / Settings | Sets the number of disk-page buffers in shared memory for WAL. |
| wal_compression | off | | Write-Ahead Log / Settings | Compresses full-page writes written in WAL file with specified method. |
| wal_consistency_checking | | | Developer Options | Sets the WAL resource managers for which WAL consistency checks are done. |
| wal_decode_buffer_size | 524288 | B | Write-Ahead Log / Recovery | Buffer size for reading ahead in the WAL during recovery. |
| wal_init_zero | on | | Write-Ahead Log / Settings | Writes zeroes to new WAL files before first use. |
| wal_keep_size | 0 | MB | Replication / Sending Servers | Sets the size of WAL files held for standby servers. |
| wal_level | logical | | Write-Ahead Log / Settings | Sets the level of information written to the WAL. |
| wal_log_hints | off | | Write-Ahead Log / Settings | Writes full pages to WAL when first modified after a checkpoint, even for a non-critical modification. |
| wal_receiver_create_temp_slot | off | | Replication / Standby Servers | Sets whether a WAL receiver should create a temporary replication slot if no permanent slot is configured. |
| wal_receiver_status_interval | 10 | s | Replication / Standby Servers | Sets the maximum interval between WAL receiver status reports to the sending server. |
| wal_receiver_timeout | 60000 | ms | Replication / Standby Servers | Sets the maximum wait time to receive data from the sending server. |
| wal_recycle | on | | Write-Ahead Log / Settings | Recycles WAL files by renaming them. |
| wal_retrieve_retry_interval | 5000 | ms | Replication / Standby Servers | Sets the time to wait before retrying to retrieve WAL after a failed attempt. |
| wal_segment_size | 16777216 | B | Preset Options | Shows the size of write ahead log segments. |
| wal_sender_timeout | 60000 | ms | Replication / Sending Servers | Sets the maximum time to wait for WAL replication. |
| wal_skip_threshold | 2048 | kB | Write-Ahead Log / Settings | Minimum size of new file to fsync instead of writing WAL. |
| wal_sync_method | fdatasync | | Write-Ahead Log / Settings | Selects the method used for forcing WAL updates to disk. |
| wal_writer_delay | 200 | ms | Write-Ahead Log / Settings | Time between WAL flushes performed in the WAL writer. |
| wal_writer_flush_after | 128 | 8kB | Write-Ahead Log / Settings | Amount of WAL written out by WAL writer that triggers a flush. |
```sql
ALTER SYSTEM SET max_slot_wal_keep_size = '5GB';
```

doesn't throw any errors, but also doesn't seem to have any effect. If you have any pointers for how to clear my WAL and increase the limit, that would be amazing; otherwise I'll just try to troubleshoot it myself.
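One thing worth checking here, as a sketch (standard Postgres behavior): ALTER SYSTEM only writes the new value to postgresql.auto.conf, so the running server keeps its old value until the configuration is re-read:

```sql
-- Re-read the configuration files (enough for reloadable parameters):
SELECT pg_reload_conf();

-- See what the running server is actually using:
SHOW max_slot_wal_keep_size;

-- pending_restart is true if the change still requires a full restart:
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';
```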
@djbutler Thanks for sharing your config!

The `max_slot_wal_keep_size` setting is the reason you got the error about the replication slot getting invalidated. By default, its value is set to `-1` in Postgres, so replication slots can retain as much WAL as there is available disk space. Changing the value of this setting won't have any effect until the Postgres instance is restarted, per the docs:

> This parameter can only be set in the postgresql.conf file or on the server command line.

Even then, I guess once a replication slot has been invalidated, it can no longer be used. You can drop Electric's slot with `SELECT pg_drop_replication_slot('electric_replication_out_<db name>')` and restart Electric to let it create a new one. You can also query the database for all replication slots to double-check that the name is right:
```sql
SELECT * FROM pg_replication_slots;
```
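On Postgres 13+ the `wal_status` column in that view also tells you directly whether a slot has been invalidated; for instance:

```sql
-- wal_status: 'reserved'/'extended' are healthy, 'unreserved' is at
-- risk, and 'lost' means the slot has been invalidated (PG 13+).
SELECT slot_name, active, wal_status
FROM pg_replication_slots;
```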
In the upcoming release of Electric we're introducing a configuration parameter to set the upper limit on WAL size that Electric can retain, see the preview here.
Thanks @alco, I appreciate the detailed explanation. One quick question about the preview you linked to - it says:
> Setting this to a low value may lead to clients having to discard their local copy of the server state and restart their replication state from scratch.
So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?
Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it? That would be pretty bad: it could lead to data loss. I assume it's just a performance issue: the client will have to download the server state again, but won't lose pending local transactions, right?
> So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?
Electric does not compute diffs currently. It can stream transactions from Postgres' WAL and, provided there's no gap between what it has already sent to a given client and what it can stream from Postgres, it will keep streaming new transactions to the client.
But if the replication slot is removed, Postgres will discard old WAL segments, so, upon creating a new replication slot, Electric will only be able to resume streaming from the latest state in Postgres. If it detects a gap between the last transaction the client received and the earliest transaction Electric can stream from Postgres, it will ask the client to reset its local state, forcing it to request the latest snapshot of the shape data from the database before going back to transaction streaming mode.
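To see the Postgres side of that gap, you can compare where the slot's consumer has confirmed up to with the server's current WAL position (an illustrative query using standard columns; `confirmed_flush_lsn` is populated for logical slots):

```sql
-- How far behind the slot's consumer (Electric) is, relative to the
-- server's current WAL position.
SELECT slot_name,
       confirmed_flush_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots;
```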
> Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it?
The idea is that the client's local updates will survive the reset. This is a gray area currently, since we don't yet have sufficient test coverage for this edge case, but we'll get there eventually.
Okay got it, thanks!
👋 We've been working for the last month on a rebuild of the Electric server over at a temporary repo: https://github.com/electric-sql/electric-next/
You can read more about why we made the decision at https://next.electric-sql.com/about
We're really excited about all the new possibilities the new server brings and we hope you'll check it out soon and give us your feedback.
We're now moving the temporary repo back here. As part of that migration we're closing all the old issues and PRs. We really appreciate you taking the time to investigate and report this issue!
My electric service crashed, and when I try to restart it with docker compose, I get this: