This appears to have been caused by me accidentally sending a very large number of SQL queries to the Electric migration proxy rather than to my Postgres database (where I intended to send them). Easy mistake to make... might be worth handling this case more gracefully.
@djbutler How did you get things back to normal?
I also encountered the same error while using the createMany function.
I ran the SQL given on the “Roadmap” page for deleting the database structures created by Electric. When you start the Electric service again, it re-creates the needed structures. The errors stopped. But unfortunately, when I re-electrified my tables, the data already in the tables did not sync, so that’s the next thing to debug. New data added to the tables syncs fine.
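For reference, the cleanup is roughly along these lines. This is only a sketch: the exact statements are on the “Roadmap” page, and the publication and schema names below are assumptions (the slot name pattern is the one mentioned later in this thread):

```sql
-- Sketch only: the authoritative statements are on the Roadmap page.
-- Drop Electric's replication slot (replace <db name> with your database):
SELECT pg_drop_replication_slot('electric_replication_out_<db name>');
-- Drop the publication Electric created (name is an assumption):
DROP PUBLICATION IF EXISTS electric_publication;
-- Drop Electric's bookkeeping schema (name is an assumption) and let the
-- service re-create everything on its next startup:
DROP SCHEMA IF EXISTS electric CASCADE;
```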
Also, if I didn’t mention it earlier, the root cause seems to be that I accidentally issued a ton of data-intensive queries to the migration proxy that were in fact intended for the underlying Postgres database. That messed up Electric.
Hey @djbutler. Thanks for reporting this!
By default, Postgres does not put a limit on replication slot sizes. Are you using a hosted database, or did you set the relevant configuration option in your Postgres config?
On the issue of disk usage growth caused by the replication slot Electric creates: this is, in fact, a hairy problem. You'll find more relevant info in https://github.com/electric-sql/electric/issues/1083. TL;DR: any write to Postgres will cause disk usage to grow, even if the write itself does not touch electrified tables. The next release of Electric will include a new configuration option to cap the amount of disk space Electric's replication slot can retain.
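If it helps while you debug, this standard query shows how much WAL each slot is currently holding back (pg_replication_slots and the LSN functions are stock Postgres):

```sql
-- How much WAL each replication slot is retaining, i.e. how much disk
-- Postgres cannot reclaim until the slot's consumer catches up.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```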
Thanks for the explanation. Yeah this would be great to fix, since it's come up a few times for me.
This db is self-hosted on EC2 - it's a Supabase docker image. I don't know if any of these look relevant:
| name | setting | unit | category | short_desc |
|---|---|---|---|---|
| auto_explain.log_wal | off | | Customized Options | Log WAL usage. |
| max_slot_wal_keep_size | 1024 | MB | Replication / Sending Servers | Sets the maximum WAL size that can be reserved by replication slots. |
| max_wal_senders | 10 | | Replication / Sending Servers | Sets the maximum number of simultaneously running WAL sender processes. |
| max_wal_size | 1024 | MB | Write-Ahead Log / Checkpoints | Sets the WAL size that triggers a checkpoint. |
| min_wal_size | 80 | MB | Write-Ahead Log / Checkpoints | Sets the minimum size to shrink the WAL to. |
| track_wal_io_timing | off | | Statistics / Cumulative Query and Index Statistics | Collects timing statistics for WAL I/O activity. |
| wal_block_size | 8192 | | Preset Options | Shows the block size in the write ahead log. |
| wal_buffers | 512 | 8kB | Write-Ahead Log / Settings | Sets the number of disk-page buffers in shared memory for WAL. |
| wal_compression | off | | Write-Ahead Log / Settings | Compresses full-page writes written in WAL file with specified method. |
| wal_consistency_checking | | | Developer Options | Sets the WAL resource managers for which WAL consistency checks are done. |
| wal_decode_buffer_size | 524288 | B | Write-Ahead Log / Recovery | Buffer size for reading ahead in the WAL during recovery. |
| wal_init_zero | on | | Write-Ahead Log / Settings | Writes zeroes to new WAL files before first use. |
| wal_keep_size | 0 | MB | Replication / Sending Servers | Sets the size of WAL files held for standby servers. |
| wal_level | logical | | Write-Ahead Log / Settings | Sets the level of information written to the WAL. |
| wal_log_hints | off | | Write-Ahead Log / Settings | Writes full pages to WAL when first modified after a checkpoint, even for a non-critical modification. |
| wal_receiver_create_temp_slot | off | | Replication / Standby Servers | Sets whether a WAL receiver should create a temporary replication slot if no permanent slot is configured. |
| wal_receiver_status_interval | 10 | s | Replication / Standby Servers | Sets the maximum interval between WAL receiver status reports to the sending server. |
| wal_receiver_timeout | 60000 | ms | Replication / Standby Servers | Sets the maximum wait time to receive data from the sending server. |
| wal_recycle | on | | Write-Ahead Log / Settings | Recycles WAL files by renaming them. |
| wal_retrieve_retry_interval | 5000 | ms | Replication / Standby Servers | Sets the time to wait before retrying to retrieve WAL after a failed attempt. |
| wal_segment_size | 16777216 | B | Preset Options | Shows the size of write ahead log segments. |
| wal_sender_timeout | 60000 | ms | Replication / Sending Servers | Sets the maximum time to wait for WAL replication. |
| wal_skip_threshold | 2048 | kB | Write-Ahead Log / Settings | Minimum size of new file to fsync instead of writing WAL. |
| wal_sync_method | fdatasync | | Write-Ahead Log / Settings | Selects the method used for forcing WAL updates to disk. |
| wal_writer_delay | 200 | ms | Write-Ahead Log / Settings | Time between WAL flushes performed in the WAL writer. |
| wal_writer_flush_after | 128 | 8kB | Write-Ahead Log / Settings | Amount of WAL written out by WAL writer that triggers a flush. |
```sql
ALTER SYSTEM SET max_slot_wal_keep_size = '5GB';
```

doesn't throw any errors, but also doesn't seem to have any effect. If you have any pointers for how to clear my WAL and increase the limit, that would be amazing; otherwise I'll just try to troubleshoot it myself.
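One thing worth checking here, as a sketch (standard Postgres behavior): ALTER SYSTEM only writes the new value to postgresql.auto.conf, so the running server keeps its old value until the configuration is re-read:

```sql
-- Re-read the configuration files (enough for reloadable parameters):
SELECT pg_reload_conf();

-- See what the running server is actually using:
SHOW max_slot_wal_keep_size;

-- pending_restart is true if the change still requires a full restart:
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';
```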
@djbutler Thanks for sharing your config!

The `max_slot_wal_keep_size` setting is the reason you got the error about the replication slot getting invalidated. By default, its value is set to `-1` in Postgres, so replication slots can retain as much WAL as there is available disk space. Changing the value of this setting won't have any effect until the Postgres instance is restarted, per the docs:

> This parameter can only be set in the postgresql.conf file or on the server command line.

Even then, I guess once a replication slot has been invalidated, it can no longer be used. You can drop Electric's slot with `SELECT pg_drop_replication_slot('electric_replication_out_<db name>')` and restart Electric to let it create a new one. You can also query the database for all replication slots to double-check that the name is right:
```sql
SELECT * FROM pg_replication_slots;
```
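On Postgres 13+ the `wal_status` column in that view also tells you directly whether a slot has been invalidated; for instance:

```sql
-- wal_status: 'reserved'/'extended' are healthy, 'unreserved' is at
-- risk, and 'lost' means the slot has been invalidated (PG 13+).
SELECT slot_name, active, wal_status
FROM pg_replication_slots;
```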
In the upcoming release of Electric we're introducing a configuration parameter to set the upper limit on WAL size that Electric can retain, see the preview here.
Thanks @alco, I appreciate the detailed explanation. One quick question about the preview you linked to - it says:
> Setting this to a low value may lead to clients having to discard their local copy of the server state and restart their replication state from scratch.
So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?
Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it? That would be pretty bad: it could lead to data loss. I assume it's just a performance issue: the client will have to download the server state again, but won't lose pending local transactions, right?
> So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?
Electric does not compute diffs currently. It can stream transactions from Postgres' WAL and, provided there's no gap between what it has already sent to a given client and what it can stream from Postgres, it will keep streaming new transactions to the client.
But if the replication slot is removed, Postgres will discard old WAL segments, so, upon creating a new replication slot, Electric will only be able to resume streaming from the latest state in Postgres. If it detects a gap between the last transaction the client received and the earliest transaction Electric can stream from Postgres, it will ask the client to reset its local state, forcing it to request the latest snapshot of the shape data from the database before going back to transaction streaming mode.
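To see the Postgres side of that gap, you can compare where the slot's consumer has confirmed up to with the server's current WAL position (an illustrative query using standard columns; `confirmed_flush_lsn` is populated for logical slots):

```sql
-- How far behind the slot's consumer (Electric) is, relative to the
-- server's current WAL position.
SELECT slot_name,
       confirmed_flush_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots;
```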
> Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it?
The idea is that the client's local updates will survive the reset. This is a gray area currently, since we don't yet have sufficient test coverage for this edge case, but we'll get there eventually.
Okay got it, thanks!
👋 We've been working for the last month on a rebuild of the Electric server over at a temporary repo: https://github.com/electric-sql/electric-next/
You can read more about why we made the decision at https://next.electric-sql.com/about
We're really excited about all the new possibilities the new server brings and we hope you'll check it out soon and give us your feedback.
We're now moving the temporary repo back here. As part of that migration we're closing all the old issues and PRs. We really appreciate you taking the time to investigate and report this issue!
My electric service crashed, and when I try to restart it with docker compose, I get this: