ACINQ / eclair

A Scala implementation of the Lightning Network.
Apache License 2.0

Replace string DB fields with more appropriate data types #2932

Open DerEwige opened 1 month ago

DerEwige commented 1 month ago

For example, you chose to store UUIDs in text fields, which makes them use 36 bytes instead of the 16 bytes the built-in UUID data type would use. If you revamped to use BIGINT where possible instead, you would go down to 8 bytes.

This would shrink the database (and the indexes in particular) and make the whole DB a lot faster.
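To illustrate the size difference, here is a minimal PostgreSQL sketch (the table names are made up for illustration, not eclair's actual schema):

-- a UUID rendered as text is 36 characters plus per-value overhead
CREATE TABLE ids_as_text   (id TEXT   NOT NULL);
-- the native UUID type is a fixed 16 bytes
CREATE TABLE ids_as_uuid   (id UUID   NOT NULL);
-- a 64-bit integer is a fixed 8 bytes
CREATE TABLE ids_as_bigint (id BIGINT NOT NULL);

-- pg_column_size() reports the per-value storage footprint of each representation
SELECT pg_column_size('f47ac10b-58cc-4372-a567-0e02b2c3d479'::text)   AS text_bytes,
       pg_column_size('f47ac10b-58cc-4372-a567-0e02b2c3d479'::uuid)   AS uuid_bytes,
       pg_column_size(1234567890123456789::bigint)                    AS bigint_bytes;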

t-bast commented 1 week ago

That choice was made because we currently favor usability/readability (when browsing the DB using standard DB tools) over performance. We've found that the performance hit wasn't noticeable, so we haven't changed this. We can change this in the future when performance starts becoming a real bottleneck, but that isn't a priority at all right now. Also, this should be motivated by reproducible benchmarks to ensure that the performance gain is worth the DB migration complexity.

DerEwige commented 1 week ago

There is only one table that is really a problem right now

htlc_info

At 250 million entries it has reached a size of 100 GB and the index is more than 25 GB (which is the main problem). Because of this the index does not fit into memory anymore, which causes huge performance problems.

Especially if a compaction is running while eclair is deleting rows from that table, the node suffers a heavy performance loss.

This is really the only table that causes problems and could benefit from some optimization.


t-bast commented 1 week ago

Gotcha, this is indeed the table that grows the most as HTLCs are processed. It gets better with splicing, which lets us remove old entries after a splice, but it's still the table that will grow the most.

It's a bit annoying to update, because if we want to change the channel_id field, we also have to change the channels DB since we reference it as a foreign key. Changing the payment_hash field would be trivial though. We can consider doing this if it's scoped to only those tables.

DerEwige commented 1 week ago

For channel_id, you could replace it with a field called channel_nr, which could just be a BIGINT, and then have a separate table where you map channel_id to channel_nr.

This makes the correlation a bit more complex (it needs 2 steps or a join).

But it could reduce the size of the table (space on disk) and also the index by a lot (just some ideas).
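A rough sketch of that idea (all names here are illustrative, not eclair's actual schema):

-- hypothetical mapping table: one row per channel
CREATE TABLE channel_nr_map (
    channel_nr BIGSERIAL PRIMARY KEY,
    channel_id TEXT NOT NULL UNIQUE
);

-- HTLC rows then carry the 8-byte channel_nr instead of the text channel_id
CREATE TABLE htlc_info_by_nr (
    channel_nr        BIGINT NOT NULL REFERENCES channel_nr_map(channel_nr),
    commitment_number BIGINT NOT NULL,
    payment_hash      TEXT   NOT NULL,
    cltv_expiry       BIGINT NOT NULL
);

-- resolving back to the original channel_id is the extra step mentioned above: a join
SELECT m.channel_id, h.commitment_number, h.payment_hash
FROM htlc_info_by_nr h
JOIN channel_nr_map m ON m.channel_nr = h.channel_nr;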

I'll try to do some performance tests on that with 50m+ inserts and deletes in the next few weeks and post results in here once I'm done.

DerEwige commented 1 week ago

I made some small tests and I think I found the main issue and a possible solution.

You are using a composite index of channel_id and payment_hash. I looked into the codebase, and in theory that is the best solution for the kind of queries you are making. (Two separate indexes would incur extra CPU cost when doing the queries.)

But the index gets so big that it consumes 30-50% of the total table space. This gets problematic once your table grows to several GB in size.

Composite index vs. 2 separate indexes: for small tables the difference is negligible, but it increases with larger tables. If the table gets too big and the index no longer fits into memory, the benefit of the composite index is far outweighed by the increased I/O needed to read the index from disk.
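To make the trade-off concrete, here is a small illustrative sketch (table and index names and the query shape are assumptions for illustration, not taken from the eclair codebase):

CREATE TABLE htlc_example (
    channel_id        TEXT   NOT NULL,
    commitment_number BIGINT NOT NULL,
    payment_hash      TEXT   NOT NULL
);

-- Option A: one composite index covering the whole lookup
CREATE INDEX htlc_example_composite_idx ON htlc_example(channel_id, commitment_number);

-- Option B: two single-column indexes; the planner can use one alone
-- or combine both via a bitmap AND, at some extra CPU cost per query
CREATE INDEX htlc_example_chan_idx   ON htlc_example(channel_id);
CREATE INDEX htlc_example_commit_idx ON htlc_example(commitment_number);

-- EXPLAIN shows which indexes the planner actually picks for a given query shape
EXPLAIN SELECT * FROM htlc_example
WHERE channel_id = 'abc' AND commitment_number >= 42;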

Proposed optimization: changing payment_hash from TEXT to BYTEA with a constraint on length 32, and splitting up the composite index into two separate indexes, resulted in the following:

Decrease of total table size by more than 50%
Decrease of index size by more than 85%

Small 100k table before optimization:
36 MB total, 17 MB index

Small 100k table after optimization:
17 MB total, 2.2 MB index

For my table this would mean a reduction of the index from 25 GB to about 3.5 GB.
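A minimal sketch of the payment_hash change being described, on a made-up table (not the actual migration):

-- the 32-byte payment hash stored as raw bytes (BYTEA) instead of a hex TEXT value,
-- with a constraint enforcing the expected length
CREATE TABLE htlc_info_bytea_example (
    channel_id        TEXT   NOT NULL,
    commitment_number BIGINT NOT NULL,
    payment_hash      BYTEA  NOT NULL CHECK (octet_length(payment_hash) = 32),
    cltv_expiry       BIGINT NOT NULL
);

-- a hex string converts to raw bytes with decode():
INSERT INTO htlc_info_bytea_example
VALUES ('some-channel-id', 1, decode(repeat('ab', 32), 'hex'), 800000);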

DerEwige commented 6 days ago

I have analyzed it some more:

You can use this SQL statement on your PG instance to verify my findings

SELECT relname AS table_name,
       schemaname AS schema_name,
       coalesce(seq_scan, 0) + coalesce(idx_scan, 0) AS total_selects,
       n_tup_ins AS inserts,
       n_tup_upd AS updates,
       n_tup_del AS deletes
FROM pg_stat_all_tables
WHERE schemaname = 'local'
ORDER BY (coalesce(seq_scan, 0) + coalesce(idx_scan, 0) + n_tup_ins + n_tup_upd + n_tup_del) DESC;

(screenshot of the query output)

It is basically only inserts and deletes on htlc_info; it is rarely selected from (I guess only on startup and on some other occasions).

So, turning the compound index into 2 separate indexes would not hurt performance at all, even when the compound index fits into memory.

The least invasive version would be to leave channel_id and payment_hash untouched and just delete the current compound index and create 2 separate indexes for channel_id and commitment_number.

This would lead to no changes in the code (for accessing the DB) and minimal DB migration.

This would shrink the total DB by about 45%, and the index would shrink by about 85%.

t-bast commented 5 days ago

Thanks for the investigation!

> The least invasive version would be to leave channel_id and payment_hash untouched and just delete the current compound index and create 2 separate indexes for channel_id and commitment_number.

Sounds acceptable, do you want to open a PR for that?

> This would lead to no changes in the code (for accessing the DB) and minimal DB migration.

Is the DB migration really minimal? Re-building two indices on the htlc_infos table requires reading the whole table, doesn't it? We'd need to check how long that takes on a large node such as ours.

DerEwige commented 5 days ago

> Sounds acceptable, do you want to open a PR for that?

Once the investigation is done, I would like to leave the implementation to you guys. My Scala is really lacking.

> Is the DB migration really minimal? Re-building two indices on the htlc_infos table requires reading the whole table, doesn't it? We'd need to check how long that takes on a large node such as ours.

I guess every change we do (changing the datatype of a field, changing an index) would require reading the whole table. So, there is no cheap solution here.

You are right, rebuilding the index will take some time.

It might be better to add the new indexes before dropping the old one. This would temporarily increase the table size by about 10%, but building the new indexes should be much faster, as their columns are included in the old index (I assume PG would make use of this).
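A sketch of that ordering, reusing the index names from the test commands further down and assuming the htlc_infos table lives in the local schema (on a busy live node, PostgreSQL's CREATE INDEX CONCURRENTLY could also be considered so writes are not blocked while the indexes are built):

-- build the new single-column indexes first
CREATE INDEX htlc_infos_idx1 ON local.htlc_infos(channel_id);
CREATE INDEX htlc_infos_idx2 ON local.htlc_infos(commitment_number);

-- only once they exist, drop the old composite index (dropping is cheap)
DROP INDEX IF EXISTS local.htlc_infos_idx;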

How big is your DB right now? I’ve managed to bring mine down to about 65 million entries by replacing a lot of older channels during 2 sat/vb times.

I could either make a copy of my DB or do a mockup DB with the same size to test the durations. I'll post results once I have them.

t-bast commented 5 days ago

> How big is your DB right now? I’ve managed to bring mine down to about 65 million entries by replacing a lot of older channels during 2 sat/vb times.

We "only* have 21 million rows, it's smaller than yours! I'm curious to know what the migration time is to re-create the index in that case.

DerEwige commented 5 days ago

OK, I will do a mockup DB with 20m entries and test it; it should take me about 2 hours to do multiple runs. (Inserting random data into a DB takes some time, even in 400k-entry batches.)

DerEwige commented 5 days ago

I did 4 runs in total.

TL;DR: about 90 seconds to create both new indexes.

DB server used: VPS t4g.medium (2 vCPU, 4 GB memory)
Storage: 3200 IOPS, 250 MB/s throughput
Synchronous streaming replication to a secondary server up and running

(I assume you have a somewhat stronger system.)

My live node uses the same DB server, so the test did not have exclusive access to it but shared its resources. I created a mockup table in a separate DB, using the same create statements for the table and index as you use.

Because of this setup there was quite some variation in the creation time for the new indexes.

I created 50 channels with 400,000 commits on each channel. (I could have randomized this, but it should not matter for the test.)

I'll post the slowest run here. After inserting 20m entries:

total size: 7106 MB
index size: 3471 MB
Insert duration: 571 seconds

Index creation duration: 94.4 seconds
Index drop duration: 0.12 seconds

Size after creating new indexes and dropping old index:

total size: 3917 MB
index size: 282 MB

Commands used to create and drop indexes:

CREATE INDEX htlc_infos_idx1 ON test_table_1(channel_id);           
CREATE INDEX htlc_infos_idx2 ON test_table_1(commitment_number);

DROP INDEX IF EXISTS htlc_infos_idx;

You can use this SELECT to check your sizes:

SELECT 
        pg_size_pretty(pg_total_relation_size('test_table_1')) AS total_size,
        pg_size_pretty(pg_indexes_size('test_table_1')) AS index_size,
        pg_size_pretty(pg_relation_size('test_table_1')) AS data_size;

(Note: this will show you the size on disk, not the live data size. If your table was bigger in the past and you never ran a VACUUM FULL, it will show a larger size.)
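For completeness: reclaiming that on-disk space after large deletions requires rewriting the table, e.g. with a VACUUM FULL (again assuming the htlc_infos table in the local schema); note that this takes an exclusive lock and blocks all access to the table while it runs:

-- rewrites the table and rebuilds its indexes into compact files, returning free space to the OS
VACUUM FULL local.htlc_infos;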

t-bast commented 4 days ago

Thanks for investigating this, I made the change in #2946.