ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.81k stars 887 forks source link

autoclean performance #6745

Open michael1011 opened 11 months ago

michael1011 commented 11 months ago

The autoclean-once command is really slow and uses a lot of memory. Especially the memory consumption is concerning and could become a serious problem for bigger nodes.

image

We are running a CLN node with Postgres database and after a month of operation, I wanted to clean failed forwards from the database, because that data isn't relevant for us.

I ran autoclean-once and that command did take > 1 hour to run and allocated quite a bit of memory that wasn't released after the command was done. Also, it seems like autoclean removes rows from the database either individually or in very small batches, which is not optimal for nodes running with a synchronous replication Postgres database.

grubles commented 11 months ago

Can confirm the relatively high memory usage. This is with an sqlite database.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                          
 448811 nil       20   0 2619072   2.4g      0 S   6.3   3.9   1:29.69 autoclean 

autoclean-once did not take an hour+ in my case, though I do have a beefy machine.

"cleaned": 1802687,
michael1011 commented 11 months ago

though I do have a beefy machine.

So do we. But I think Postgres with sync replication adds lot of latency when you delete one entry, commit and then have to wait for the network to ensure that the replica is in sync. It's not bottlenecked by CPU, RAM or disk.

My personal CLN with SQLite was much faster doing an autoclean

vincenzopalazzo commented 11 months ago

Autoclen is @rustyrussell baby, maybe he has an idea on how to speed up this. I can take a look regarding the memory usage btw

rustyrussell commented 11 months ago

So, autoclean specifically does delete individually, but also disables transactions opportunistically (they'll still occur if other things need transactions, but if it's back-to-back autoclean they'll be merged). This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.

However, it will happily throw a million simultaneous delete commands at CLN, if it finds that many records! That's going to take a lot of memory. We should stage it into (say) no more than 1000 at once, then wait until those are processed. This will require a rework of the infrastructure inside the plugin, however.

michael1011 commented 11 months ago

This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.

I am not exactly sure what you mean by that. When I tell autoclean to for example delete all failed forwards older than one day, there is no control needed apart from WHERE state = ? AND resolved_time < ?.

However, it will happily throw a million simultaneous delete commands at CLN, if it finds that many records! That's going to take a lot of memory.

This implies that the memory consumption grows linearly with the number of records to delete, which could make it unusable for big nodes with lots of entries that aren't removed regularly

vincenzopalazzo commented 11 months ago

This implies that the memory consumption grows linearly with the number of records to delete, which could make it unusable for big nodes with lots of entries that aren't removed regularly

Correct but this is a bug of our implementation, it is not a limitation of the solution. We just should iterate over the entry by batch, so load the first 1k and run the code on it.

I will put it on my todo :)

whitslack commented 2 months ago

This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.

What criteria are you imagining that you can't code in the WHERE clause of a DELETE statement? Pulling every record from the database table, only to turn around and issue millions of individual DELETE statements, is simply not the way to do it.

which could make it unusable for big nodes with lots of entries that aren't removed regularly

Indeed. I cannot use the autoclean plugin because it OOMs my machine.

kilrau commented 2 months ago

Any updates? @vincenzopalazzo

michael1011 commented 2 months ago

Was this fixed by https://github.com/ElementsProject/lightning/pull/7298?