Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[Bug] start_full_node process writes ~62GB per DAY to its sqlite DB files #11448

Closed telno closed 2 years ago

telno commented 2 years ago

What happened?

[Screenshot: 20200505 chia disk activity]

This screenshot covers 2d6h of uptime, though I've sampled multiple times over the last couple of weeks with similar results. This is everyday usage of my node; no DB migrations or other special scenarios going on.

The entire v2 database is 71GB, so almost the entire DB is being rewritten every day. Submitting for review as a possible bug in how the database is being accessed/updated.

Using fs_usage to get a breakdown of files written results in:

Version

1.3.4

What platform are you using?

macOS

What ui mode are you using?

GUI

Relevant log output

No response

jmhands commented 2 years ago
15:33:46 up 34 days, 14:31,  1 user,  load average: 1.23, 1.08, 1.02
5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
iostat -h /dev/nvme0n1
Linux 5.4.0-107-generic (wolf23085)     05/06/2022      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          24.1%    0.0%    0.7%    0.1%    0.0%   75.1%

      tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd Device
    67.49        84.3k       427.9k        68.5k     240.4G       1.2T     195.2G nvme0n1

This is a full node and a single wallet (one that has been dusted many times as well) over the last 34 days: 427.9 KiB/s written, or about 37.8 GB per day. This is normal wear for a database SSD and, as we have mentioned before, well within the normal range for consumer SSD endurance.
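
For reference, a quick check of that conversion (assuming iostat's kB columns are KiB, which is what makes the numbers line up with the figure above):

```python
# Convert iostat's average write rate to bytes written per day.
kib_per_sec = 427.9                          # avg kB_wrtn/s from iostat above
bytes_per_day = kib_per_sec * 1024 * 86_400  # KiB -> bytes, seconds per day
print(f"{bytes_per_day / 1e9:.1f} GB/day")   # -> 37.9 GB/day
```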

telno commented 2 years ago

Database is 72GB. Including yours, reports in the wild have ranged from 37.8GB to 62GB of writes per day, or roughly 50-86% of the entire database per day.

This is less a question of drive endurance than of implementation and scalability. A database that gets 50-86% of itself rewritten each day should be concerning whether or not you can personally justify the cost and impact to your own SSD hardware.

There is incremental work in adding blocks and coins to the DB and updating the DB indexes, but that does not explain writes this high. When the DB is 200GB, can we still expect daily writes of 50%+?

It appears the DB is being recalculated and/or reorganized nearly constantly, even though incremental blocks/coins/indexing should amount to far less data.

Other recent reports:

telno commented 2 years ago

Really appreciate you guys touching on this in today's AMA. Agreed with the point that the current IOPS is due to how SQLite internally processes indexes/etc. after Chia's insert/update calls; Chia's own demands are not causing the high IOPS.

There's an opportunity to reach out to the SQLite folks, though, as there are likely optimizations that could easily remedy the issue.

For example, in MS SQL Server (the only one I have personal experience with), the primary key affects the physical ordering of records on disk, so having a BLOB primary key that is itself a hash value would cause tremendous amounts of writes while re-ordering pages of data on disk.

The statement "that's just how databases are" is often because of incorrect assumptions when defining the table/index schemas.
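
For what it's worth, SQLite makes this same trade-off explicit. A minimal sketch (the coins_* tables and columns here are illustrative, not Chia's actual schema): an ordinary rowid table keeps rows in insert order and pushes the hash key into a side index, while a WITHOUT ROWID table is clustered on its primary key, so hash-valued keys scatter writes across the table b-tree itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rowid table: rows live in rowid (insert) order; the hash key is
# enforced by a separate unique index, so random-looking keys only
# churn index pages, not the table b-tree.
conn.execute("CREATE TABLE coins_rowid (name BLOB, amount INTEGER)")
conn.execute("CREATE UNIQUE INDEX coins_rowid_name ON coins_rowid (name)")

# WITHOUT ROWID table: the BLOB primary key determines physical order,
# so inserts with hash-valued keys land on effectively random pages of
# the table itself, i.e. the clustered-index effect described above.
conn.execute(
    "CREATE TABLE coins_clustered (name BLOB PRIMARY KEY, amount INTEGER) "
    "WITHOUT ROWID"
)
```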

emlowe commented 2 years ago

This is true; we aren't exactly deep SQL experts (as many community people will attest). Right now the writes are well within the capabilities of any consumer SSD. However, we are very aware this may not always be the case, and we continue to investigate improvements to the schema as well as alternatives to sqlite.

So we don't currently see this as a "bug" per se, but it is something we are aware of and hope to improve.

smasharov commented 2 years ago

In my opinion, the DB operations should be changed to reduce the number of IO operations without risking corruption on failure. The latest transactions could be kept in memory (with a configurable limit), and writes to the main DB postponed and applied in batches with fewer IO operations. It is not a problem if some of those transactions are lost on a failure; they would be easy to catch up later. But the number of writes to the main DB must be reduced.
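
A minimal sketch of the kind of batching being suggested here (the BatchedWriter class, the records table, and the batch size are all hypothetical, not Chia's actual code): buffer rows in memory and flush them in a single transaction, so many logical inserts share one commit's worth of journal/WAL IO.

```python
import sqlite3

BATCH_SIZE = 1000  # hypothetical configurable buffer limit

class BatchedWriter:
    """Buffer inserts in memory and flush them in one transaction."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.pending: list[tuple[bytes, int]] = []

    def add(self, name: bytes, amount: int) -> None:
        self.pending.append((name, amount))
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        with self.conn:  # one transaction -> one commit's worth of IO
            self.conn.executemany(
                "INSERT INTO records (name, amount) VALUES (?, ?)",
                self.pending,
            )
        self.pending.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name BLOB, amount INTEGER)")
writer = BatchedWriter(conn)
```

Anything still sitting in the buffer is lost on a crash, which is exactly the trade-off accepted above; a real node would need to re-sync the missing blocks from peers on restart.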

I lost 25% of my NVMe's endurance within a month; now I am using an HDD for the DB, but it is very slow and the node sometimes falls up to several hours behind.

github-actions[bot] commented 2 years ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

telno commented 2 years ago

Overall, it seems there is much room for optimizing the SQLite DB to minimize IOPS. If Chia remains committed to SQLite, then perhaps a formal review by a SQLite expert could address these inefficiencies.

Attaching insights gleaned from the official SQLite forum and the r/sqlite subreddit:

Importantly: given that the write volume is so dramatically larger than the actual incremental daily additions to the database, the majority of writes appear to be of a temporary nature. These are some of the sources of that temporary data:

  1. Inserts/updates/deletes cause at least a page's worth of data to be written twice, even if you only flip one bit in a single record. If the write touches indexed data, a page or more is also written twice per index. An insert into the coin_records table, for example, will write at least 10 pages of data (1 table page + 4 index pages, each written twice), or 40KB, for a ~100-byte record.
  2. There are many indexes, all of which must be kept in sync with the tables, and changes regularly cause an index to be rebalanced.
  3. Some of these indexes might not actually be used by the queries, and a wrong index can have a huge negative impact on performance. It might help to drop all indexes, then add them back one by one and measure their influence on a given query's performance (see the sketch after this list).
  4. Another cause of unnecessary writes (though not of this magnitude) could be unnecessary commits due to autocommit mode, which is the default in SQLite.
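
On point 3, SQLite can report which index (if any) a given query would use before anything is dropped. A quick check; the DB path, table, column, and index names below are my guesses and should be adjusted to the actual schema:

```python
import sqlite3

conn = sqlite3.connect("blockchain_v2_mainnet.sqlite")  # path illustrative

# Ask SQLite which index (if any) a hot query would actually use.
query = "SELECT * FROM coin_record WHERE puzzle_hash = ?"
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (b"\x00" * 32,)):
    print(row)
# Expect something like:
#   (3, 0, 0, 'SEARCH coin_record USING INDEX coin_puzzle_hash (puzzle_hash=?)')
```
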
github-actions[bot] commented 2 years ago

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

github-actions[bot] commented 2 years ago

This issue was automatically closed because it has been flagged as stale, and subsequently passed 7 days with no further activity from the submitter or watchers.

eliliam commented 1 year ago

I'd love an update here from the Chia team: how can we reduce the amount of transient modification happening in the DB? There has to be a better way to do this, since in theory I'd imagine the vast majority of operations are inserts, which just append to a table.

emlowe commented 1 year ago

There are currently no planned improvements.

andymarden commented 7 months ago

Why not? This is becoming a big issue. Surely moving even to MariaDB would be a step in the right direction? I guarantee this is due to the structure of the database in SQLite and the inefficiencies of dealing with it. Moving to a proper RDBMS would enable (a) better performance off the bat and (b) a better chance for the community to optimise the database.

Everyone seems obsessed with SSD wear, which is relevant to plotting but should be completely irrelevant to the database performance of farming/harvesting.

I have been running my full node on a Dell PowerEdge R610. It has a RAID memory cache, a 400GB SSD RAID cache, and 10k SAS drives. The dedicated LXC container (Proxmox) that Chia runs in has two CPU threads assigned and 4GB of memory. Of that, it uses about 10% CPU and 2GB of memory. Peanuts.

But when the full node is running, it accounts for 96% of IO.

This server hosts 15 other LXCs and 3 full-blown VMs: rsyslog running constantly, Prometheus, and a PostgreSQL LXC with about 8 services all using it to read and write data. OK, none of that is as intensive as Chia, but 96% of IO really is pitiful.

I don't understand the mental block. Do you want people to buy more hardware for the sake of it?