IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL
Apache License 2.0

Proposal for cardano-db-sync: Consider Migration to ClickHouse for Enhanced Performance and Scalability #1586

Open sunapi386 opened 9 months ago

sunapi386 commented 9 months ago

Summary

I have noticed that the cardano-db-sync database (dbsync) has grown significantly in size, currently occupying around 450GB (as of the end of 2022). Queries show noticeable latency, on the order of seconds to tens of seconds. In the interest of future scalability and performance, particularly with respect to disk space usage and query efficiency, I am suggesting that cardano-db-sync consider switching to a columnar database.
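
For a concrete example of the access pattern involved, here is a minimal sketch (Python with psycopg2, placeholder connection string) of the kind of whole-table aggregate that gets slow as row-oriented tables grow; the tx/block table and column names follow the db-sync schema as I understand it:

```python
# Minimal sketch: an epoch-level aggregate over the tx table, the sort of
# analytical query that tends to be slow on a row store at this data volume.
# The DSN is a placeholder; tx/block column names follow the db-sync schema
# as I understand it.
import psycopg2

conn = psycopg2.connect("dbname=cexplorer user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Full scan + aggregate: exactly the workload a columnar engine targets.
    cur.execute(
        """
        SELECT b.epoch_no, SUM(t.fee) AS total_fees, COUNT(*) AS tx_count
        FROM tx t
        JOIN block b ON b.id = t.block_id
        GROUP BY b.epoch_no
        ORDER BY b.epoch_no
        """
    )
    for epoch_no, total_fees, tx_count in cur.fetchall():
        print(epoch_no, total_fees, tx_count)
```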

ClickHouse

I've had good experience with ClickHouse and I'd recommend taking a look at it. Given the columnar nature of ClickHouse and its performance advantages for certain types of queries and data analytics, I am suggesting a migration to a ClickHouse-based solution.
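
To make the idea concrete, here is a rough sketch of what a columnar layout could look like in ClickHouse, using the community clickhouse-driver package. The schema below is hypothetical and only mirrors a few db-sync columns; it is not a worked migration plan.

```python
# Hypothetical columnar layout for a tx-like table in ClickHouse.
# Uses the community clickhouse-driver package; host is a placeholder.
from clickhouse_driver import Client

client = Client(host="localhost")  # placeholder host

client.execute(
    """
    CREATE TABLE IF NOT EXISTS tx
    (
        hash     FixedString(32),
        block_no UInt64,
        epoch_no UInt32,
        fee      UInt64,
        out_sum  UInt64
    )
    ENGINE = MergeTree
    PARTITION BY intDiv(block_no, 100000)  -- coarse partitions by block range
    ORDER BY (epoch_no, block_no)
    """
)

# The same epoch-level aggregate as above, now reading only the columns it needs.
rows = client.execute(
    "SELECT epoch_no, sum(fee), count() FROM tx GROUP BY epoch_no ORDER BY epoch_no"
)
```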

Call for comments

Cmdv commented 7 months ago

@sunapi386 you are right that the database is growing rapidly in size. I'll attempt to answer some of your concerns as best I can.

From my understanding, one of the limitations of columnar tables is their limited support for UPDATE and DELETE operations. Two key areas where these operations are used are the initial sync period, while cardano-db-sync is catching up to the tip of the chain, and the handling of rollbacks.

Now admittedly rollbacks have a limit on how far back they can go, so in theory we could use columnar storage for everything older than the rollback limit. Given our internal structure this would require quite the effort, and we would only be able to tell whether it worked on completion. (I should add this is a high-level observation from doing research (here, here) about columnar storage, as I'm not a DB expert.)
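
As a sketch of what that could mean in practice, and assuming the hypothetical partitioned table from the snippet above where only the partitions near the tip are ever touched, a rollback could become a bulk delete rather than row-level updates:

```python
# Sketch only: handling a rollback in a columnar store without row-level
# UPDATEs, assuming the hypothetical ClickHouse tx table from the earlier
# snippet. ClickHouse mutations are asynchronous and heavyweight, which is
# why keeping the mutable window (blocks within the rollback limit) small
# would matter.
from clickhouse_driver import Client

client = Client(host="localhost")  # placeholder host

def rollback_to(block_no: int) -> None:
    # Bulk-delete everything past the rollback point.
    client.execute(f"ALTER TABLE tx DELETE WHERE block_no > {int(block_no)}")

rollback_to(9_000_000)  # hypothetical block number
```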

Are there recommended best practices or configurations, such as compression, for optimizing the cardano-db-sync PostgreSQL database to handle large datasets more efficiently?

We have been working on performance improvements in both syncing and querying. One key area has been simply not storing data that isn't needed for the majority of queries / use cases. We provide this via command-line flags (soon to be moved into a config file).
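
As a starting point, a quick way to see which tables account for most of that 450GB (and therefore where trimming stored data would help most) is to ask Postgres directly. A minimal sketch, with a placeholder connection string:

```python
# Quick diagnostic, not an official tuning guide: list the largest relations
# in the db-sync database using standard Postgres catalog functions.
import psycopg2

conn = psycopg2.connect("dbname=cexplorer user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_catalog.pg_statio_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10
        """
    )
    for relname, total_size in cur.fetchall():
        print(f"{relname:30s} {total_size}")
```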

Taking a look at the documentation for those flags, one can see how the amount of data stored is decreased, which helps syncing speeds. Would you be able to give example queries that are giving you these performance issues?

Ultimately, I agree that we should start looking into a longer-term solution for the fact that the database is only going to get larger and thus slow things down, and this is a good place to start that conversation.

Are any efforts underway or planned for supporting alternative database backends like ClickHouse for cardano-db-sync?

Due to the current nature of our codebase, we are tied to PostgreSQL by the libraries we use. Though from my understanding the queries shouldn't need to change, so one could start from scratch, install a PostgreSQL database, and install the Citus extension. But my feeling is that syncing times would be extremely long. We do provide snapshots, but obviously these are for the current implementation, so they wouldn't work with columnar tables.
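
For anyone wanting to experiment with that route, here is a minimal sketch (placeholder connection string again) of converting one large, append-heavy table such as tx_out to Citus columnar storage in place. This assumes the Citus extension is available on the server and is only a sketch, not something we have validated against a synced db-sync instance.

```python
# Minimal sketch of the "PostgreSQL + Citus columnar" route: keep Postgres,
# so existing queries stay valid, but switch a large append-heavy table to
# columnar storage. Assumes the Citus extension is installed on the server.
import psycopg2

conn = psycopg2.connect("dbname=cexplorer user=postgres")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS citus")
    # Citus 10+ provides alter_table_set_access_method for converting an
    # existing heap table to the columnar access method in place.
    cur.execute("SELECT alter_table_set_access_method('tx_out', 'columnar')")
```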