0xB10C / bademeister-go

Implementation of Bademeister, a mempool watcher and recorder, in Golang.

Database segmentation #13

Open · OttoAllmendinger opened this issue 5 years ago

OttoAllmendinger commented 5 years ago

Current data suggests we are looking at a minimum of ~20mb of daily uncompressed data (~3mb lzma-compressed). This currently counts only the first_seen timestamp, without any other metadata.
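For a rough plausibility check (the figures below are assumptions, not measurements from this repository): at roughly 350k transactions entering the mempool per day and about 60 bytes per stored row, the daily volume lands in the same ballpark.

```go
package main

import "fmt"

func main() {
	// Assumed figures for a back-of-envelope estimate, not measured values:
	// ~350k transactions per day and ~60 bytes per row (32-byte txid,
	// first_seen timestamp, plus SQLite row and index overhead).
	const txPerDay = 350_000
	const bytesPerRow = 60

	fmt.Printf("~%.0f MB of uncompressed data per day\n",
		float64(txPerDay*bytesPerRow)/1e6)
	// Output: ~21 MB of uncompressed data per day
}
```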

In order to avoid having a single file with unconstrained growth, we could write confirmed transactions into a separate database confirmed_$n.db (with a different $n for every 100 blocks or so), which can be compressed for long-term storage.
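A minimal sketch of the naming scheme (the segment size and the confirmedDBPath name are illustrative choices, not settled decisions):

```go
package main

import "fmt"

// segmentSize is the number of blocks covered by one confirmed_$n.db file;
// 100 matches the suggestion above but is not a fixed decision.
const segmentSize = 100

// confirmedDBPath maps a block height to the database file that would hold
// its confirmed transactions, e.g. height 612345 -> "confirmed_6123.db".
func confirmedDBPath(blockHeight int) string {
	return fmt.Sprintf("confirmed_%d.db", blockHeight/segmentSize)
}

func main() {
	fmt.Println(confirmedDBPath(612345)) // confirmed_6123.db
}
```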

This also gives a simple way of sharing datasets: copy mempool.db.xz and the most recent confirmed_$n.db.xz.

0xB10C commented 5 years ago

Concept ACK for archival SQLite storage. Ideally the implementation would be structured so that a future DBMS (e.g. PostgreSQL, #12) can be treated differently.

Big question that remains:

0xB10C commented 5 years ago

Current data suggests we are looking at a minimum of ~20mb of daily uncompressed data

Additionally I'd like to have a second look at that with INTEGER values for dates. I think the timestamps stored as TEXT account for a lot of hard-to-compress data. https://github.com/0xB10C/bademeister-go/issues/6#issuecomment-547148606
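As an illustration (not the repository's actual schema): a Unix-epoch INTEGER needs at most 8 bytes per value, while an ISO-8601 TEXT timestamp takes 19 bytes or more, and a raw BLOB txid halves the 64-character hex form. A minimal sketch using the mattn/go-sqlite3 driver:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
	db, err := sql.Open("sqlite3", "mempool.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Illustrative table, not the project's actual schema: first_seen is a
	// Unix timestamp (INTEGER) rather than TEXT, and the txid is stored as
	// 32 raw bytes rather than 64 hex characters.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS mempool_tx (
		txid       BLOB PRIMARY KEY,
		first_seen INTEGER NOT NULL -- seconds since the Unix epoch
	)`)
	if err != nil {
		log.Fatal(err)
	}

	txid := make([]byte, 32) // placeholder txid
	_, err = db.Exec(`INSERT OR IGNORE INTO mempool_tx (txid, first_seen) VALUES (?, ?)`,
		txid, time.Now().Unix())
	if err != nil {
		log.Fatal(err)
	}
}
```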

0xB10C commented 5 years ago

While there is a significant reduction (12.5% with txids kept and 50% with txids removed, measured at 800k transactions: https://github.com/0xB10C/bademeister-go/issues/6#issuecomment-547173130), some kind of database segmentation is still needed.

OttoAllmendinger commented 5 years ago

Ideally the implementation would be structured so that a future DBMS (e.g. PostgreSQL, #12) can be treated differently.

Agreed. However, I would argue that a feature like this makes sense even if the primary storage backend is not SQLite. Even with other backends it can be useful to periodically prune entries from the main database and move them into a backup file (ideally one that can be shared).

The project probably needs some import/export mechanism anyway for users who only want to analyze a subset of the collected data without monitoring the mempool themselves. I think a set of compressed SQLite databases would work pretty well for that. Querying over a series of compressed archives can indeed be a challenge, so maybe users need to import the archives into their main database first.
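One way to do the prune-and-archive step with SQLite is to ATTACH the archive file and move rows over a single pinned connection. A rough sketch, assuming a confirmed_tx table with a block_height column (names are illustrative, not the project's schema):

```go
package storage

import (
	"context"
	"database/sql"
)

// archiveConfirmed moves confirmed transactions for blocks in
// [startHeight, endHeight) from the main database into archivePath.
// Table and column names are illustrative, not the project's schema.
func archiveConfirmed(ctx context.Context, db *sql.DB, archivePath string, startHeight, endHeight int) error {
	// ATTACH is per-connection, so pin one connection for the whole operation.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `ATTACH DATABASE ? AS archive`, archivePath); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, `DETACH DATABASE archive`)

	tx, err := conn.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Create an empty copy of the table in the archive file. Note that
	// CREATE TABLE ... AS SELECT does not copy indexes or constraints; a
	// real implementation would create the archive schema explicitly.
	if _, err := tx.ExecContext(ctx,
		`CREATE TABLE IF NOT EXISTS archive.confirmed_tx AS
		   SELECT * FROM confirmed_tx WHERE 0`); err != nil {
		return err
	}
	// Copy the rows for the requested block range ...
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO archive.confirmed_tx
		   SELECT * FROM confirmed_tx WHERE block_height >= ? AND block_height < ?`,
		startHeight, endHeight); err != nil {
		return err
	}
	// ... and prune them from the main database.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM confirmed_tx WHERE block_height >= ? AND block_height < ?`,
		startHeight, endHeight); err != nil {
		return err
	}
	return tx.Commit()
}
```

The resulting archive file could then be xz-compressed for long-term storage or sharing.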

0xB10C commented 5 years ago

Querying over a series of compressed archives can indeed be a challenge, so maybe users need to import the archives into their main database first.

Agree

0xB10C commented 5 years ago

Then let's go with SQL tables along the lines of the sketch below.
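A minimal sketch of what such tables might look like; column names and types are illustrative assumptions, with a confirmed_in column so the watcher can record the confirming block:

```go
package storage

// schema is an illustrative sketch of the tables discussed in this issue;
// column names and types are assumptions, not the project's actual schema.
const schema = `
CREATE TABLE IF NOT EXISTS mempool_tx (
    txid         BLOB PRIMARY KEY,  -- 32-byte transaction id
    first_seen   INTEGER NOT NULL,  -- Unix timestamp (see #6)
    confirmed_in INTEGER            -- height of the confirming block, NULL while unconfirmed
);

CREATE TABLE IF NOT EXISTS block (
    height  INTEGER PRIMARY KEY,
    hash    BLOB NOT NULL,
    seen_at INTEGER NOT NULL        -- Unix timestamp
);

CREATE TABLE IF NOT EXISTS confirmed_tx (
    txid         BLOB PRIMARY KEY,
    first_seen   INTEGER NOT NULL,
    block_height INTEGER NOT NULL REFERENCES block(height)
);
`
```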

After x confirmations (e.g. 144, or maybe lower like 20) a tx is moved from mempool_tx to confirmed_tx. Every n blocks (e.g. block_height % 500 == 0) an export of the block + confirmed_tx tables in that range is generated and moved to a separate database file.
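A rough sketch of that flow against the schema above, with the export step left as a pluggable function (for example an ATTACH-based helper like the archiveConfirmed sketch earlier in the thread); all names and the concrete thresholds are placeholders:

```go
package storage

import (
	"context"
	"database/sql"
	"fmt"
)

// exportFunc moves a finished range of the block and confirmed_tx tables
// into its own database file, e.g. via an ATTACH-based helper.
type exportFunc func(ctx context.Context, archivePath string, startHeight, endHeight int) error

// onBlock sketches the flow described above: transactions whose confirming
// block is at least `confirmations` deep are moved from mempool_tx to
// confirmed_tx, and every exportInterval blocks the finished range is
// exported to a separate database file.
func onBlock(ctx context.Context, db *sql.DB, tipHeight int, export exportFunc) error {
	const confirmations = 20   // "x confirmations" from above
	const exportInterval = 500 // "every n blocks"

	cutoff := tipHeight - confirmations

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx, `
		INSERT OR IGNORE INTO confirmed_tx (txid, first_seen, block_height)
		SELECT txid, first_seen, confirmed_in FROM mempool_tx
		 WHERE confirmed_in IS NOT NULL AND confirmed_in <= ?`, cutoff); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, `
		DELETE FROM mempool_tx
		 WHERE confirmed_in IS NOT NULL AND confirmed_in <= ?`, cutoff); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}

	// Export the completed segment, e.g. heights [tipHeight-500, tipHeight).
	if tipHeight%exportInterval == 0 {
		start := tipHeight - exportInterval
		return export(ctx, fmt.Sprintf("confirmed_%d.db", start/exportInterval), start, tipHeight)
	}
	return nil
}
```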