OttoAllmendinger opened this issue 5 years ago
Concept ACK for archival SQLite storage. Best would be an implementation so that future DBMS (e.g. PostgreSQL #12) can be treated differently.
Big question that remains:
Current data suggests we are looking at a minimum of ~20mb of daily uncompressed data.

Additionally I'd like to have a second look at that with INTEGER values for dates. I think the timestamps as TEXT account for a lot of incompressible data. https://github.com/0xB10C/bademeister-go/issues/6#issuecomment-547148606
While there is a significant reduction (12.5% with txids included and 50% with txids removed, measured at 800k transactions, https://github.com/0xB10C/bademeister-go/issues/6#issuecomment-547173130), some kind of database segmentation is still needed.
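For illustration, here is a minimal sketch of the INTEGER-based timestamp storage mentioned above. The table, column, and driver names are hypothetical, not the project's actual schema:

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumed driver; the project may use a different one
)

// Hypothetical sketch: store first_seen as a unix epoch INTEGER instead of an
// ISO-8601 TEXT string. SQLite stores an INTEGER value in at most 8 bytes,
// while a TEXT timestamp like "2019-10-28T21:15:03Z" takes ~20 bytes per row.
func insertFirstSeen(db *sql.DB, txid []byte, firstSeen time.Time) error {
	_, err := db.Exec(
		`INSERT INTO mempool_tx (txid, first_seen) VALUES (?, ?)`,
		txid, firstSeen.Unix(), // INTEGER column instead of TEXT
	)
	return err
}
```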
Best would be an implementation so that future DBMS (e.g. PostgreSQL #12) can be treated differently.
Agreed. However, I would argue a feature like this makes sense even if the primary storage method is not SQLite. Even with other backends it can be useful to periodically prune entries from the main database and move them into a backup file (ideally one that can be shared).
The project probably needs some import/export mechanism anyway for users who only want to analyze a subset of the collected data without monitoring the mempool themselves. I think a set of compressed SQLite databases would work pretty well for that. Querying over a series of compressed archives can indeed be a challenge, so maybe users need to import the archives into their main database first.
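To make that concrete, here is a rough sketch of the import side, assuming the archive has already been decompressed and using SQLite's ATTACH DATABASE. The driver and the table names are placeholders, not the project's actual code:

```go
package storage

import (
	"context"
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // assumed driver; the project may use a different one
)

// ImportArchive copies all rows from a (decompressed) archive database into
// the main database via SQLite's ATTACH DATABASE. Everything runs on a single
// pooled connection so the ATTACH stays visible to the following statements.
// Table names (block, confirmed_tx) are placeholders from this discussion.
func ImportArchive(ctx context.Context, mainPath, archivePath string) error {
	db, err := sql.Open("sqlite3", mainPath)
	if err != nil {
		return err
	}
	defer db.Close()

	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `ATTACH DATABASE ? AS archive`, archivePath); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, `DETACH DATABASE archive`)

	// INSERT OR IGNORE keeps the import idempotent if an archive is imported twice.
	for _, stmt := range []string{
		`INSERT OR IGNORE INTO block        SELECT * FROM archive.block`,
		`INSERT OR IGNORE INTO confirmed_tx SELECT * FROM archive.confirmed_tx`,
	} {
		if _, err := conn.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```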
Querying over a series of compressed archives can indeed be a challenge, so maybe users need to import the archives into their main database first.
Agree
Then let's go with SQL tables similar to

mempool_tx(txid, firstSeen, fee, size, output_amount)
confirmed_tx(height, index, firstSeen, fee, size, output_amount, seen_in_mempool)
block(height, hash, ...)

After x confirmations (e.g. 144, or maybe lower like 20) a tx is moved from mempool_tx to confirmed_tx. Every n blocks (e.g. block_height % 500 == 0) an export of the block + confirmed_tx tables in that range is generated and moved to a separate database file.
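A rough SQLite DDL sketch of those tables plus the move step follows. Column names, types, and units (amounts assumed to be in satoshis) are tentative suggestions, not a final schema:

```go
package storage

import "database/sql"

// Tentative DDL for the tables sketched above. Timestamps are unix epoch
// INTEGERs (see the earlier size discussion); amounts assumed to be satoshis.
const schema = `
CREATE TABLE IF NOT EXISTS mempool_tx (
	txid          BLOB PRIMARY KEY,
	first_seen    INTEGER NOT NULL,
	fee           INTEGER NOT NULL,
	size          INTEGER NOT NULL,
	output_amount INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS confirmed_tx (
	height          INTEGER NOT NULL,
	block_index     INTEGER NOT NULL,   -- position of the tx within its block
	txid            BLOB NOT NULL,
	first_seen      INTEGER,            -- NULL if never seen in the mempool
	fee             INTEGER,
	size            INTEGER,
	output_amount   INTEGER,
	seen_in_mempool INTEGER NOT NULL,   -- boolean 0/1
	PRIMARY KEY (height, block_index)
);

CREATE TABLE IF NOT EXISTS block (
	height INTEGER PRIMARY KEY,
	hash   BLOB NOT NULL
);`

// MoveToConfirmed sketches the "after x confirmations" step: given the txids
// of a block that has reached the confirmation threshold, copy the matching
// rows from mempool_tx into confirmed_tx and drop them from mempool_tx.
// Transactions never seen in the mempool would need separate handling.
func MoveToConfirmed(tx *sql.Tx, height int64, txids [][]byte) error {
	for i, txid := range txids {
		if _, err := tx.Exec(`
			INSERT INTO confirmed_tx
				(height, block_index, txid, first_seen, fee, size, output_amount, seen_in_mempool)
			SELECT ?, ?, txid, first_seen, fee, size, output_amount, 1
			FROM mempool_tx WHERE txid = ?`,
			height, i, txid); err != nil {
			return err
		}
		if _, err := tx.Exec(`DELETE FROM mempool_tx WHERE txid = ?`, txid); err != nil {
			return err
		}
	}
	return nil
}
```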
Current data suggests we are looking at a minimum of ~20mb of daily uncompressed data (~3mb lzma-compressed data). This is currently only counting the first_seen timestamp without any other metadata.

In order to avoid having a single file with unconstrained growth, we could write confirmed transactions into a separate database confirmed_$n.db (with a different $n for every 100 blocks or so) which can be compressed for long-term storage. This allows a simple way of sharing datasets by copying mempool.db.xz and the most recent confirmed_$n.db.xz.
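The periodic export could then look roughly like this (again a sketch under the assumptions above; xz compression of the resulting file would happen in a separate step, e.g. via the xz CLI or a Go package such as github.com/ulikunitz/xz):

```go
package storage

import (
	"context"
	"database/sql"
	"fmt"
)

// ExportRange sketches the "every n blocks" archival step: it creates a fresh
// confirmed_$n.db, copies the block and confirmed_tx rows for the height
// range into it, and removes them from the main database. All statements run
// on one connection so the ATTACH stays in effect. Heights are plain integers,
// so formatting them directly into the SQL avoids bound parameters inside DDL.
func ExportRange(ctx context.Context, db *sql.DB, archivePath string, fromHeight, toHeight int64) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, `ATTACH DATABASE ? AS archive`, archivePath); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, `DETACH DATABASE archive`)

	where := fmt.Sprintf("WHERE height BETWEEN %d AND %d", fromHeight, toHeight)
	for _, stmt := range []string{
		"CREATE TABLE archive.block        AS SELECT * FROM block " + where,
		"CREATE TABLE archive.confirmed_tx AS SELECT * FROM confirmed_tx " + where,
		"DELETE FROM block " + where,
		"DELETE FROM confirmed_tx " + where,
	} {
		if _, err := conn.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```

One caveat with a sketch like this: the copy and the delete are not atomic, so a crash in between could leave rows in both files. An idempotent import (INSERT OR IGNORE as above) would tolerate that overlap.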