0xB10C / bademeister-go

Implementation of Bademeister, a mempool watcher and recorder, in Golang.
0 stars 3 forks source link

Add sqlite storage backend #6

Open OttoAllmendinger opened 5 years ago

OttoAllmendinger commented 5 years ago

My suggestion for the high-level interface:

0xB10C commented 5 years ago

I'd propose a slight modification of the above/further detail the spec:

OttoAllmendinger commented 5 years ago

I deliberately described the data fields instead of a struct here so we get an idea of what is actually stored. We can have a convenience AddTransaction() method that accepts the wire type. However we should be careful that we don't mix it up with the Transaction type that will result from reading from our storage, which will be very minimal.

OttoAllmendinger commented 5 years ago

Implemented in #14

Todo:

0xB10C commented 5 years ago

The firstSeen timestamp is currently written into the database as a DATE. DATEs are handled by SQLite internally as TEXT (as ISO8601 strings YYYY-MM-DD HH:MM:SS.SSS). This has a way higher precision than we need and is generally quite storage space inefficient. Better would be to store the firstSeen timestamp in Unix time as INTEGER which takes a maximum of 8 bytes. Unix timestamps are supported by the SQLite time and date functions. Additionally INTEGERS should be even better compressable than TEXT.

See https://www.sqlite.org/datatype3.html for more info on datatypes.

0xB10C commented 5 years ago

I did a quick test with TEXT (v1) and INTEGER (v2) timestamps. A database with 800k tx (~2 days) without removed txids is:

143M Oct 28 23:17 v1.db 
125M Oct 28 23:17 v2.db (12.5% smaller) 

With txids removed:

34M Oct 28 23:25 v1.db
17M Oct 28 23:24 v2.db (50% smaller)

Code used:

#!/usr/bin/env python

import sqlite3
import datetime
import time
from hashlib import sha256

connv1 = sqlite3.connect('v1.db')
connv2 = sqlite3.connect('v2.db')

cv1 = connv1.cursor()
cv2 = connv2.cursor()
cv1.execute('CREATE TABLE tx (txid blob unique, firstSeen date)')
cv2.execute('CREATE TABLE tx (txid blob unique, firstSeen int)')

for i in range(800000):
  txid = sha256(str(i)).hexdigest()
  firstSeen = datetime.datetime.now()
  cv1.execute("INSERT INTO tx ('txid', 'firstSeen') VALUES  (?, ?)", (txid, firstSeen))
  cv2.execute("INSERT INTO tx ('txid', 'firstSeen') VALUES  (?, ?)", (txid, int(time.time())))

connv1.commit()
connv2.commit()

connv1.close()
connv2.close()
0xB10C commented 5 years ago

I though a bit about the SQL table structure. This is all still in a WIP state and can be changed. I just wanted to write it down somewhere.

Currently the table where transactions are store is called mempool_tx. The table name should be changed to transaction. The table hold transactions included in blocks and in the mempool. Transactions in the mempool can be identified by not having a entry in the transaction_block table, which is discussed below. The table is missing an id as INTEGER PRIMARY KEY. By default the column INTEGER PRIMARY KEY is used as an auto-increment ROWID by SQLite. The id is used as foreign key in tables like transaction_block or transaction_ancestor. This saves a lot of disk space compared to writing the txid and block hash to the table.

Additionally some of the table colums are INT and can be changed to INTEGER for consistency. For SQLite INTEGER and INT are the same datatype.

-- Table: transaction 
CREATE TABLE "transaction" 
  ( 
     id         INTEGER PRIMARY KEY UNIQUE NOT NULL, 
     txid       BLOB (32), 
     first_seen INTEGER, 
     fee        INTEGER, 
     weight     INTEGER 
  );

The table block doesn't exist yet. The current form is quite minimalist and can be expanded. Similar to the transaction table it has an id.

-- Table: block 
CREATE TABLE block 
  ( 
     id         INTEGER PRIMARY KEY UNIQUE NOT NULL, 
     hash       BLOB (32) UNIQUE, 
     first_seen INTEGER, 
     height     INTEGER 
  ); 

The table transaction_block does not exist yet. It's used to save the relationship between a transaction and a block it's included in. In our model a transaction purposefully can be included in multiple blocks to account for reorgs. We still need a way to identify the most-work-chain. Each entry consists of a transaction_id and block_id as foreign keys to the respective transaction or block. Additionally a block_index is included which represents the index/position of the transaction in the block.

-- Table: transaction_block 
CREATE TABLE transaction_block 
  ( 
     transaction_id INTEGER REFERENCES "transaction" (id) NOT NULL, 
     block_id       INTEGER REFERENCES "block" (id) NOT NULL, 
     block_index    INTEGER NOT NULL 
  ); 

The table transaction_ancestor helps tracking CPFP relationships and transaction packages. It represents a reference from a transaction A (current) to a transaction B (ancestor), where A spends an output from B. Only packages of (at the time of insertion) unconfirmed transactions should be inserted here. (https://github.com/0xB10C/bademeister-go/issues/23)

-- Table: transaction_ancestor
CREATE TABLE transaction_ancestor
  ( 
     current  INTEGER REFERENCES "transaction" (id) NOT NULL, 
     ancestor INTEGER REFERENCES "transaction" (id) NOT NULL 
  ); 

How to store RBF'ed transactions is still open. A single RBF transaction can replace multiple other transactions. Should have another look once we are sure where we get the data from and what data we get. (https://github.com/0xB10C/bademeister-go/issues/24)

OttoAllmendinger commented 5 years ago

WIP branch https://github.com/0xB10C/bademeister-go/tree/issue6.insertblock