Add sqlite storage backend

OttoAllmendinger commented 5 years ago

My suggestion for the high-level interface:

addTransaction(txid: [32]byte, firstSeen: Time, fee, outputAmount: uint64, vSize: uint32)
- to keep things simple we should defer most tx data lookups to bitcoind, however since fee is a bit expensive to get via RPC we collect it here
addConfirmation(txid: [32]byte, blockHash: [32]byte, time: Time)
- due to reorgs a txs can be confirmed in more than one block
All timestamps are stored as UTC

0xB10C commented 5 years ago

I'd propose a slight modification of the above/further detail the spec:

We should not store the txid and the blockHash as strings. Should use BLOB in SQL and a byte slice in Go instead. The hashes are (next to public keys) the least compressible data used in Bitcoin. SQLite supports BLOB as primary keys. PostgreSQL can handle byte value primary keys as well. This will probably reduce the database size quite a bit.
We should maybe pass well defined structs to functions such as addTransaction(). See draft definitions in https://github.com/0xB10C/bademeister-go/issues/5#issuecomment-546121120.
I'd opt to directly use fee rather than inputAmount, but no strong preference besides fee probably being a bit smaller than inputAmount in terms of storage size required (since SQLite stores things as variable length binary data).
Using uint32 for satoshi amounts in Bitcoin does not work. Needs to be uint64.
Timestamps should be in UTC to avoid confusion down the line.

OttoAllmendinger commented 5 years ago

replaced string with [32]byte (we'll probably introduce a type alias)
replaced inputAmount with fee, use uint64
added note about UTC

I deliberately described the data fields instead of a struct here so we get an idea of what is actually stored. We can have a convenience AddTransaction() method that accepts the wire type. However we should be careful that we don't mix it up with the Transaction type that will result from reading from our storage, which will be very minimal.

OttoAllmendinger commented 5 years ago

Implemented in #14

store (txid, firstSeen) in table transactions

Todo:

add columns for fee, vsize
process confirmations

0xB10C commented 5 years ago

The firstSeen timestamp is currently written into the database as a DATE. DATEs are handled by SQLite internally as TEXT (as ISO8601 strings YYYY-MM-DD HH:MM:SS.SSS). This has a way higher precision than we need and is generally quite storage space inefficient. Better would be to store the firstSeen timestamp in Unix time as INTEGER which takes a maximum of 8 bytes. Unix timestamps are supported by the SQLite time and date functions. Additionally INTEGERS should be even better compressable than TEXT.

See https://www.sqlite.org/datatype3.html for more info on datatypes.

[ ] change datatype to INTEGER and save as Unix timestamp.

0xB10C commented 5 years ago

I did a quick test with TEXT (v1) and INTEGER (v2) timestamps. A database with 800k tx (~2 days) without removed txids is:

143M Oct 28 23:17 v1.db 
125M Oct 28 23:17 v2.db (12.5% smaller)

With txids removed:

34M Oct 28 23:25 v1.db
17M Oct 28 23:24 v2.db (50% smaller)

Code used:

#!/usr/bin/env python

import sqlite3
import datetime
import time
from hashlib import sha256

connv1 = sqlite3.connect('v1.db')
connv2 = sqlite3.connect('v2.db')

cv1 = connv1.cursor()
cv2 = connv2.cursor()
cv1.execute('CREATE TABLE tx (txid blob unique, firstSeen date)')
cv2.execute('CREATE TABLE tx (txid blob unique, firstSeen int)')

for i in range(800000):
  txid = sha256(str(i)).hexdigest()
  firstSeen = datetime.datetime.now()
  cv1.execute("INSERT INTO tx ('txid', 'firstSeen') VALUES  (?, ?)", (txid, firstSeen))
  cv2.execute("INSERT INTO tx ('txid', 'firstSeen') VALUES  (?, ?)", (txid, int(time.time())))

connv1.commit()
connv2.commit()

connv1.close()
connv2.close()

0xB10C commented 5 years ago

I though a bit about the SQL table structure. This is all still in a WIP state and can be changed. I just wanted to write it down somewhere.

Currently the table where transactions are store is called mempool_tx. The table name should be changed to transaction. The table hold transactions included in blocks and in the mempool. Transactions in the mempool can be identified by not having a entry in the transaction_block table, which is discussed below. The table is missing an id as INTEGER PRIMARY KEY. By default the column INTEGER PRIMARY KEY is used as an auto-increment ROWID by SQLite. The id is used as foreign key in tables like transaction_block or transaction_ancestor. This saves a lot of disk space compared to writing the txid and block hash to the table.

Additionally some of the table colums are INT and can be changed to INTEGER for consistency. For SQLite INTEGER and INT are the same datatype.

-- Table: transaction 
CREATE TABLE "transaction" 
  ( 
     id         INTEGER PRIMARY KEY UNIQUE NOT NULL, 
     txid       BLOB (32), 
     first_seen INTEGER, 
     fee        INTEGER, 
     weight     INTEGER 
  );

The table block doesn't exist yet. The current form is quite minimalist and can be expanded. Similar to the transaction table it has an id.

-- Table: block 
CREATE TABLE block 
  ( 
     id         INTEGER PRIMARY KEY UNIQUE NOT NULL, 
     hash       BLOB (32) UNIQUE, 
     first_seen INTEGER, 
     height     INTEGER 
  );

The table transaction_block does not exist yet. It's used to save the relationship between a transaction and a block it's included in. In our model a transaction purposefully can be included in multiple blocks to account for reorgs. We still need a way to identify the most-work-chain. Each entry consists of a transaction_id and block_id as foreign keys to the respective transaction or block. Additionally a block_index is included which represents the index/position of the transaction in the block.

-- Table: transaction_block 
CREATE TABLE transaction_block 
  ( 
     transaction_id INTEGER REFERENCES "transaction" (id) NOT NULL, 
     block_id       INTEGER REFERENCES "block" (id) NOT NULL, 
     block_index    INTEGER NOT NULL 
  );

The table transaction_ancestor helps tracking CPFP relationships and transaction packages. It represents a reference from a transaction A (current) to a transaction B (ancestor), where A spends an output from B. Only packages of (at the time of insertion) unconfirmed transactions should be inserted here. (https://github.com/0xB10C/bademeister-go/issues/23)

-- Table: transaction_ancestor
CREATE TABLE transaction_ancestor
  ( 
     current  INTEGER REFERENCES "transaction" (id) NOT NULL, 
     ancestor INTEGER REFERENCES "transaction" (id) NOT NULL 
  );

How to store RBF'ed transactions is still open. A single RBF transaction can replace multiple other transactions. Should have another look once we are sure where we get the data from and what data we get. (https://github.com/0xB10C/bademeister-go/issues/24)

OttoAllmendinger commented 5 years ago

WIP branch https://github.com/0xB10C/bademeister-go/tree/issue6.insertblock

0xB10C / bademeister-go

Add sqlite storage backend #6