IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL

Remove bech32 data columns (or at least make them optional) #1835

Open rdlrt opened 2 weeks ago

rdlrt commented 2 weeks ago

Problem Report

Currently dbsync stores bech32-encoded strings in various columns across tables.

  1. These text fields are indexed with a hash index, which is sub-optimal because hash index builds cannot run in parallel. As a result, restoring from a snapshot ends up waiting 10-12 hours for the index on tx_out.address, since only a single CPU/worker is used for that step (at least as of Postgres 16, as listed here). See the sketch after this list.

  2. Secondly, the encoding standards are often based on a CIP, and a change in a CIP (eg: CIP-129) would mean dbsync can only adopt it after rewriting previous data, which could be quite an unnecessary data operation. The larger the history grows, the greater the resistance will be to any naming improvement that carries a large data-migration dependency.
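
A minimal sketch of the indexing difference described in point 1, assuming a hypothetical raw-bytes column address_raw alongside the current text column; the index names are illustrative, not the actual db-sync schema:

-- Illustrative only: index and column names are assumptions, not the real schema.
-- A hash index on the text address column is built by a single worker:
CREATE INDEX idx_tx_out_address ON tx_out USING hash (address);

-- A btree index on a raw bytea column can use parallel workers
-- (Postgres parallelises only btree index builds, at least as of Postgres 16):
SET max_parallel_maintenance_workers = 8;
CREATE INDEX idx_tx_out_address_raw ON tx_out USING btree (address_raw);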

The request is for dbsync not to store bech32 data at all, or, if it must, to at least make those columns optional. Instead, consumers can do the decoding/encoding in the WHERE clause of their queries by making use of extensions (eg: pg_bech32, pg_base58, etc.), as sketched below. While the address field itself is hopefully moving to its own table soon as part of #1820, it is still overkill to store a duplicate copy of essentially the same data.
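
A sketch of that consumer-side pattern. The functions bech32_decode_data / bech32_encode_data and the address_raw column are placeholders; the exact names depend on what the pg_bech32 / pg_base58 extensions actually expose:

-- Hypothetical: decode the user-supplied bech32 string once and filter on raw bytes.
SELECT tx_id, value
  FROM tx_out
 WHERE address_raw = bech32_decode_data('addr1...');  -- placeholder function and literal

-- Hypothetical: re-encode on the way out when the text form is wanted.
SELECT bech32_encode_data('addr', address_raw) AS address
  FROM tx_out
 LIMIT 10;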

Positive Outcomes

Cons

gufmar commented 2 weeks ago

  • Lookup on bytea being faster than text

This alone sounds like a good reason. I generally like the proposed change.

Fell-x27 commented 2 weeks ago

It's not just about Bech32; it's about text address representation in general. So, this applies to both Bech32 for Shelley and Base58 for Byron.

I don't think removal is a good idea, but making them optional would be the best solution. In fact, raw address data is sufficient for everything. It can be indexed naturally without hashing, and it takes up less storage space, even if we're talking about a separate table for addresses in the future.

Cardano is about CBOR, not text. If you need to find any text address in the database, just convert it to raw data—whether on the client/backend side or even at the PostgreSQL level. This will be even faster than implicit text hashing.

Moreover, as for Shelley, we don't use text view at all due to the nature of the addresses. We only need the spending part, which is already stored as binary data. If we need to build a text address, we do it manually by deriving stake keys ourselves to ensure they are correct, and then we build the full address.

It might be a good idea to use the payment_cred column for Byron addresses too and define an explicit column for the address era. Of course, we can determine whether it's Byron or not by certain characteristics like the length of the binary, but this is not a reliable solution—it's an opportunistic workaround. Tomorrow, there could be a fourth address type, and this hack would no longer be effective.
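
A rough sketch of the two approaches contrasted here. The address_era column and the header-nibble check are illustrative assumptions (the nibble test relies on Byron bootstrap addresses occupying header type 8 per CIP-19), not a schema proposal:

-- Heuristic: infer the era from the raw bytes (works today, but it is exactly the
-- kind of opportunistic check criticised above). Assumes a hypothetical address_raw column.
SELECT CASE WHEN (get_byte(address_raw, 0) >> 4) = 8
            THEN 'byron' ELSE 'shelley' END AS address_era
  FROM tx_out;

-- Explicit alternative: record the era once at insert time.
ALTER TABLE tx_out ADD COLUMN address_era smallint;  -- e.g. 0 = byron, 1 = shelley, ...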

kderme commented 2 weeks ago

A bit unfortunate that we recently removed the raw bytea address, on the assumption that the text address is used more frequently.

These text fields are indexed as hash type

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

rdlrt commented 2 weeks ago

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

The bytea fields can still be btree-indexed, as is done for the block/tx tables, which allows parallel index builds.

kderme commented 2 weeks ago

btree indexed data must not exceed 2704 bytes, but we have

select max (length (address)) from tx_out;
  max  
-------
 30864

Raw bytes are of course slightly smaller, but still. Also, hash indexes are much faster to query in cases like that.

rdlrt commented 2 weeks ago

Ugh, those 17 records on mainnet are from the Byron era :( and pointer addresses are now retired. Do we have to live with that indexing cost on Cardano forever? Currently (with 5 years of history) that index creation takes 8-12 hours... in the coming years it could take multiple days for a single index.

kderme commented 2 weeks ago

There should be some workaround; I haven't put much thought into it. We could insert a hash of the address when it's over the limit and introduce a new huge_address table. Just thinking out loud, I'm not sure if this would cause too much disturbance for users.
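
A rough placeholder sketch of what that could look like; the table, columns, and the choice of a sha256 digest are assumptions, not a concrete proposal (digest() is from the pgcrypto extension):

-- Placeholder sketch of the huge_address idea.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE huge_address (
    hash    bytea PRIMARY KEY,  -- e.g. digest(address, 'sha256') of the oversized address
    address bytea NOT NULL      -- the full raw address, too large for a btree index entry
);

-- tx_out would then store the fixed-size 32-byte digest for the handful of
-- oversized addresses, keeping every indexed value well under the ~2704-byte
-- btree row limit so the address index can be built as a parallel btree.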

rdlrt commented 2 weeks ago

Yeah - that was my preference too (to split them out to a separate table or column, especially with the address detail moving out). Move those anomalies out and be able to optimise tx_out better. I didn't find a single hit for those addresses in the query logs for Koios either.