IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL

Remove bech32 data columns (or at least make them optional) #1835

Open rdlrt opened 2 weeks ago

rdlrt commented 2 weeks ago

Problem Report

Currently dbsync stores bech32-encoded strings in various columns across tables.

  1. These text fields are indexed with a hash index, which is sub-optimal because hash index builds cannot run in parallel. As a result, restoring from a snapshot ends up waiting 10-12 hours for the index on tx_out.address, since only a single CPU/worker is used for that step (at least as of Postgres 16, as listed here). See the sketch after this list.

  2. Secondly, the encoding standards are often based on a CIP, and a change in a CIP (eg: CIP-129) would mean dbsync can only adopt it after rewriting previous data, which could be quite an unnecessary data operation. The larger the history grows, the greater the resistance will be to any naming improvement that carries a large data-migration dependency.
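
A minimal sketch of the indexing difference described in point 1, assuming a hypothetical raw-bytes column address_raw alongside the current text column; the index names are illustrative, not the actual db-sync schema:

-- Illustrative only: index and column names are assumptions, not the real schema.
-- A hash index on the text address column is built by a single worker:
CREATE INDEX idx_tx_out_address ON tx_out USING hash (address);

-- A btree index on a raw bytea column can use parallel workers
-- (Postgres parallelises only btree index builds, at least as of Postgres 16):
SET max_parallel_maintenance_workers = 8;
CREATE INDEX idx_tx_out_address_raw ON tx_out USING btree (address_raw);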

The request is for dbsync not to store bech32 data at all, or, if it must, to at least make those columns optional. Instead, consumers can do the decoding/encoding in the WHERE clause of their queries by making use of extensions (eg: pg_bech32, pg_base58, etc.), as sketched below. While the address field itself is hopefully moving to its own table soon as part of #1820, it is still overkill to store a duplicate copy of essentially the same data.
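
A sketch of that consumer-side pattern. The functions bech32_decode_data / bech32_encode_data and the address_raw column are placeholders; the exact names depend on what the pg_bech32 / pg_base58 extensions actually expose:

-- Hypothetical: decode the user-supplied bech32 string once and filter on raw bytes.
SELECT tx_id, value
  FROM tx_out
 WHERE address_raw = bech32_decode_data('addr1...');  -- placeholder function and literal

-- Hypothetical: re-encode on the way out when the text form is wanted.
SELECT bech32_encode_data('addr', address_raw) AS address
  FROM tx_out
 LIMIT 10;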

Positive Outcomes

Cons

gufmar commented 2 weeks ago

  • Lookup on bytea being faster than text

This alone sounds like a good reason. I generally like the proposed change.

Fell-x27 commented 2 weeks ago

It's not just about Bech32; it's about text address representation in general. So, this applies to both Bech32 for Shelley and Base58 for Byron.

I don't think removal is a good idea, but making them optional would be the best solution. In fact, raw address data is sufficient for everything. It can be indexed naturally without hashing, and it takes up less storage space, even if we're talking about a separate table for addresses in the future.

Cardano is about CBOR, not text. If you need to find any text address in the database, just convert it to raw data—whether on the client/backend side or even at the PostgreSQL level. This will be even faster than implicit text hashing.

Moreover, as for Shelley, we don't use text view at all due to the nature of the addresses. We only need the spending part, which is already stored as binary data. If we need to build a text address, we do it manually by deriving stake keys ourselves to ensure they are correct, and then we build the full address.

It might be a good idea to use the payment_cred column for Byron addresses too and define an explicit column for the address era. Of course, we can determine whether it's Byron or not by certain characteristics like the length of the binary, but this is not a reliable solution—it's an opportunistic workaround. Tomorrow, there could be a fourth address type, and this hack would no longer be effective.
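
A rough sketch of the two approaches contrasted here. The address_era column and the header-nibble check are illustrative assumptions (the nibble test relies on Byron bootstrap addresses occupying header type 8 per CIP-19), not a schema proposal:

-- Heuristic: infer the era from the raw bytes (works today, but it is exactly the
-- kind of opportunistic check criticised above). Assumes a hypothetical address_raw column.
SELECT CASE WHEN (get_byte(address_raw, 0) >> 4) = 8
            THEN 'byron' ELSE 'shelley' END AS address_era
  FROM tx_out;

-- Explicit alternative: record the era once at insert time.
ALTER TABLE tx_out ADD COLUMN address_era smallint;  -- e.g. 0 = byron, 1 = shelley, ...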

kderme commented 2 weeks ago

A bit unfortunate that we recently removed the raw bytea address, on the assumption that the text address is used more frequently.

These text fields are indexed as hash type

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

rdlrt commented 2 weeks ago

It doesn't really change for bytea, especially since the size of addresses can be quite big due to older bugs.

The bytea fields can still be btree-indexed, as is done for the block/tx tables, which allows parallel index builds.

kderme commented 2 weeks ago

btree indexed data must not exceed 2704 bytes, but we have

select max (length (address)) from tx_out;
  max  
-------
 30864

Raw bytes are of course slightly smaller, but still. Also, hash indexes are much faster to query in cases like that.

rdlrt commented 2 weeks ago

Ugh, those 17 records on mainnet are from the Byron era :( and pointer addresses are now retired. Do we have to live with that indexing cost on Cardano forever? Currently (with 5 years of history) that index creation takes 8-12 hours... in the coming years it could take multiple days for a single index.

kderme commented 2 weeks ago

There should be some workaround; I haven't put much thought into it. We could insert a hash of the address when it's over the limit and introduce a new huge_address table. Just thinking out loud, I'm not sure if this would cause too much disturbance for users.
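
A rough placeholder sketch of what that could look like; the table, columns, and the choice of a sha256 digest are assumptions, not a concrete proposal (digest() is from the pgcrypto extension):

-- Placeholder sketch of the huge_address idea.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE huge_address (
    hash    bytea PRIMARY KEY,  -- e.g. digest(address, 'sha256') of the oversized address
    address bytea NOT NULL      -- the full raw address, too large for a btree index entry
);

-- tx_out would then store the fixed-size 32-byte digest for the handful of
-- oversized addresses, keeping every indexed value well under the ~2704-byte
-- btree row limit so the address index can be built as a parallel btree.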

rdlrt commented 2 weeks ago

Yeah - that was my preference too (to split them out to a separate table or column, especially with the address detail moving out). Move those anomalies out and be able to optimise tx_out better. I didn't find a single hit for those addresses in the query logs for Koios either.