Open rdlrt opened 1 year ago
Considering the size of the tx_out table and the amount of repeated data it currently contains, I think this suggestion makes sense.
I haven't found a case where this query takes more than a few ms
cexplorer=# select count(1) from tx_out where stake_address_id = 3350960;
It also looks strange to me for a query with an indexed search to take so long.
I haven't found a case where this query takes more than a few ms
The query was shared just as an example, as the actual ID might vary. The number of records returned for this stake_address is pretty high (~2.7M); you might want to try the below:
select count(1) from tx_out where stake_address_id = (select id from stake_address where view='stake1uxqh9rn76n8nynsnyvf4ulndjv0srcc8jtvumut3989cqmgjt49h6');
Of course - once the filtered result is part of the Postgres cache, the query time drops to a few ms - but that'd be unpredictable (and unrealistic). The gains achieved from the absence of those fields will still be beneficial even if this was not an issue.
I understand the benefits in space and IOPs of splitting into two tables. However, I'm not convinced queries will be faster. Working on two tables requires joins, which can't be faster than working with a single table.
My understanding here might be incorrect, but address_has_script seems like a permanent state rather than a transient state - thus, it does not seem to be the right candidate for the tx_out table.
I'm not following that one. The tx_out table, like all db-sync tables, is permanent in the sense that there are no updates. address_has_script shows a permanent property of the address.
However I'm not convinced queries will be faster.
Querying against address fields would be much faster purely due to the smaller index lookup against tx_id (the index shrinks from eliminating duplicates as well as from the reduced size of the lookup string). IMO this is exactly the same classification as stake_address: having a separate lookup table vs having multiple duplicated fields for each encoding. While the query performance increase will not be massive, it is certainly not negligible, specifically for looking up transactions by addresses (I can drill down into specific use cases if needed, but IMO the gains to db-sync resources even outside of performance are substantial).
The tx_out table, like all db-sync tables, is permanent in the sense that there are no updates. address_has_script shows a permanent property of the address.
IMO - a UTxO does not represent an address, so address_has_script cannot be a permanent state, but a current state. If the intent is to showcase a UTxO, it could instead be better named (but I'm not too fussed about this one).
Overall, I'd reckon splitting addresses out to a separate table is a fairly unanimous decision from the data-savings (and IOPs) point of view alone, while the other points are smaller benefits.
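To make the trade-off concrete, here is a minimal sketch of the split design using Python's sqlite3, with hypothetical table and column names (not db-sync's actual schema): long address strings are stored once, and tx_out rows carry only a small integer key, so an address lookup becomes an indexed integer join.

```python
import sqlite3

# Hypothetical, simplified schema illustrating the proposed normalization.
# Names and types are illustrative only, not db-sync's actual DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    address TEXT UNIQUE,            -- bech32 form, stored once
    address_raw BLOB,
    payment_cred BLOB,
    address_has_script BOOLEAN,
    stake_address_id INTEGER
);
CREATE TABLE tx_out (
    id INTEGER PRIMARY KEY,
    tx_id INTEGER,
    addr_id INTEGER REFERENCES address(id),  -- small FK instead of long strings
    value INTEGER
);
CREATE INDEX idx_tx_out_addr ON tx_out(addr_id);
""")

# One address row, many outputs referencing it by id.
conn.execute("INSERT INTO address VALUES (1, 'addr1xyz', x'01ab', x'ab', 0, 42)")
conn.executemany(
    "INSERT INTO tx_out(tx_id, addr_id, value) VALUES (?, 1, ?)",
    [(t, 1_000_000) for t in range(5)],
)

# Looking up outputs for an address is now a join on an indexed integer key.
rows = conn.execute("""
    SELECT COUNT(*) FROM tx_out
    JOIN address ON address.id = tx_out.addr_id
    WHERE address.address = 'addr1xyz'
""").fetchone()
print(rows[0])  # 5 outputs, with the long address string stored only once
```

The join cost exists, as noted above, but it runs against a compact integer index rather than duplicated multi-byte strings.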
Sorry for the extensive comment, but this change will help me massively. Hopefully others too.
I'm using cardano-graphql with a custom PaymentAddress view + Hasura schema modifications to simulate "some of" what this issue tries to solve:
CREATE OR REPLACE VIEW "PaymentAddress" AS
SELECT
  tx_out.address,
  tx_out.stake_address_id AS "stake_address_id",
  tx_out.address_has_script AS "addressHasScript",
  CASE
    WHEN tx_out.address_has_script IS TRUE THEN tx_out.payment_cred
    ELSE NULL
  END AS "scriptHash",
  CASE
    WHEN tx_out.address_has_script IS NOT TRUE THEN tx_out.payment_cred
    ELSE NULL
  END AS "paymentCredential"
FROM tx_out;
The problem comes when you have to use aggregate with Hasura, or you have to paginate your results. Having to use distinct on (address) every time I want to count results or paginate (limit X offset Y) is extremely slow. Even if I use the distinct on in the view creation (instead of in Hasura), it does not improve much.
I'm also doing a "view calculation" to distinguish between scriptHash and paymentCredential. This becomes really bad with Hasura, since the way it works is to select * from the table and then filter the results, which in the end has psql calculating the scriptHash and paymentCredential columns for every row, even if I only want to extract 10 results (limit 10).
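For context, the statement that ends up running against Postgres is roughly of this shape (an assumed illustration, not Hasura's actual generated SQL):

```sql
-- Assumed shape of the generated pagination query: DISTINCT ON must
-- sort and de-duplicate every matching row before LIMIT/OFFSET applies,
-- and the view's CASE columns are computed for each of those rows.
select distinct on (address) *
from "PaymentAddress"
order by address
limit 10 offset 0;
```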
Example: count total addresses:

query totalPaymentAddresses {
  paymentAddresses_aggregate(distinct_on: address) {
    aggregate {
      count
    }
  }
}
{
  "data": {
    "paymentAddresses_aggregate": {
      "aggregate": {
        "count": 40969996
      }
    }
  }
}
Time: 754549 ms !!!!
Example: listing addresses (smart contracts):

query addressWithScript {
  paymentAddresses(where: {addressHasScript: {_eq: true}}, offset: 0, limit: 2, distinct_on: address) {
    address
    script {
      hash
    }
  }
}
{
  "data": {
    "paymentAddresses": [
      {
        "address": "addr12yqg7sr6pp7ndc6vuq4fj64wgzp9zhh45jv4wngxcf0aqxruzkw47xrh6a0",
        "script": null    <-- null because issue #1357
      },
      {
        "address": "addr1w8008846y0s9l6mfvaa7scefkgvwa7cq4mxlx6m4xs8v62g6e5v0f",
        "script": {
          "hash": "\\xdef39eba23e05feb69677be86329b218eefb00aecdf36b75340ecd29"
        }
      }
    ]
  }
}
Time: 100292 ms !!!!
IMO, separating into two tables will help reduce overall table sizes and index sizes, and improve the versatility of queries since (as shown in the simple example above) we will not depend on distinct on when listing addresses.
BTW, these examples use the cardano-graphql Hasura method, but a plain psql query is extremely slow too, just because of the distinct on over 168,111,506 tx_out records, and increasing...
This is something that would greatly help some of the things we do.
select distinct on (address) address
from tx_out
where address_raw = '\x016b610dbb760e3c9bb0fccd1ba3b1a767dd263cc39cfa8ab5c88481cea47293b3fe38dc733eb39c592bff9b5f88596684da8a2f13a2b3fb7f';
I try to pull as much of my data directly from the db as possible. In the case of a query like this where I want the bech32 address: while it's trivial to convert using python/js/cli, I like to make some of my larger queries "backendless", and looking up an address from a raw one (say, one in a datum) can be painfully slow. With this query a limit 1 could speed it up, but it's no good for querying multiple addrs at a time.
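As an illustration of the "trivial to convert" step, here is a minimal pure-Python Bech32 encoder (per BIP-0173; in practice you would normally use cardano-cli or an existing bech32 library) that renders a raw payment address as an addr1... string. The hrp and sample payload are taken from the query above; this sketch does not validate the address header byte.

```python
# Minimal Bech32 encoder (BIP-0173), sketched to show how a raw address
# bytestring (tx_out.address_raw) can be rendered as bech32 off-database.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"


def bech32_polymod(values):
    generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for value in values:
        top = chk >> 25
        chk = (chk & 0x1FFFFFF) << 5 ^ value
        for i in range(5):
            chk ^= generator[i] if ((top >> i) & 1) else 0
    return chk


def bech32_hrp_expand(hrp):
    return [ord(x) >> 5 for x in hrp] + [0] + [ord(x) & 31 for x in hrp]


def bech32_create_checksum(hrp, data):
    polymod = bech32_polymod(bech32_hrp_expand(hrp) + data + [0] * 6) ^ 1
    return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]


def convertbits(data, frombits, tobits):
    # Regroup a byte stream into 5-bit groups, padding the final group.
    acc, bits, ret = 0, 0, []
    maxv = (1 << tobits) - 1
    for value in data:
        acc = (acc << frombits) | value
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            ret.append((acc >> bits) & maxv)
    if bits:
        ret.append((acc << (tobits - bits)) & maxv)
    return ret


def raw_address_to_bech32(raw: bytes, hrp: str = "addr") -> str:
    data = convertbits(raw, 8, 5)
    checksum = bech32_create_checksum(hrp, data)
    return hrp + "1" + "".join(CHARSET[d] for d in data + checksum)


# The address_raw payload from the query above (hex, minus the \x prefix).
raw = bytes.fromhex(
    "016b610dbb760e3c9bb0fccd1ba3b1a767dd263cc39cfa8ab5c88481ce"
    "a47293b3fe38dc733eb39c592bff9b5f88596684da8a2f13a2b3fb7f"
)
print(raw_address_to_bech32(raw))
```

This runs entirely client-side, so the bech32 form never needs to be fetched (or stored per-row) in the database at all.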
Versions
The db-sync version (eg cardano-db-sync --version): 13.0.5
PostgreSQL version: 14.4

Build/Install Method
The method you use to build or install cardano-db-sync: N.A.

Run method
The method you used to run cardano-db-sync (eg Nix/Docker/systemd/none): systemd

Additional context
The background for this request comes from Koios, for an endpoint that allows users to query addresses associated with a stake_address, which takes too long for stake addresses that have millions of transaction entries - but this applies to almost all query layers that use db-sync. When not part of the Postgres cache, simply querying the observations themselves could take a while (at times 15-50s):
Of course - once in cache, this would run in a few milliseconds. However, the outcome (a unique address list) being sought from the same query was simply 12 addresses.
This made us think of starting to maintain a separate cache table for the stake_address-to-address mapping, but it makes little sense to only do this on the client side rather than in db-sync itself, as the advantages will be greater than this scope.
Problem Report
The request here is to move the address field to its own table, along with the duplicated information that currently resides in the tx_out table. A new table addresses might contain the columns below, which are currently in tx_out:

Similarly, the above columns could be removed from tx_out and replaced with addr_id, which would link to the address.id field.

Advantages:
- address, address_raw and payment_cred can all be derived from the address itself. While the duplication might initially have been for UX, duplicating each of these long char fields per tx does more harm than good in the tx_out table. They can still be maintained in the separate address table.
- address_has_script: My understanding here might be incorrect, but address_has_script seems like a permanent state rather than a transient state - thus, it does not seem to be the right candidate for the tx_out table.
- As per the Additional context section above, the query outcomes will be much faster for such endpoints.
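As a concrete sketch of the request (column names and types are assumptions drawn from the discussion, not an agreed schema), the split could look roughly like:

```sql
-- Hypothetical DDL sketch only; names and types are assumptions.
CREATE TABLE address (
    id                 bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    address            varchar NOT NULL,   -- bech32/Base58 rendering
    address_raw        bytea   NOT NULL,
    payment_cred       bytea,
    address_has_script boolean NOT NULL,
    stake_address_id   bigint REFERENCES stake_address (id),
    UNIQUE (address)
);

-- tx_out would then drop the duplicated columns and keep only a link:
-- addr_id bigint NOT NULL REFERENCES address (id)
```

Each long address string would then be stored once, and tx_out rows would shrink to fixed-width integer references.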