Open rdlrt opened 1 year ago
Considering the size of the tx_out table and the amount of repeated data it currently contains, I think this suggestion makes sense.
I haven't found a case where this query takes more than a few ms
cexplorer=# select count(1) from tx_out where stake_address_id = 3350960;
It also looks strange to me for a query with an indexed search to take so long.
I haven't found a case where this query takes more than a few ms
The query was shared just as an example, as the actual ID might vary. The number of records returned for this stake_address is pretty high (~2.7M); you might want to try the below:
select count(1) from tx_out where stake_address_id = (select id from stake_address where view='stake1uxqh9rn76n8nynsnyvf4ulndjv0srcc8jtvumut3989cqmgjt49h6');
Of course - once the filtered result is part of the Postgres cache, the query time drops to a few ms - but that'd be unpredictable (and unrealistic). The gains achieved from the absence of those fields will still be beneficial even if this was not an issue.
I understand the benefits in space and IOPs of splitting into two tables. However, I'm not convinced queries will be faster. Working on two tables requires joins, which can't be faster than working with a single table.
My understanding here might be incorrect, but address_has_script seems like a permanent state rather than a transient state - thus, it does not seem to be the right candidate for the tx_out table.
I'm not following that one. The tx_out table, like all db-sync tables, is permanent in the sense that there are no updates. address_has_script shows a permanent property of the address.
However I'm not convinced queries will be faster.
Querying against address fields would be much faster purely due to the smaller index lookup against tx_id (the index shrinks from eliminating duplicates as well as from the reduced size of the lookup string). IMO this is exactly the same classification as stake_address: having a separate lookup table vs having multiple duplicated fields for each encoding. While the query performance increase will not be massive, it is certainly not negligible, specifically for looking up transactions by addresses (I can drill down into specific use cases if needed, but IMO the gains to db-sync resources even outside of performance are substantial).
The tx_out table, like all db-sync tables, is permanent in the sense that there are no updates. address_has_script shows a permanent property of the address.
IMO - a UTxO does not represent an address, so address_has_script cannot be a permanent state, but a current state. If the intent is to showcase a UTxO, it could instead be better named (but I'm not too fussed about this one).
Overall, I'd reckon splitting addresses out to a separate table is a fairly unanimous decision from the data-savings (and IOPs) point of view alone, while the other points are smaller benefits.
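To make the trade-off concrete, here is a minimal sketch of the split design using Python's sqlite3, with hypothetical table and column names (not db-sync's actual schema): long address strings are stored once, and tx_out rows carry only a small integer key, so an address lookup becomes an indexed integer join.

```python
import sqlite3

# Hypothetical, simplified schema illustrating the proposed normalization.
# Names and types are illustrative only, not db-sync's actual DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    address TEXT UNIQUE,            -- bech32 form, stored once
    address_raw BLOB,
    payment_cred BLOB,
    address_has_script BOOLEAN,
    stake_address_id INTEGER
);
CREATE TABLE tx_out (
    id INTEGER PRIMARY KEY,
    tx_id INTEGER,
    addr_id INTEGER REFERENCES address(id),  -- small FK instead of long strings
    value INTEGER
);
CREATE INDEX idx_tx_out_addr ON tx_out(addr_id);
""")

# One address row, many outputs referencing it by id.
conn.execute("INSERT INTO address VALUES (1, 'addr1xyz', x'01ab', x'ab', 0, 42)")
conn.executemany(
    "INSERT INTO tx_out(tx_id, addr_id, value) VALUES (?, 1, ?)",
    [(t, 1_000_000) for t in range(5)],
)

# Looking up outputs for an address is now a join on an indexed integer key.
rows = conn.execute("""
    SELECT COUNT(*) FROM tx_out
    JOIN address ON address.id = tx_out.addr_id
    WHERE address.address = 'addr1xyz'
""").fetchone()
print(rows[0])  # 5 outputs, with the long address string stored only once
```

The join cost exists, as noted above, but it runs against a compact integer index rather than duplicated multi-byte strings.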
Sorry for the extensive comment, but this change will help me massively. Hopefully others too.
I'm using cardano-graphql with a custom PaymentAddress view + Hasura schema modifications to simulate "some of" what this issue tries to solve:
CREATE OR REPLACE VIEW "PaymentAddress" AS
SELECT
  tx_out.address,
  tx_out.stake_address_id AS "stake_address_id",
  tx_out.address_has_script AS "addressHasScript",
  CASE
    WHEN tx_out.address_has_script IS TRUE THEN tx_out.payment_cred
    ELSE NULL
  END AS "scriptHash",
  CASE
    WHEN tx_out.address_has_script IS NOT TRUE THEN tx_out.payment_cred
    ELSE NULL
  END AS "paymentCredential"
FROM tx_out;
The problem comes when you have to use aggregate with Hasura, or you have to paginate your results. Having to use distinct on (address) every time I want to count results or paginate (limit X offset Y) is extremely slow. Even if I use the distinct on in the view creation (instead of in Hasura), it does not improve much.
I'm also doing a "view calculation" to distinguish between scriptHash and paymentCredential. This becomes really bad with Hasura, since the way it works is to select * from the table and then filter the results, which in the end has psql calculating the scriptHash and paymentCredential columns for every row, even if I only want to extract 10 results (limit 10).
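For context, the statement that ends up running against Postgres is roughly of this shape (an assumed illustration, not Hasura's actual generated SQL):

```sql
-- Assumed shape of the generated pagination query: DISTINCT ON must
-- sort and de-duplicate every matching row before LIMIT/OFFSET applies,
-- and the view's CASE columns are computed for each of those rows.
select distinct on (address) *
from "PaymentAddress"
order by address
limit 10 offset 0;
```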
Example: count total addresses:

query totalPaymentAddresses {
  paymentAddresses_aggregate(distinct_on: address) {
    aggregate {
      count
    }
  }
}
{
  "data": {
    "paymentAddresses_aggregate": {
      "aggregate": {
        "count": 40969996
      }
    }
  }
}
Time: 754549 ms !!!!
Example: listing addresses (smart contracts):

query addressWithScript {
  paymentAddresses(where: {addressHasScript: {_eq: true}}, offset: 0, limit: 2, distinct_on: address) {
    address
    script {
      hash
    }
  }
}
{
  "data": {
    "paymentAddresses": [
      {
        "address": "addr12yqg7sr6pp7ndc6vuq4fj64wgzp9zhh45jv4wngxcf0aqxruzkw47xrh6a0",
        "script": null    <-- null because issue #1357
      },
      {
        "address": "addr1w8008846y0s9l6mfvaa7scefkgvwa7cq4mxlx6m4xs8v62g6e5v0f",
        "script": {
          "hash": "\\xdef39eba23e05feb69677be86329b218eefb00aecdf36b75340ecd29"
        }
      }
    ]
  }
}
Time: 100292 ms !!!!
IMO, separating into two tables will help reduce overall table sizes and index sizes, and improve the versatility of queries since (as shown in the simple example above) we will not depend on distinct on when listing addresses.
BTW, these examples use the cardano-graphql Hasura method, but a plain psql query is extremely slow too, just because of the distinct on over 168,111,506 tx_out records, and increasing...
This is something that would greatly help some of the things we do.
select distinct on (address) address
from tx_out
where address_raw = '\x016b610dbb760e3c9bb0fccd1ba3b1a767dd263cc39cfa8ab5c88481cea47293b3fe38dc733eb39c592bff9b5f88596684da8a2f13a2b3fb7f';
I try to pull as much of my data directly from the db as possible. In the case of a query like this where I want the bech32 address: while it's trivial to convert using python/js/cli, I like to make some of my larger queries "backendless", and looking up an address from a raw one (say, one in a datum) can be painfully slow. With this query a limit 1 could speed it up, but it's no good for querying multiple addrs at a time.
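As an illustration of the "trivial to convert" step, here is a minimal pure-Python Bech32 encoder (per BIP-0173; in practice you would normally use cardano-cli or an existing bech32 library) that renders a raw payment address as an addr1... string. The hrp and sample payload are taken from the query above; this sketch does not validate the address header byte.

```python
# Minimal Bech32 encoder (BIP-0173), sketched to show how a raw address
# bytestring (tx_out.address_raw) can be rendered as bech32 off-database.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"


def bech32_polymod(values):
    generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for value in values:
        top = chk >> 25
        chk = (chk & 0x1FFFFFF) << 5 ^ value
        for i in range(5):
            chk ^= generator[i] if ((top >> i) & 1) else 0
    return chk


def bech32_hrp_expand(hrp):
    return [ord(x) >> 5 for x in hrp] + [0] + [ord(x) & 31 for x in hrp]


def bech32_create_checksum(hrp, data):
    polymod = bech32_polymod(bech32_hrp_expand(hrp) + data + [0] * 6) ^ 1
    return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]


def convertbits(data, frombits, tobits):
    # Regroup a byte stream into 5-bit groups, padding the final group.
    acc, bits, ret = 0, 0, []
    maxv = (1 << tobits) - 1
    for value in data:
        acc = (acc << frombits) | value
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            ret.append((acc >> bits) & maxv)
    if bits:
        ret.append((acc << (tobits - bits)) & maxv)
    return ret


def raw_address_to_bech32(raw: bytes, hrp: str = "addr") -> str:
    data = convertbits(raw, 8, 5)
    checksum = bech32_create_checksum(hrp, data)
    return hrp + "1" + "".join(CHARSET[d] for d in data + checksum)


# The address_raw payload from the query above (hex, minus the \x prefix).
raw = bytes.fromhex(
    "016b610dbb760e3c9bb0fccd1ba3b1a767dd263cc39cfa8ab5c88481ce"
    "a47293b3fe38dc733eb39c592bff9b5f88596684da8a2f13a2b3fb7f"
)
print(raw_address_to_bech32(raw))
```

This runs entirely client-side, so the bech32 form never needs to be fetched (or stored per-row) in the database at all.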
Versions
The db-sync version (eg cardano-db-sync --version): 13.0.5
PostgreSQL version: 14.4

Build/Install Method
The method you use to build or install cardano-db-sync: N.A.

Run method
The method you used to run cardano-db-sync (eg Nix/Docker/systemd/none): systemd

Additional context
The background for this request comes from Koios, for an endpoint that allows users to query addresses associated with a stake_address, which takes too long for stake addresses that have millions of transaction entries - but this applies to almost all query layers that use db-sync. When not part of the Postgres cache, simply querying the observations themselves could take a while (at times 15-50s):
Of course - once in cache, this would run in a few milliseconds. However, the outcome (a unique address list) being sought from the same query was simply 12 addresses.
This made us think of starting to maintain a separate cache table for the stake_address-to-address mapping, but it makes little sense to only do this on the client side rather than in db-sync itself, as the advantages will be greater than this scope.
Problem Report
The request here is to move the address field to its own table, along with the duplicated information that currently resides in the tx_out table. A new table addresses might contain the columns below, which are currently in tx_out:

Similarly, the above columns could be removed from tx_out and replaced with addr_id, which would link to the address.id field.

Advantages:
- address, address_raw and payment_cred can all be derived from the address itself. While the duplication might initially have been for UX, duplicating each of these long char fields per tx does more harm than good in the tx_out table. They can still be maintained in the separate address table.
- address_has_script: My understanding here might be incorrect, but address_has_script seems like a permanent state rather than a transient state - thus, it does not seem to be the right candidate for the tx_out table.
- As per the Additional context section above, the query outcomes will be much faster for such endpoints.
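As a concrete sketch of the request (column names and types are assumptions drawn from the discussion, not an agreed schema), the split could look roughly like:

```sql
-- Hypothetical DDL sketch only; names and types are assumptions.
CREATE TABLE address (
    id                 bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    address            varchar NOT NULL,   -- bech32/Base58 rendering
    address_raw        bytea   NOT NULL,
    payment_cred       bytea,
    address_has_script boolean NOT NULL,
    stake_address_id   bigint REFERENCES stake_address (id),
    UNIQUE (address)
);

-- tx_out would then drop the duplicated columns and keep only a link:
-- addr_id bigint NOT NULL REFERENCES address (id)
```

Each long address string would then be stored once, and tx_out rows would shrink to fixed-width integer references.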