Is that an electro-magnetic disk or an SSD? EM disks are significantly slower than SSDs.
SSD (via EBS)
But before the upgrade, it was working fine on 4.0.0...
Do you have IO throughput numbers for your storage? And is the storage directly connected to the machine or is it on the LAN?
If I understand the AWS EBS docs correctly, this volume would give me 128 MiB/s. I believe it's somehow networked but I never had an issue with that.
Still, that's indeed a bit slow. But why would this have become an issue only after the upgrade?
Also, I tried something else now: upgrading to an instance with 8 cores instead of 4. To my surprise that solved the problem, but it's not really logical to me why that would be the case. (plus, it's very expensive and I'd prefer not to keep it this way...)
Disk speed would have sounded much more logical but the number of CPU cores shouldn't affect the issue that much then, right? I'm confused.
4.0.0 stored much, much less data in the database than 10.0.x. 4 CPU cores is sufficient.
There is simply no reason why 8 cores would be faster. I suspect that it's simply a different machine with faster IO.
Hm. Thank you, I'll upgrade to a larger volume then (which comes with higher iops as well) and see.
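As a rough illustration of the gp2 scaling being referred to here (larger volume means more baseline IOPS and, past a threshold, more throughput), a sketch based on my reading of the EBS volume-types docs linked further down; the exact figures may have changed, so treat this as illustrative rather than authoritative:

```python
# Rough gp2 sizing math, based on my reading of the EBS volume-types docs;
# figures are illustrative and may be out of date.

def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 baseline is 3 IOPS per GiB, floored at 100 and capped at 16,000.
    return min(max(3 * size_gib, 100), 16_000)

def gp2_max_throughput_mib_s(size_gib: int) -> int:
    # Small gp2 volumes top out around 128 MiB/s; larger ones around 250 MiB/s.
    return 128 if size_gib <= 170 else 250

for size_gib in (160, 500, 1000):
    print(f"{size_gib} GiB -> {gp2_baseline_iops(size_gib)} baseline IOPS, "
          f"{gp2_max_throughput_mib_s(size_gib)} MiB/s max throughput")
# 160 GiB -> 480 baseline IOPS, 128 MiB/s max throughput
# 500 GiB -> 1500 baseline IOPS, 250 MiB/s max throughput
# 1000 GiB -> 3000 baseline IOPS, 250 MiB/s max throughput
```

So a 160 GB gp2 volume does match the 128 MiB/s estimate above, but its sustained IOPS baseline is only a few hundred once burst credits run out.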
Hi @CherryDT, are you using gp2 or gp3? Maybe you can try gp3 with higher IOPS and you don't have to increase the volume size.
@CherryDT If this is sorted out now, please close this ticket.
It seems the 8 core thing worked only temporarily/randomly, the issue came back.
Anyway, I have switched over to a gp3 volume with 6000 IOPS and 500 MB/s throughput. But the problem still exists, although it's "a bit" better - it's now lagging only 10-15 blocks behind, but still doesn't fully catch up even after hours. CPU usage is now 90-100% by cardano-db-sync, and 2x 40-50% by two postgres processes (there are more postgres processes but they use less CPU).
Those two processes have the names "postgres: postgres cexplorer 172.18.0.5(38560) INSERT" and "postgres cexplorer 172.18.0.5(39832) UPDATE".
8 cores is more than double what is actually required. When you first moved machines, I suspect you had better IO, and that is what improved; then something changed, dropping you back to the old performance.
This storage is directly connected to the machine and not via a LAN, right?
I just checked, it shows up as an NVMe device directly.
@rhyslbw ?
@CherryDT Based on my testing for running ADATools.io, I've found AWS largely unusable / extremely expensive. Sometimes queries time out or never complete. I couldn't see anywhere what instance type you're using, but keep in mind that AWS loves to use crediting systems, i.e. you can have "burstable" instances but they drop off dramatically once you've used the credits. While it might say that you're using high CPU resources, that can be deceptive, as it can also mean work is stuck in the queue for the CPU to process while it waits for something else (network or disk).
Based on my tests, the first thing you need to prioritise is your disk IOPS, and 6000 is actually really small for DB usage of this nature. As a reference, DigitalOcean and Vultr can consistently hit above 100,000 SSD IOPS. I've found Vultr's bare-metal servers to be the best value for money, and since you have the whole box you get consistent resources. Here's an article that shows the IOPS difference between AWS and DO: https://dzone.com/articles/iops-benchmarking-disk-io-aws-vs-digitalocean
The second thing to keep an eye on is your CPU frequency. Since a lot of the workload prefers single-thread performance, a higher-frequency CPU can have better outcomes than more cores. That's another reason I prefer DigitalOcean and Vultr; they both offer high-frequency CPU options.
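If anyone wants to compare their own volume against numbers like these, here is a minimal sketch that wraps fio (assuming fio is installed; the test-file path is a placeholder and should sit on the same filesystem Postgres uses, and a single 4k random-read job is only a rough proxy for the mixed db-sync/Postgres workload):

```python
# Minimal fio wrapper to estimate random-read IOPS on a given filesystem.
# The path below is a placeholder; point it at the Postgres data volume.
import json
import subprocess

def measure_random_read_iops(path: str = "/var/lib/postgresql/fio-test",
                             runtime_s: int = 30) -> float:
    result = subprocess.run(
        ["fio", "--name=randread", f"--filename={path}", "--size=1G",
         "--rw=randread", "--bs=4k", "--iodepth=32", "--ioengine=libaio",
         "--direct=1", "--time_based", f"--runtime={runtime_s}",
         "--output-format=json"],
        capture_output=True, check=True, text=True)
    data = json.loads(result.stdout)
    # fio's JSON output lists one entry per job; we only ran one.
    return data["jobs"][0]["read"]["iops"]

if __name__ == "__main__":
    print(f"~{measure_random_read_iops():.0f} random-read IOPS")
```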
Thank you, I'll try DO or Vultr... Maybe the iops and/or speed should be part of the minimum system requirements in the docs then...?
Unfortunately there's no direct formula to decide what a minimum IOPS would be, as it all depends on how complex your queries are. E.g., doing a simple transaction lookup would be a lot quicker than summing all transaction outputs, which means one person might only need 5,000 IOPS while someone else needs 30,000 IOPS.
IOPS is just another aspect of system design; the diversity of Cardano just makes it more obvious.
Hm, but here the issue is that the DB doesn't even get fully synced because of it, so the baseline of "no queries at all yet" should probably have some number on it... (which I appear not to be reaching)
I wouldn't know what a minimum IOPS would be as I've always gone for systems that are above 65,000 IOPS. It could run on less, but I wouldn't know.
I now have the same issue.
Node is synced; only db-sync is actually ~40 minutes behind (not stuck, still inserting blocks, but slowly).
v10, bare metal, NVMe, 48 cores, 50% CPU usage (total, so there is no issue with CPU overload).
Isn't this related to reward calculations, since we are in the middle of an epoch?
So db-sync is 40 minutes behind the node, for 40 minutes? How far behind?
@erikd At the time of writing it was 40 minutes behind the node (average ~30 minutes), lagging for maybe 4 hours in total. At some points it was 60 minutes behind the node, later 20 minutes behind, averaging maybe 30 minutes. Now it is okay.
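Since "how far behind" keeps coming up, one way to put a number on it is to compare the newest block time db-sync has written with the wall clock. A sketch, assuming the usual db-sync schema where the block table has a time column (check your schema version); the connection details are placeholders:

```python
# Measure db-sync lag in minutes by comparing the latest block time in the
# database with the current time. Assumes the standard db-sync "block" table
# with a "time" column; connection parameters are placeholders.
from datetime import datetime, timezone
import psycopg2

conn = psycopg2.connect(dbname="cexplorer", user="postgres", host="127.0.0.1")
with conn, conn.cursor() as cur:
    cur.execute("SELECT max(time) FROM block;")
    latest = cur.fetchone()[0]

# db-sync stores block times in UTC; attach the timezone before subtracting.
lag = datetime.now(timezone.utc) - latest.replace(tzinfo=timezone.utc)
print(f"db-sync tip is {lag.total_seconds() / 60:.1f} minutes behind wall-clock time")
```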
@erikd btw is db-sync v10 compatible with cardano-node 1.29.0?
@cardanians Sorry, not sure.
We are also encountering this issue - dbsync seems to be struggling to insert blocks at the same speed as the node is receiving them.
Are there any debugging logs we can turn on to discover the source of latency?
PG is co-located on the same physical machine as the node and dbsync. CPU usage < 30%, RAM usage < 50%, disk IO queue is very small, no spikes...
@stefanofranz Machine specs?
Ryzen 5900, 64gb ram and 1tb 970 EVO plus SSD x2 in raid 1
pg.conf
# DB Version: 13
# OS Type: linux
# DB Type: dw
# Total Memory (RAM): 24 GB
# CPUs num: 4
# Connections num: 20
# Data Storage: ssd
max_connections = 20
shared_buffers = 6GB
effective_cache_size = 18GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 157286kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2
synchronous_commit = off
listen_addresses = '*'
Raid 1 is mirroring. I would expect that to be slower. Would you be able to try this without mirroring?
yes - I can turn it off, but it's very unlikely to be the problem - I've benchmarked PG and it's recording between 50-100K IOPS with latencies in < 1ms on average. The disk queue is effectively empty all the time.
It just seems that dbsync itself is running slow, is there a way I can print out the time it's taking to do PG ops? The next step will be to install monitoring in PG itself.
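On timing the PG operations: one option that doesn't touch db-sync itself is the pg_stat_statements extension. A sketch of what that could look like, assuming the extension is (or can be) enabled via shared_preload_libraries and CREATE EXTENSION, with placeholder connection details; the column names below match PostgreSQL 13 (older versions use total_time instead of total_exec_time):

```python
# List the statements Postgres spends the most total time on, via
# pg_stat_statements. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="cexplorer", user="postgres", host="127.0.0.1")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT calls,
               round(total_exec_time::numeric, 1) AS total_ms,
               round(mean_exec_time::numeric, 2)  AS mean_ms,
               left(query, 80)                    AS query
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10;
    """)
    for calls, total_ms, mean_ms, query in cur.fetchall():
        print(f"{calls:>10} calls  {total_ms:>12} ms total  {mean_ms:>8} ms avg  {query}")
```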
Is the pg.conf from this git setup or is this your own Postgres setup config? I haven't changed or seen db-sync configs at all so I'm not sure what the defaults are, but max_connections = 20 seems small to me for any database.
It came from: https://pgtune.leopard.in.ua/#/ which has always worked very well - I set connections to 20 as it is ONLY used for dbsync
I have not seen this. Until someone figures out how to reproduce it, I cannot debug it.
Has it been established that it is db-sync that is falling behind cardano-node? I'm wondering if it's a possibility that cardano-node is the one that's falling behind.
At least in my case I believed it to be db-sync because in the log there were messages that indicated new block X being the new tip accompanied by messages that indicated block X-80 being processed, and there were also PG processes using CPU
yes just like with @CherryDT - my node is on say block X, db-sync says it's inserting ShelleyBlock X-500 or so, with the number being different at different times.
by the way my client is working on getting a Vultr account... I'll let you know once it's all set up whether it's now working well...
@stefanofranz When you say it's only used for db-sync, do you mean you have no application connections to Postgres at all?
I can understand db-sync falling behind if it's fighting for a connection to Postgres while applications are hogging the connections. To me this is the most likely explanation for what's happening, especially if it changes throughout the day based on application peak loads from users.
If it's just sitting there and all it's doing is inserting a block every 20 seconds and it's somehow falling behind, that's a weird one.
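For anyone wanting to test the connection-contention theory directly, a quick check along these lines (placeholder connection details) would show whether the max_connections = 20 limit is ever actually being hit:

```python
# Compare current Postgres connections against max_connections.
# Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="cexplorer", user="postgres", host="127.0.0.1")
with conn, conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
    cur.execute("SELECT count(*), count(*) FILTER (WHERE state = 'active') "
                "FROM pg_stat_activity;")
    total, active = cur.fetchone()
print(f"{total}/{limit} connections in use, {active} currently active")
```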
On Vultr it works OK now, it seems.
I still think there should be a big warning about this issue where the minimum specs are described (ideally also in the main cardano-graphql readme), stating that the latest version needs huge I/O resources and that any regular configuration on AWS won't suffice even if CPU/RAM resources are well beyond the minimum, and recommending a good configuration like Vultr bare metal as you did here. I imagine I'm not the only one who has their whole infrastructure set up on AWS (given it's a pretty popular service used by many large companies too) and had it working well there until this update...
(To be honest I'm still surprised this is even such an issue, because I also have a Bitcoin node with ElectrumX on top, which has a many times larger blockchain to cope with, and it runs just fine on a cheaper configuration than this - with a larger hard disk of course, but still... We have a few other coins with similar setups too, like Dogecoin plus Blockbook, and I never encountered any I/O bottlenecks; until now all volumes were gp2 with default settings and I never had to tweak anything there. I'm curious then, is this sort of - expensive - resource demand really needed and justified? What will happen when Cardano gains more traction and grows many times larger? Will it still be feasible to even run such a service...?)
The bitcoin blockchain may be bigger in terms of bytes, but is probably not bigger in terms of number of blocks. BTC has a block every approx. 10 minutes, while Cardano has one every 20 seconds.
I suspect the system requirements of ElectrumX are significantly less than db-sync and PostgreSQL.
Regardless of all that, until we have a way of reproducing this issue, I cannot debug it.
I can give you access to the troubled install on the AWS server if that helps... You can find me in the member list of the Cardano Discord with the same username.
EDIT: Emailed you.
To be honest I'm still surprised this is even such an issue, because I also have a Bitcoin node with ElectrumX on top which has many times larger a blockchain
If IOPS is the issue, as it appears to be, then I'm not surprised; this is more about understanding how IOPS impacts database performance in general, and also how AWS and EBS work.
For a general overview for those who aren't aware: gp2 on AWS increases IOPS as you increase storage size, otherwise you need to provision enough IOPS yourself to get to the same level as a directly attached NVMe drive.
Depending on the storage type, AWS also uses quota systems, which might explain why it sometimes drops out. E.g., some EBS volumes give you credits to "burst" above the baseline for as long as you have credits. If Cardano has a large number of transactions to process, it could start eating into the credits, and if they run out, it would then start to lag.
Here's how AWS provisions their IOPS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
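For completeness, the burst-credit theory is straightforward to check from CloudWatch, which exposes a BurstBalance metric for gp2 volumes. A sketch assuming boto3 with valid AWS credentials; the region and volume ID are placeholders:

```python
# Check whether a gp2 volume has been draining its burst credits.
# BurstBalance is reported as a percentage (100 = full credits).
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Minimum']:.0f}% burst credits remaining")
```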
It looks like there hasn't been any activity on this issue for some time. We're closing it for now. If you're still experiencing this problem or have any updates, please reopen or create a new issue. Your feedback is valuable, thank you!
I am using the cardano-graphql Docker setup. The issue is that since I upgraded it to 5.0.0, the sync is no longer working properly.
The initial sync took ages, and now the situation is that it reports as fully synced but in reality it's always lagging ~80 blocks behind the tip, and the cardano-db-sync process is using 200% CPU (2 out of 4 cores) and ~9 GB RAM. Note that the remaining two cores are mostly idle. The system is on a server well beyond the minimum spec listed here. It's an r5a.xlarge AWS server, with Ubuntu 20, 4 CPU cores, 32 GB RAM and a 160 GB hard disk.
The logs for cardano-db-sync-extended seem to show normal operation (but if you compare the block numbers to a block explorer you'll see it's lagging behind). What can I do to diagnose and fix this?