Phala-Network / phala-blockchain

The Phala Network Blockchain, pRuntime and the bridge.
https://phala.network

pherry performance drop after last emergency update #554

Open darknight7 opened 3 years ago

darknight7 commented 3 years ago

I am experiencing extremely slow sync speed in pherry. If you need further details, just ask; I've tried to put all relevant information in this first post.

1) The node is running on a standalone 4-core machine. CPU usage never goes over 400%, so it looks fine to me. The connection is gigabit, and 2 out of 3 pherries displaying the problem are on the LAN.

2) Relevant section of docker-compose.yml:

 phala-pherry:
   image: ${PHERRY_IMAGE}
   container_name: phala-pherry
   hostname: phala-pherry
   depends_on:
     # - phala-node
     - phala-pruntime
   restart: always
   entrypoint:
     [
       "/root/pherry",
       "-r",
       "--parachain",
       "--mnemonic=${MNEMONIC}",
       "--substrate-ws-endpoint=ws://123.456.789.0:9945",
       "--collator-ws-endpoint=ws://123.456.789.0:9944",
       "--pruntime-endpoint=http://phala-pruntime:8000",
       "--operator=${OPERATOR}",
       "--fetch-blocks=512",
     ]
   logging:
     options:
       max-size: "10m"
       max-file: "3"

3) The images I'm running look fine to me:

REPOSITORY                              TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
phalanetwork/phala-pruntime             latest    sha256:141bb1f7a85351ead59f9e8e4c8a61b6250f2a227c5ed1689589ac1844116473   e5c2aa39f3a9   4 days ago     322MB
phalanetwork/phala-pherry               latest    sha256:abbda7358c64f515dbd7d9e5e7b1ec49380cce588549d2bd3ba8c88581d84ddb   20167d3c47d8   5 days ago     303MB
phalanetwork/khala-node                 latest    sha256:c41b2722ca78287aee30be2394c13878097dc13bf17bd3df6fa78d17600db880   4012617d3784   4 weeks ago    442MB
phalanetwork/phala-dev-pruntime-bench   latest    sha256:bc34f750e9001c8f6ff70bc833346553884a473ffe5453cca49a9b4c4d301ec8   79052c8adfe9   7 weeks ago    320MB
phalanetwork/phala-sgx_detect           latest    sha256:81cd734fe7819687651242f5ddcf5816e742c245b91d24fe61a6bf22b23bb549   618098047ae7   2 months ago   319MB

4) Some of the logs. In the beginning it was FAR slower, like one block every 10 seconds or worse, but I thought it was just the whole network coming back after the update.

phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry] get_block: Got block Some(8939007) hash 0x1772…0e2d
phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry] get_block: Got block Some(8939008) hash 0x958f…71c5
phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry] get_block: Got block Some(8939009) hash 0xf167…0bb6
phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry] sending a batch of 512 headers (last: 8939008, change: None)
phala-pherry      | [2021-11-04T11:21:02Z INFO  phactory_api::pruntime_client] Response: 200 OK
phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry]   ..sync_header: SyncedTo { synced_to: 8939008 }
phala-pherry      | [2021-11-04T11:21:02Z INFO  pherry] relaychain finalized paraheader number: 262691
phala-pherry      | [2021-11-04T11:21:05Z INFO  phactory_api::pruntime_client] Response: 200 OK
phala-pherry      | [2021-11-04T11:21:05Z INFO  pherry] ..req_sync_para_header: SyncedTo { synced_to: 262691 }
phala-pherry      | [2021-11-04T11:21:05Z INFO  pherry] get_block (w/changes): Got block Some(262459) hash 0x1abc…510a
phala-pherry      | [2021-11-04T11:21:06Z INFO  pherry] get_block (w/changes): Got block Some(262460) hash 0xc911…9de3
phala-pherry      | [2021-11-04T11:21:07Z INFO  pherry] get_block (w/changes): Got block Some(262461) hash 0xca43…509b
phala-pherry      | [2021-11-04T11:21:07Z INFO  pherry] get_block (w/changes): Got block Some(262462) hash 0x3189…a9c6
phala-pherry      | [2021-11-04T11:21:07Z INFO  pherry] get_block (w/changes): Got block Some(262463) hash 0x42fa…5b6e
phala-pherry      | [2021-11-04T11:21:08Z INFO  pherry] get_block (w/changes): Got block Some(262464) hash 0xf034…f2b2
phala-pherry      | [2021-11-04T11:21:09Z INFO  pherry] get_block (w/changes): Got block Some(262465) hash 0x248d…5f95
phala-pherry      | [2021-11-04T11:21:09Z INFO  pherry] get_block (w/changes): Got block Some(262466) hash 0x8b47…56dd
phala-pherry      | [2021-11-04T11:21:10Z INFO  pherry] get_block (w/changes): Got block Some(262467) hash 0x4a7a…325a
phala-pherry      | [2021-11-04T11:21:10Z INFO  pherry] get_block (w/changes): Got block Some(262468) hash 0x4674…ca00
phala-pherry      | [2021-11-04T11:21:11Z INFO  pherry] get_block (w/changes): Got block Some(262469) hash 0xb16d…1f77
phala-pherry      | [2021-11-04T11:21:11Z INFO  pherry] get_block (w/changes): Got block Some(262470) hash 0x7bc5…22ae
phala-pherry      | [2021-11-04T11:21:11Z INFO  pherry] get_block (w/changes): Got block Some(262471) hash 0xcccd…9ab4
phala-pherry      | [2021-11-04T11:21:12Z INFO  pherry] get_block (w/changes): Got block Some(262472) hash 0x6e79…119e
phala-pherry      | [2021-11-04T11:21:12Z INFO  pherry] get_block (w/changes): Got block Some(262473) hash 0x3b5d…8724
phala-pherry      | [2021-11-04T11:21:12Z INFO  pherry] get_block (w/changes): Got block Some(262474) hash 0x226e…659a
phala-pherry      | [2021-11-04T11:21:13Z INFO  pherry] get_block (w/changes): Got block Some(262475) hash 0x7b03…96c5
phala-pherry      | [2021-11-04T11:21:13Z INFO  pherry] get_block (w/changes): Got block Some(262476) hash 0x5886…3f34

5) sudo docker-compose ps

     Name                   Command                State             Ports
-----------------------------------------------------------------------------------
phala-node       /usr/bin/tini -- /bin/bash ...   Exit 143
phala-pherry     /root/pherry -r --parachai ...   Up
phala-pruntime   /usr/bin/tini -- /bin/bash ...   Up         0.0.0.0:8000->8000/tcp
investoppo commented 3 years ago

Any update on this? My node syncs at fewer than 20 blocks per minute.

h4x3rotab commented 3 years ago

I don't have an immediate solution. However, we used to see similar problems in the past, and they were solved by adding the following command line flags to the full node:

--state-cache-size 671088640 --db-cache 2048 --max-runtime-instances 16
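
A minimal sketch of where these flags could go, assuming the full node is launched by a command or entrypoint you control (the binary name and base path below are illustrative assumptions; only the last three flags come from this comment):

    # Illustrative node start command: everything except the last three
    # flags is an assumption about your setup; adapt it to your entrypoint.
    ./khala-node \
        --base-path /var/khala-dev-node \
        --state-cache-size 671088640 \
        --db-cache 2048 \
        --max-runtime-instances 16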

However, we have also seen a significant performance regression after the upgrade of jsonrpsee. We haven't done any performance testing so far, so it's probably worth creating a benchmark suite and running it in CI.

Re. this issue, could you help check the following?

  1. Check whether the network is the bottleneck: you can run a pherry and pruntime on the same machine and see if it syncs faster (a rough log-parsing sketch for comparing sync rates follows this list).
  2. Check the CPU usage of the machines running the node and pherry / pruntime.
  3. Check whether the same setup performs better on an older version of Substrate or pherry. However, v0.9.12 introduced a new database format, so you cannot really downgrade the current node to an older version, and it may take a few days just to sync another node... On the other hand, we should add a benchmark feature to pherry so that even if you only have an archive parachain node, we can still run the regression test. That way we can leave the relaychain node in pruned mode.
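
For comparing runs, here is a rough sketch for estimating the parachain sync rate from the pherry logs over the last ten minutes (the container name comes from the compose file above, and the grep pattern matches the "get_block (w/changes)" lines shown in this issue; adjust both to your setup):

    # Count parachain blocks fetched in the last 10 minutes and convert
    # the count to a blocks-per-minute figure.
    docker logs --since 10m phala-pherry 2>&1 \
      | grep -c 'get_block (w/changes)' \
      | awk '{ printf "%.1f blocks/min\n", $1 / 10 }'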
darknight7 commented 3 years ago

The node flags are correct.

A quick note: I would run pherry on the same machine as the node, but I'd need to run one copy for each miner, right? I have set up one pherry for testing and will report its performance. I can't run pruntime on the node machine because it's not SGX capable, and I can't run the bridge there either for other reasons, otherwise I would.

  1. I switched to a separate setup because the full node was taking too much CPU away from the miner. Sync was faster with all daemons on the same machine, but I switched about a month ago, and after the move to separate setups a full sync took longer yet still finished in less than 24 hours, even from a remote host. It did not take over a week, which is my current estimate for this sync round. Have the blockchains grown so much that the speed is the same but the time required is 10x? Unless network traffic has increased accordingly, I'd rule out the network. Average traffic on the full node is 4-5 Mbit up- and downstream, and ping is less than 0.4 ms, so I'm not too convinced it's a network problem right now, though it may be contributing.
  2. Node machine: uptime load 3.26, 4.08, 4.39; khala-node uses 270-370% CPU (4 cores). Pherry machine: uptime load 0.00 0.00 0.00; pherry never went above 20% CPU while I watched, staying under 1% most of the time.
  3. I can't comment on this; I don't feel I have enough knowledge.

Thanks for the support so far. I'll add the results of the pherry running locally on the full node machine tomorrow; it doesn't look any faster right now. For some reason it started only a bit behind where it was before. This suggests that the sync status is actually saved in pruntime and pherry is just a "middleman"? If so, my proposal about saving pherry status should be modified: pruntime should save its sync status on a clean exit instead, to avoid starting over.

h4x3rotab commented 3 years ago

Hmmm. I think there's one more thing to try: I assume you run the full node on one computer and pherry + pruntime on the worker computer. Could you try moving one pherry from the worker to the full node host? It may help distinguish whether the slowness is introduced by the WS connection (node - pherry) or the HTTP connection (pherry - pruntime).

investoppo commented 2 years ago

Hi, I am running a solo mining setup; what should I check and update? More than 48 hours after the restart, it is still syncing at fewer than 20 blocks per minute.

darknight7 commented 2 years ago

Ok, after ~36 hours of pherry running on the same host as the full node (on behalf of just one pruntime, of course) there's maybe a slight improvement, but nothing substantial. Of course I can (given time) set up a full stack (node + pherry + pruntime) on one of the machines currently running pruntime to see if it makes a difference, but I need to swap disks to get more storage space, etc. I can also try putting the node on more powerful hardware, which will again take some time. Still, before the upgrade even a 4-core NUC could sync very, very much faster, with pruntime mining simultaneously. Granted, the NUC is very recent hardware and the 4-core machine now running the node is older, but it's a desktop CPU, not a low-power one. A log snippet follows; as you can see, the speed has not changed much as far as I can tell.

pherry-hez    | [2021-11-07T13:14:01Z INFO  pherry] get_block (w/changes): Got block Some(199147) hash 0xba89…0557
pherry-hez    | [2021-11-07T13:14:06Z INFO  pherry] get_block (w/changes): Got block Some(199148) hash 0xe440…c281
pherry-hez    | [2021-11-07T13:14:13Z INFO  pherry] get_block (w/changes): Got block Some(199149) hash 0x48d2…b3cc
pherry-hez    | [2021-11-07T13:14:15Z INFO  pherry] get_block (w/changes): Got block Some(199150) hash 0x30a1…3812
pherry-hez    | [2021-11-07T13:14:18Z INFO  pherry] get_block (w/changes): Got block Some(199151) hash 0x4524…546e
pherry-hez    | [2021-11-07T13:14:20Z INFO  pherry] get_block (w/changes): Got block Some(199152) hash 0x109b…d278
pherry-hez    | [2021-11-07T13:14:21Z INFO  pherry] get_block (w/changes): Got block Some(199153) hash 0x60d0…73c4
pherry-hez    | [2021-11-07T13:14:22Z INFO  pherry] get_block (w/changes): Got block Some(199154) hash 0x882d…9197
pherry-hez    | [2021-11-07T13:14:22Z INFO  pherry] get_block (w/changes): Got block Some(199155) hash 0xd825…2cf7
pherry-hez    | [2021-11-07T13:14:23Z INFO  pherry] get_block (w/changes): Got block Some(199156) hash 0xd39e…8718
pherry-hez    | [2021-11-07T13:14:23Z INFO  pherry] get_block (w/changes): Got block Some(199157) hash 0xe218…ef0f
pherry-hez    | [2021-11-07T13:14:24Z INFO  pherry] get_block (w/changes): Got block Some(199158) hash 0x3508…93a8
pherry-hez    | [2021-11-07T13:14:29Z INFO  pherry] get_block (w/changes): Got block Some(199159) hash 0x8b01…2852
pherry-hez    | [2021-11-07T13:14:30Z INFO  pherry] get_block (w/changes): Got block Some(199160) hash 0x6d3e…482f
pherry-hez    | [2021-11-07T13:14:35Z INFO  pherry] get_block (w/changes): Got block Some(199161) hash 0xa3bd…dbf6
pherry-hez    | [2021-11-07T13:14:39Z INFO  pherry] get_block (w/changes): Got block Some(199162) hash 0xc017…82bf
pherry-hez    | [2021-11-07T13:14:43Z INFO  pherry] get_block (w/changes): Got block Some(199163) hash 0x2495…1d06
pherry-hez    | [2021-11-07T13:14:45Z INFO  pherry] get_block (w/changes): Got block Some(199164) hash 0x96c6…fa62
pherry-hez    | [2021-11-07T13:14:50Z INFO  pherry] get_block (w/changes): Got block Some(199165) hash 0x3e77…9930
pherry-hez    | [2021-11-07T13:14:53Z INFO  pherry] get_block (w/changes): Got block Some(199166) hash 0xdd03…fb81
pherry-hez    | [2021-11-07T13:14:56Z INFO  pherry] get_block (w/changes): Got block Some(199167) hash 0x52ae…afec
pherry-hez    | [2021-11-07T13:14:57Z INFO  pherry] get_block (w/changes): Got block Some(199168) hash 0xf4dc…fd9b
pherry-hez    | [2021-11-07T13:15:01Z INFO  pherry] get_block (w/changes): Got block Some(199169) hash 0xf79f…b782
pherry-hez    | [2021-11-07T13:15:04Z INFO  pherry] get_block (w/changes): Got block Some(199170) hash 0xb3e1…df04
pherry-hez    | [2021-11-07T13:15:09Z INFO  pherry] get_block (w/changes): Got block Some(199171) hash 0xfb25…2076
pherry-hez    | [2021-11-07T13:15:14Z INFO  pherry] get_block (w/changes): Got block Some(199172) hash 0xe497…b35b
pherry-hez    | [2021-11-07T13:15:16Z INFO  pherry] get_block (w/changes): Got block Some(199173) hash 0x19e3…0cfe
pherry-hez    | [2021-11-07T13:15:24Z INFO  pherry] get_block (w/changes): Got block Some(199174) hash 0x0397…7dd1
pherry-hez    | [2021-11-07T13:15:28Z INFO  pherry] get_block (w/changes): Got block Some(199175) hash 0xe963…9894
pherry-hez    | [2021-11-07T13:15:33Z INFO  pherry] get_block (w/changes): Got block Some(199176) hash 0xee26…5f9a
pherry-hez    | [2021-11-07T13:15:37Z INFO  pherry] get_block (w/changes): Got block Some(199177) hash 0xb173…0886
pherry-hez    | [2021-11-07T13:15:39Z INFO  pherry] get_block (w/changes): Got block Some(199178) hash 0xc571…9377
pherry-hez    | [2021-11-07T13:15:44Z INFO  pherry] get_block (w/changes): Got block Some(199179) hash 0x288a…5c4e
pherry-hez    | [2021-11-07T13:15:47Z INFO  pherry] get_block (w/changes): Got block Some(199180) hash 0xa020…2455
pherry-hez    | [2021-11-07T13:15:49Z INFO  pherry] get_block (w/changes): Got block Some(199181) hash 0x2ca7…a34a
pherry-hez    | [2021-11-07T13:15:54Z INFO  pherry] get_block (w/changes): Got block Some(199182) hash 0x269c…a365
pherry-hez    | [2021-11-07T13:15:59Z INFO  pherry] get_block (w/changes): Got block Some(199183) hash 0x5dd5…ed0d

I'm wondering if this could be a local problem, but I can't imagine the reason. Has no one else reported anything besides me and investoppo above?

niklasad1 commented 2 years ago

@h4x3rotab

However, we have also seen a significant performance regression after the upgrade of jsonrpsee. We haven't done any performance testing so far, so it's probably worth creating a benchmark suite and running it in CI.

You mean https://github.com/paritytech/jsonrpc/, right? If you find a regression, please report it so we know :)

h4x3rotab commented 2 years ago

Sorry guys, we have run out of resources and are now focusing on #500. When it's done, we will get back to investing more in this problem.

@niklasad1 Thanks. I'd love to take a closer look later and will definitely get back to you if we find anything.

darknight7 commented 2 years ago

No pressure at all, it's definitely lower priority than #500. Meanwhile I made another test: I moved the node to an 8-vCPU VM. Khala and Kusama sync was FAST, but unfortunately there was no significant increase in pherry sync performance :/

darknight7 commented 2 years ago

I've made the last missing test: vanilla solo mining with all three containers running on a single machine as they used to in the beginning. Again no noticeable difference. Now I feel I've done my part properly. :)

investoppo commented 2 years ago

Still no findings? My two solo mining CPUs are still syncing slowly. I believe it has been 12 days and they are still not fully synced. I decided to stop one CPU and hope that solves it (low chance).

darknight7 commented 2 years ago

@investoppo by any chance: is your /var/khala-dev-node on a mechanical HDD?

I can't run a test to prove this theory until I buy a new large SSD, but I think this is what happened: when the last emergency patch was released my 512 GB disks were choking, so I moved the node onto the only hardware I had, a mechanical HDD that I put in a USB 3 enclosure. Plenty of bandwidth, but... maybe critically slow seek times, unable to keep up with the requests.

I had this epiphany this morning, having put one of the miners in my home network to see what would happen. I left it syncing during the night and hadn't even logged in until now, but when I woke up I could clearly hear the HDD struggling under heavy seek load. This may be the reason for this whole issue, but I can't prove it right now.
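
One hedged way to check this while the node is syncing, assuming the sysstat package is installed (sda is just an example device name; use whichever disk holds the blockchain data):

    # Sample extended disk statistics every 5 seconds on the node host.
    iostat -x 5 sda
    # A %util close to 100 combined with a high await (or r_await/w_await,
    # average I/O wait in ms) points at seek-bound storage rather than
    # CPU or network limits.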

h4x3rotab commented 2 years ago

Bandwidth is not the problem; an HDD adds very high latency. I'm not sure whether USB makes it worse, but random access performance is critical when syncing the node.

investoppo commented 2 years ago

I am not using an external HDD (via USB), but it's not an SSD either. I doubt the HDD is the problem, as it was working fine since the beginning; the issue only appeared after the last update. I seriously wonder why only we encounter this issue and not others.

darknight7 commented 2 years ago

What I can tell is that, leaving everything else intact but with the blockchain data now on an SSD, the logs look like this:

phala-pherry      | [2021-11-18T09:04:11Z INFO  pherry] get_block (w/changes): Got block Some(437930) hash 0xca6d…57c8
phala-pherry      | [2021-11-18T09:04:12Z INFO  pherry] get_block (w/changes): Got block Some(437931) hash 0xbd94…a396
phala-pherry      | [2021-11-18T09:04:12Z INFO  pherry] get_block (w/changes): Got block Some(437932) hash 0xfdcb…f8c6
phala-pherry      | [2021-11-18T09:04:13Z INFO  pherry] get_block (w/changes): Got block Some(437933) hash 0xc3b4…fc94
phala-pherry      | [2021-11-18T09:04:13Z INFO  pherry] get_block (w/changes): Got block Some(437934) hash 0xb176…c3c2
phala-pherry      | [2021-11-18T09:04:13Z INFO  pherry] get_block (w/changes): Got block Some(437935) hash 0xefeb…be23
phala-pherry      | [2021-11-18T09:04:14Z INFO  pherry] get_block (w/changes): Got block Some(437936) hash 0x8bf9…6dc0
phala-pherry      | [2021-11-18T09:04:14Z INFO  pherry] get_block (w/changes): Got block Some(437937) hash 0x7133…dc00
phala-pherry      | [2021-11-18T09:04:14Z INFO  pherry] get_block (w/changes): Got block Some(437938) hash 0xc4c9…d798
phala-pherry      | [2021-11-18T09:04:15Z INFO  pherry] get_block (w/changes): Got block Some(437939) hash 0x58c7…cd9b
phala-pherry      | [2021-11-18T09:04:15Z INFO  pherry] get_block (w/changes): Got block Some(437940) hash 0x229c…3d2d
phala-pherry      | [2021-11-18T09:04:16Z INFO  pherry] get_block (w/changes): Got block Some(437941) hash 0xe085…227a
phala-pherry      | [2021-11-18T09:04:16Z INFO  pherry] get_block (w/changes): Got block Some(437942) hash 0x2101…8f93
phala-pherry      | [2021-11-18T09:04:17Z INFO  pherry] get_block (w/changes): Got block Some(437943) hash 0x21da…281b
phala-pherry      | [2021-11-18T09:04:17Z INFO  pherry] get_block (w/changes): Got block Some(437944) hash 0xa77d…0920
phala-pherry      | [2021-11-18T09:04:17Z INFO  pherry] get_block (w/changes): Got block Some(437945) hash 0x6b3b…33d7
phala-pherry      | [2021-11-18T09:04:18Z INFO  pherry] get_block (w/changes): Got block Some(437946) hash 0x2895…f90d
phala-pherry      | [2021-11-18T09:04:18Z INFO  pherry] get_block (w/changes): Got block Some(437947) hash 0x72e7…28f6
phala-pherry      | [2021-11-18T09:04:19Z INFO  pherry] get_block (w/changes): Got block Some(437948) hash 0x8268…cd71
phala-pherry      | [2021-11-18T09:04:19Z INFO  pherry] get_block (w/changes): Got block Some(437949) hash 0x39c7…0f0f
phala-pherry      | [2021-11-18T09:04:20Z INFO  pherry] get_block (w/changes): Got block Some(437950) hash 0xd1f1…575b
phala-pherry      | [2021-11-18T09:04:20Z INFO  pherry] get_block (w/changes): Got block Some(437951) hash 0xb456…9347

Honestly it's not as fast as I'd want it to be, but it's still an order of magnitude faster than before. Based on this I would also recommend adding a serious warning to the documentation about using HDDs to store the blockchain.