Cryptonomic / Conseil

Query API and indexer for Tezos and other decentralized platforms.
Apache License 2.0

Lorre is extremely aggressive on the tezos-node #950

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi, I have a tezos-node (mainnet) with Lorre and the Conseil API connected to it. I do not use your Docker images; I run the processes myself.

My problem is that when Lorre is running I can barely connect to the tezos-node anymore, e.g. it can take 20 to 60 seconds to check /chains/main/blocks/head:

 time curl -s --noproxy "*" --connect-timeout 60 --max-time 60 -X GET -H 'Content-Type: application/json' 'http://*****/chains/main/blocks/head' | jq '.header.level'
1239159

real    0m0.050s
user    0m0.008s
sys     0m0.005s

time curl -s --noproxy "*" --connect-timeout 60 --max-time 60 -X GET -H 'Content-Type: application/json' 'http://*****/chains/main/blocks/head' | jq '.header.level'
1239163

real    0m20.426s
user    0m0.008s
sys     0m0.013s
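
For reference, a loop like the one below samples the head latency repeatedly (same masked node URL as above). This is just a sketch: curl's built-in %{time_total} timer prints the total seconds per request, which avoids the time/jq wrapping.

# prints the total request time in seconds, once per iteration
for i in $(seq 1 10); do
  curl -s --noproxy "*" --max-time 60 -o /dev/null \
       -w '%{time_total}\n' \
       'http://*****/chains/main/blocks/head'
  sleep 1
done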

My current Lorre config looks like this:

platforms: [ {
  name: tezos
  network: mainnet
  enabled: true
  node: {
    protocol: "http"
    hostname: "****"
    port: *****
    pathPrefix: ""
  }
  }
]

lorre {

  request-await-time: 120 s
  get-response-entity-timeout: 90s
  post-response-entity-timeout: 1s

  sleep-interval: 5 s
  bootup-retry-interval: 10 s
  bootup-connection-check-timeout: 10 s
  fee-update-interval: 20
  fees-average-time-window: 3600
  depth: newest
  chain-events: []
  block-rights-fetching: {
    init-delay: 2 minutes
    interval: 60 minutes
    cycles-to-fetch: 5
    cycle-size: 4096
    fetch-size: 200
    update-size: 16
    enabled: true
  }

  batched-fetches {
    account-concurrency-level: 5
    block-operations-concurrency-level: 10
    block-page-size: 500
    block-page-processing-timeout: 1 hour
    account-page-processing-timeout: 15 minutes
    delegate-page-processing-timeout: 15 minutes
  }

  db {
    dataSourceClass: "org.postgresql.ds.PGSimpleDataSource"
    properties {
      user: "***********"
      password: "***********"
      url: "jdbc:postgresql://************"
    }
  }

}

akka {
  tezos-streaming-client {
    max-connections: 10
    max-open-requests: 512
    idle-timeout: 10 minutes
    pipelining-limit: 7
    response-entity-subscription-timeout: 15 seconds
  }
  tezos-dispatcher {
    type: "Dispatcher"
    executor: "thread-pool-executor"
    throughput: 1

    thread-pool-executor {
      fixed-pool-size: 16
    }
  }

  http {
    server {
      request-timeout: 5 minutes
      idle-timeout: 5 minutes
    }
  }
}

I built Lorre from the master branch today.

What can I do to make it less aggressive?

ivanopagano commented 3 years ago

You can start by halving a couple of values in the akka.tezos-streaming-client section.

Try something like:

max-connections: 5 # <- half the number of concurrent open connections
max-open-requests: 512
idle-timeout: 10 minutes
pipelining-limit: 7
response-entity-subscription-timeout: 15 seconds

This should roughly halve the number of in-flight requests, since Lorre will be using fewer connections.

What I don't know for sure is why your tezos node should have less capacity than the one we use in our Docker setup. Unless the node can auto-tune based on available system resources? Did you set any custom configuration when running the tezos node?
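
If you want to double-check whether Lorre actually respects the lowered max-connections, one rough way (only a sketch, and 8732 is an assumption for your node's RPC port) is to count the established TCP connections towards that port while Lorre is running:

# counts established connections to the node's RPC port every 2 seconds;
# replace 8732 with the actual RPC port if it differs
watch -n 2 "ss -tn state established '( dport = :8732 or sport = :8732 )' | tail -n +2 | wc -l"

If the count stays well above the configured max-connections, that would point at something other than this setting.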

ghost commented 3 years ago

@ivanopagano Thank you for the advice - I am testing it now. I run the tezos-node like this:

tezos-node run -v --history-mode=archive --data-dir=/tezos --network=mainnet --rpc-addr=0.0.0.0:8732 --config-file=mainnet.json --connections=5

where mainnet.json contains:

{
  "data-dir": "/tezos",
  "p2p": {
    "bootstrap-peers": [
      "boot.tzbeta.net",
      "dubnodes.tzbeta.net:9732",
      "franodes.tzbeta.net:9732",
      "sinnodes.tzbeta.net:9732",
      <... many more peers ...>
    ],
    "listen-addr": "[::]:9732"
  }
}

ghost commented 3 years ago

@ivanopagano I tried it like this:

akka {
  tezos-streaming-client {
    max-connections: 3
    max-open-requests: 256
    idle-timeout: 10 minutes
    pipelining-limit: 7
    response-entity-subscription-timeout: 15 seconds
  }
  tezos-dispatcher {
    type: "Dispatcher"
    executor: "thread-pool-executor"
    throughput: 1

    thread-pool-executor {
      fixed-pool-size: 16
    }
  }

  http {
    server {
      request-timeout: 5 minutes
      idle-timeout: 5 minutes
    }
  }
}

And it did not improve the situation. Any other ideas?

ghost commented 3 years ago

I have experimented a little more. First I lowered the akka values further:

akka {
  tezos-streaming-client {
    max-connections: 3
    max-open-requests: 128
    idle-timeout: 10 minutes
    pipelining-limit: 7
    response-entity-subscription-timeout: 15 seconds
  }
  tezos-dispatcher {
    type: "Dispatcher"
    executor: "thread-pool-executor"
    throughput: 1

    thread-pool-executor {
      fixed-pool-size: 16
    }
  }

  http {
    server {
      request-timeout: 5 minutes
      idle-timeout: 5 minutes
    }
  }
}

It still did not make a tangible difference. As a workaround I started a second tezos-node on the same machine, so now there is one node that I query directly and one that is dedicated to Conseil. This way it works and I get quick responses again. What I take from this is that it is not a hardware/IO/network issue, since running more processes works better than running fewer. It seems to me that either the tezos-node has something like a "max-rpc-calls-per-second" limit, or Conseil ignores my akka config?
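
For the record, the workaround is just a second node instance that only Conseil/Lorre talks to. The data directory, RPC port and config file below are placeholders (not my real values), and the second instance also needs its own p2p listen-addr in its config so the two nodes do not clash:

tezos-node run -v --history-mode=archive --data-dir=/tezos-conseil \
  --network=mainnet --rpc-addr=0.0.0.0:8733 \
  --config-file=mainnet-conseil.json --connections=5

Lorre's platforms.node section then points at the second node's RPC port, while my own queries keep using the original node.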

ghost commented 3 years ago

Hi there, any news on this? Do you have a suggestion for what to do?

ghost commented 3 years ago

Hi, any idea what I should do? The advice I received did not have any effect; Conseil keeps paralyzing the tezos-node.

jun0tpyrc commented 3 years ago

I have a docker-compose setup of Conseil + PostgreSQL + tezos-node running for mainnet. Most tunings did not help much until I decided to scale my instance up to an 8-core, 32 GB memory one with a fast gp3 disk on AWS, which together seems to have solved the IO bottleneck for me.

vishakh commented 3 years ago

Please try the latest release and let us know how it looks. There is improved logging, so it should be easier to identify the root cause.

https://github.com/Cryptonomic/Conseil/releases/tag/2021-january-release-35

vishakh commented 3 years ago

@g574 @jun0tpyrc Please see the above comment about the latest release.