emeraldpay / dshackle

Fault Tolerant Load Balancer for Ethereum and Bitcoin APIs
Apache License 2.0

Basic Fault Tolerance is not working #239

Closed syspulse closed 1 year ago

syspulse commented 1 year ago

I have two upstream nodes configured as primary and secondary/fallback. When the primary node is unavailable or fails at the network level, the secondary node is not tried.

Docker: emeraldpay/dshackle:0.13.1

Request: curl -i localhost:8545/ethereum -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x01", true],"id":100}'

Log:

2023-01-05 14:06:08.196 | WARN  |       CompoundReader | Failed to read from io.emeraldpay.dshackle.upstream.ethereum.EthereumDirectReader$2@5f9adf81
reactor.core.Exceptions$RetryExhaustedException: Retries exhausted: 3/3

....

Caused by: io.emeraldpay.dshackle.upstream.rpcclient.JsonRpcException: op-2: Name or service not known

Configuration:

version: v1

proxy:
  host: 0.0.0.0
  port: 8545
  routes:
    - id: ethereum
      blockchain: ethereum

cluster:
  upstreams:
    - id: op1
      chain: ethereum
      role: primary
      options:
        disable-validation: true
      connection:
        ethereum:
          rpc:
            url: "http://op-2:8545"
    - id: op2
      chain: ethereum
      role: fallback
      options:
        disable-validation: true
      connection:
        ethereum:
          rpc:
            url: "https://mainnet.optimism.io"

Expected behavior: After the primary upstream fails, the secondary/fallback node is tried.

splix commented 1 year ago

op-2: Name or service not known. Can you please check that Dshackle can resolve the hostname op-2?

syspulse commented 1 year ago

It cannot, on purpose; that is the whole point. I'm testing a topology where the primary is not available due to networking issues.

splix commented 1 year ago

I have identified the root of the issue.

When you set disable-validation: true, it essentially instructs Dshackle to always consider the upstream available. This disables all checks, making it a potentially dangerous and breaking option by itself. It also seems to be incompatible with the primary/fallback roles, as Dshackle will always attempt to use the primary, believing it never fails (even if the host cannot be resolved).

I am uncertain how to address this issue, as it is essentially the expected behavior. However, for problems like an invalid hostname, it seems illogical.

May I ask why you want to disable the validation? Is it because Dshackle is unable to validate Optimism? I have never tried using it with Optimism, so I am unsure if there are any differences. But maybe I can add an alternative option(s) to ensure compatibility with Optimism.
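The interaction described in this comment can be illustrated with a small sketch. This is not Dshackle's actual code: the `Upstream` class, the `probe` callback, and `pick_upstream` are hypothetical names chosen for illustration. It only shows why an upstream that always reports itself as OK prevents a primary/fallback selector from ever reaching the fallback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Upstream:
    """Hypothetical model of an upstream; not a Dshackle type."""
    id: str
    role: str                     # "primary" or "fallback"
    probe: Callable[[], bool]     # assumed health check; True means reachable
    disable_validation: bool = False

    def is_available(self) -> bool:
        # With validation disabled the probe is never consulted,
        # so the upstream is reported as OK no matter what.
        if self.disable_validation:
            return True
        return self.probe()


def pick_upstream(upstreams: List[Upstream]) -> Optional[Upstream]:
    """Prefer an available primary; use a fallback only if no primary is available."""
    for role in ("primary", "fallback"):
        for upstream in upstreams:
            if upstream.role == role and upstream.is_available():
                return upstream
    return None


def build_cluster(disable_validation: bool) -> List[Upstream]:
    """Mirror the reported setup: op1 cannot be resolved, op2 is reachable."""
    return [
        Upstream("op1", "primary", lambda: False, disable_validation),
        Upstream("op2", "fallback", lambda: True, disable_validation),
    ]


if __name__ == "__main__":
    print(pick_upstream(build_cluster(disable_validation=True)).id)
    print(pick_upstream(build_cluster(disable_validation=False)).id)
```

In this sketch, the selector keeps choosing `op1` while validation is disabled and switches to `op2` once the probe is consulted, which matches the behavior reported in this issue.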

syspulse commented 1 year ago

I was under the impression that disable-validation was about checking whether a node is synced. I actually don't need the nodes to be synced and checked with RPC calls.

I removed disable-validation and configured geth instances, and it seems to be working, with this behavior:

1) The first request always takes quite long, and I see this in the log: Multistream | State of ETH: height=17181425, status=[UNAVAILABLE/2], lag=[0, 0], weak=[op1, op2].

2) When the primary node shuts down (clean TCP FIN), the request is still retried and there is a noticeable delay. The secondary node is always available.

How do I disable the in-memory cache?

splix commented 1 year ago

disable-validation is more general; it is basically a shortcut that marks the upstream as always OK.

The first request seems to be slow because [UNAVAILABLE/2] so it waits until one of the upstreams becomes available. I think the second case is a similar issue: Dshackle doesn't immediately learn that the upstream is down and keeps trying it for some time. I'm going to make a new release in the following week, and with the new release the process of discovering failed upstreams should be smoother.

The in-memory cache cannot be disabled at the moment, as it's needed for some internal operations, though I think I can come up with some options to tune it. Is your main concern the memory used for caching?

syspulse commented 1 year ago

> The first request seems to be slow because [UNAVAILABLE/2] so it waits until one of the upstreams becomes available.

I start with all nodes available immediately (I can see successful checks for the latest blocks on the network). I am not sure why it takes so long for them to be marked as available. The second case I can understand. I just hoped it would work like fast fault-tolerant load balancing: no connection - go to the next, retry the connection in the background; timeout - go to the next, don't exhaust retries ;-)
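The fast-failover behavior hoped for above could look roughly like the following sketch. It is an illustration only and does not reflect how Dshackle is implemented: `FailoverClient`, the `send` callables, and the `cooldown` parameter are all hypothetical. Instead of a real background reconnect, it uses a simpler cooldown window, during which a failed upstream is skipped, before it is tried again.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical transport type: takes a JSON-RPC request dict, returns a response dict.
Send = Callable[[dict], dict]


class FailoverClient:
    """Try upstreams in order; on a connection error or timeout move on immediately."""

    def __init__(self, upstreams: List[Tuple[str, Send]],
                 cooldown: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.upstreams = upstreams          # ordered: primary first, fallbacks after
        self.cooldown = cooldown            # seconds a failed upstream is skipped
        self.clock = clock                  # injectable clock, useful for testing
        self.down_until: Dict[str, float] = {}

    def call(self, request: dict) -> Tuple[str, dict]:
        now = self.clock()
        errors = []
        for name, send in self.upstreams:
            if self.down_until.get(name, 0.0) > now:
                continue                    # failed recently: skip without retrying
            try:
                response = send(request)
            except (ConnectionError, TimeoutError) as err:
                # No in-place retries: mark as down and go to the next upstream.
                self.down_until[name] = now + self.cooldown
                errors.append(f"{name}: {err}")
                continue
            return name, response
        raise RuntimeError("no upstream available: " + "; ".join(errors))
```

With this approach only the first request after a failure pays for the failed attempt; later requests go straight to the next upstream until the cooldown expires. A real load balancer would likely probe failed upstreams in the background rather than on the request path.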