Consensys / web3signer

Web3Signer is an open-source signing service capable of signing on multiple platforms (Ethereum 1 and 2, Filecoin) using private keys stored in an external vault or encrypted on disk.
https://docs.web3signer.consensys.net/
Apache License 2.0

New healthcheck status for keys loading okay too early #751

Closed. ll873 closed this issue 1 year ago.

ll873 commented 1 year ago

While testing https://github.com/ConsenSys/web3signer/pull/738 I've noticed that the healthcheck endpoint returns 200 too early. Specifically, it starts returning 200 right after the following log lines:

2023-04-10 18:28:35.528+00:00 | ForkJoinPool-2-worker-1 | INFO  | SignerLoader | Total configuration metadata files processed: 35000
2023-04-10 18:28:35.528+00:00 | ForkJoinPool-2-worker-1 | INFO  | SignerLoader | Total signers loaded from configuration files: 35000 in 00:00:41.554 with error count 0

At this point, however, the metrics endpoint still (correctly) reports 0 keys loaded for the metric signing_signers_loaded_count.

This metric only gets updated after the following log lines:

2023-04-10 19:20:13.896+00:00 | pool-2-thread-1 | INFO  | RegisteredValidators | Validators registered successfully in database:3500
2023-04-10 19:20:14.389+00:00 | pool-2-thread-1 | INFO  | DefaultArtifactSignerProvider | Total signers (keys) currently loaded in memory: 35000

This is an issue because if signing requests are sent to Web3Signer as soon as the healthcheck returns 200, it will not yet have the keys to sign with.
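For reference, a minimal readiness-gate sketch in Python of the workaround implied here: gate traffic on the signing_signers_loaded_count metric rather than on the healthcheck alone. The port, path, and expected key count below are assumptions about a typical deployment, not guaranteed Web3Signer defaults.

import re
import time
import urllib.error
import urllib.request

METRICS_URL = "http://localhost:9001/metrics"  # assumed metrics port; adjust if yours differs
EXPECTED_KEYS = 35000                          # number of keys the deployment should load

def signers_loaded() -> int:
    # Scrape the Prometheus text output and extract signing_signers_loaded_count.
    try:
        body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    except urllib.error.URLError:
        return 0  # metrics server not reachable yet
    match = re.search(r"^signing_signers_loaded_count\s+(\d+)", body, re.MULTILINE)
    return int(match.group(1)) if match else 0

# Block until every key is actually usable, not merely until the healthcheck says 200.
while signers_loaded() < EXPECTED_KEYS:
    time.sleep(5)
print("all keys loaded; safe to route signing traffic")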

I'm not sure if it's related, but I've also noticed that Web3Signer sometimes just hangs between the two sets of log messages and never finishes loading the keys.

I am using Web3Signer with HashiCorp Vault as the key storage.

non-fungible-nelson commented 1 year ago

@usmansaleem can you take a look at this?

usmansaleem commented 1 year ago

@ll873 Your observation is correct: we are registering the healthcheck endpoint very early, which results in it starting to return 200 as soon as the server is up and the keys have been loaded from the configuration files. Perhaps the correct approach is to register (or enable) this endpoint as the very last step; until then it would return 404 while being probed by Kubernetes or other orchestration frameworks. Let me know if this is the desired behavior.
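If the endpoint were registered as the last step, an external probe would see 404 (or a refused connection) until loading completes. A rough sketch, assuming the default HTTP port and a /healthcheck path, of how a probe could interpret those states:

import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:9000/healthcheck"  # assumed port and path

def ready() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # e.g. 404 while the endpoint is not yet registered
    except urllib.error.URLError:
        return False  # connection refused: server not up at all

while not ready():
    time.sleep(2)
print("Web3Signer reports ready")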

As for the second issue, we have an open ticket to improve HashiCorp loading: we currently attempt to open a new connection to HashiCorp for each configuration file instead of sharing a connection for the same credentials. Fixing that might solve the lagging issue with HashiCorp. The other possibility is to introduce a "bulk load" mode for HashiCorp similar to AWS and Azure. If your scenario resembles the bulk-load mode, can you create a GitHub issue so that we can prioritise it?

usmansaleem commented 1 year ago

Let me try to reproduce it at my end; we are supposed to start Web3Signer only "after" everything is loaded (or has failed to load).

ll873 commented 1 year ago

@usmansaleem My expectation is that when the server returns 200 on the healthcheck endpoint, it can respond to signing requests. That would let us use it for Kubernetes readinessProbes or AWS load balancer checks.

From what I've observed, it can only do so after the keys are loaded into memory.

usmansaleem commented 1 year ago

@ll873 I wanted to update you regarding this ticket. After detailed testing, I can confirm that Web3Signer only starts accepting signing, healthcheck, and ready requests after it has processed and loaded all the keys. The metrics server is a different process and actually starts quite early during the initialisation steps. We are considering moving the metrics server start to after Web3Signer starts accepting connections. I have set up a docker compose file and a shell script for my own testing: https://github.com/usmansaleem/signers_docker_compose/tree/main/web3signer-hashicorp

Regarding the issue with large numbers of HashiCorp configuration files, I am working on enhancing that path and will update you when I have further results.

usmansaleem commented 1 year ago

Closing the ticket; please re-open if further clarification is required.

ll873 commented 1 year ago

@usmansaleem Sorry for the late reply. What you're seeing is the exact opposite of what I'm seeing: I see a 200 response before the keys are fully loaded into memory, along with multiple failed attempts to sign.

This situation gets worse when Web3Signer is under memory pressure.

We have had to completely stop using the healthcheck endpoint on startup because of this issue; with 35000 keys it leads to several hundred failed signings.

Is it possible that when using Vault as the backend the keys are loaded later?

usmansaleem commented 1 year ago

@ll873 You are probably observing failures in loading such a large number of keys. I would recommend retrying your setup once https://github.com/ConsenSys/web3signer/pull/761 is merged. The Web3Signer server only starts once all the keys have been loaded (with or without failures).

ll873 commented 1 year ago

@usmansaleem But that doesn't explain why the healthcheck endpoint returns 200 and reports all the keys as loaded when it's not able to use them. If it fails to load keys, it should report them as failed.

usmansaleem commented 1 year ago

Try it with the 'develop' tag, which contains the enhanced HashiCorp handling logic. The new healthcheck endpoint is only meant to return 200 if all keys are loaded successfully.
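A self-contained sketch for verifying that behaviour: wait for the first 200 from the healthcheck, then assert that the metric discussed earlier in this thread already reports the full key count. Ports, paths, and the expected count are assumptions about a typical deployment.

import re
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:9000/healthcheck"  # assumed port and path
METRICS_URL = "http://localhost:9001/metrics"     # assumed metrics port
EXPECTED_KEYS = 35000

def health_ok() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # also covers HTTPError (404/503), a subclass of URLError

def keys_loaded() -> int:
    try:
        body = urllib.request.urlopen(METRICS_URL, timeout=2).read().decode()
    except urllib.error.URLError:
        return 0
    m = re.search(r"^signing_signers_loaded_count\s+(\d+)", body, re.MULTILINE)
    return int(m.group(1)) if m else 0

# On the fixed build, the first 200 from the healthcheck should coincide
# with the metric reporting the full key count.
while not health_ok():
    time.sleep(1)
loaded = keys_loaded()
assert loaded >= EXPECTED_KEYS, f"healthcheck is 200 but only {loaded} keys loaded"
print("healthcheck and key count agree")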