Closed: CurralesDragon closed this issue 1 year ago
It seems that any time the pod is stopped and reinstalled, it does not use the secret with the validator keys and instead generates new keys, which are not stored.
Is there a solution to either recover the keys, or roll back the chain to before voting the new unusable nodes?
Hello @nicwhitts
Please refer to the docs https://docs.goquorum.consensys.net/tutorials/kubernetes/production or https://besu.hyperledger.org/development/private-networks/how-to/deploy/kubernetes depending on your client. You still need to follow best practices for keeping secrets safe, either in a vault (or equivalent) or encrypted in some form of source control.
Keys are one-way only, which guarantees safety: once the keys are deleted or lost, they cannot be recovered. If you are in the cloud, AWS and Azure offer Secrets Manager and Key Vault respectively, and the charts integrate with both services. Both have capabilities for soft deletes, i.e. previous versions are kept for events like these.
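For example, recovering an older or soft-deleted secret usually looks something like this (the secret and vault names are placeholders, and the exact commands depend on how the secrets were created):

```bash
# AWS Secrets Manager: cancel a scheduled deletion, or read the previous version
aws secretsmanager restore-secret --secret-id besu-node-validator-1-nodekey
aws secretsmanager get-secret-value --secret-id besu-node-validator-1-nodekey --version-stage AWSPREVIOUS

# Azure Key Vault: recover a soft-deleted secret
az keyvault secret recover --vault-name my-quorum-vault --name besu-node-validator-1-nodekey
```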
For validators, we have safe defaults to prevent keys from being deleted: https://github.com/Consensys/quorum-kubernetes/blob/master/helm/values/validator.yml#L5.
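As a rough sketch (the release name, namespace, and chart path below are assumptions, and value paths can vary between chart versions), you can check what an installed release is using and pin the flag explicitly:

```bash
# Show the values an installed validator release is actually using
helm get values validator-1 -n quorum --all | grep -i -A 3 quorumFlags

# Keep keys on delete/reinstall (this is already the chart default)
helm upgrade validator-1 ./charts/besu-node -n quorum \
  --reuse-values \
  --set quorumFlags.removeKeysOnDelete=false
```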
I'd recommend using a staging environment with n clusters and getting things working there, including scenarios like the above, and working through adding/removing validators, data backup, monitoring, etc. The most important thing is to plan it out on paper first: https://docs.goquorum.consensys.net/tutorials/kubernetes/production#best-practices
Hi, thanks for the information.
I did run this through a staging environment and everything seemed to be working fine. (Everything connects, but I didn't test restarting pods without cloud native services. I've been running this for 2 years with cloud native services without any issues, and restarting pods always works there.)
Just to clarify: the keys are persisted. The issue is that when cloud native services are not enabled, the validators ignore the secret when they start and generate new keys.
Specifically, when I was adding the new validators, I used the config map specifying the address <….validator-1-address>, but this is different every time the pod is restarted.
When the node is restarted, a new address is generated, leaving me with fewer validator nodes than the 2/3 requirement for liveness.
Hello @nicwhitts
I'm not sure what changes you have made to the repo to get your setup working or what your environment is, so I cannot comment on that. It is good that you have had it running for 2 years. The intent of these charts is to be run in the cloud. If you are running with cloudNativeServices: False, the keys will persist as long as Values.quorumFlags.removeKeysOnDelete: False is set.
> Just to clarify: the keys are persisted.
I'm not sure what the issue is here either; the title says the keys are lost, but your previous post says they persist, which is in line with the default behaviour of the charts.
> Specifically, when I was adding the new validators, I used the config map specifying the address <….validator-1-address>
Can you elaborate on this please? How are you adding new validators, and how is the client getting the secret and using it? Have you verified that the validator keys you are providing are mounted to the system where you expect?
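Something along these lines can confirm that (the secret, pod, and mount path names are assumptions; check your own release for the real ones):

```bash
# Does the secret exist and hold the expected key material?
kubectl -n quorum get secret besu-node-validator-1-keys -o yaml

# Is the key mounted where the client expects it?
kubectl -n quorum exec besu-node-validator-1-0 -- ls -l /secrets
kubectl -n quorum exec besu-node-validator-1-0 -- cat /secrets/nodekey
```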
> When the node is restarted, a new address is generated, leaving me with fewer validator nodes than the 2/3 requirement for liveness.
This is normal behaviour for Besu and GoQuorum: unless you provide the CLI arg telling it which key to use, or move the key to the default key location, it will create a new key.
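As a sketch of what that CLI arg looks like (the /keys/nodekey path is only an assumed mount location, not necessarily where the charts put it):

```bash
# Besu: point the node at the mounted key instead of letting it generate one
besu --data-path=/data --node-private-key-file=/keys/nodekey

# GoQuorum (geth): the validator identity is derived from the node key
geth --datadir /data --nodekey /keys/nodekey
```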
Please refer to https://docs.goquorum.consensys.net/tutorials/private-network/adding-removing-ibft-validators#docusaurus_skipToContent_fallback to add or remove validators,
and the k8s config for the validators is at https://docs.goquorum.consensys.net/tutorials/kubernetes/deploy-charts#5-deploy-the-validators
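For illustration, the voting step over JSON-RPC looks roughly like this (the endpoint and address are placeholders, and the API module depends on your client and consensus algorithm, so verify against the docs above):

```bash
# GoQuorum (IBFT/QBFT): propose adding (true) or removing (false) a validator
curl -s -X POST http://validator-1-rpc:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"istanbul_propose","params":["0x<validator-address>", true],"id":1}'

# Besu (QBFT): the equivalent vote
curl -s -X POST http://validator-1-rpc:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"qbft_proposeValidatorVote","params":["0x<validator-address>", true],"id":1}'
```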
> Hello @nicwhitts
>
> I'm not sure what changes you have made to the repo to get your setup working or what your environment is, so I cannot comment on that. It is good that you have had it running for 2 years. The intent of these charts is to be run in the cloud. If you are running with cloudNativeServices: False, the keys will persist as long as Values.quorumFlags.removeKeysOnDelete: False is set.
>
> > Just to clarify: the keys are persisted.
>
> I'm not sure what the issue is here either; the title says the keys are lost, but your previous post says they persist, which is in line with the default behaviour of the charts.
So the key secret is persisted when first run, but any time the pod is stopped and restarted, the validator address is different.
> > Specifically, when I was adding the new validators, I used the config map specifying the address <….validator-1-address>
>
> Can you elaborate on this please? How are you adding new validators, and how is the client getting the secret and using it? Have you verified that the validator keys you are providing are mounted to the system where you expect?
I am using no modifications from the current charts in the repo.
> > When the node is restarted, a new address is generated, leaving me with fewer validator nodes than the 2/3 requirement for liveness.
>
> This is normal behaviour for Besu and GoQuorum: unless you provide the CLI arg telling it which key to use, or move the key to the default key location, it will create a new key.
This is what I was not expecting, because with AWS the keys are pulled from Secrets Manager as expected and the validator address therefore persists.
I was expecting the same behaviour when cloud native services are disabled.
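A quick way to narrow this down might be to compare the key held in the secret with whatever the restarted pod is actually reading, for example (secret, pod, and path names are assumptions):

```bash
# Key material stored in the Kubernetes secret
kubectl -n quorum get secret besu-node-validator-1-keys -o jsonpath='{.data.nodekey}' | base64 -d; echo

# Key the restarted pod can actually see; if the client is reading a freshly
# generated key from its data directory rather than the mounted secret, the
# validator address will change on every restart
kubectl -n quorum exec besu-node-validator-1-0 -- cat /secrets/nodekey
kubectl -n quorum exec besu-node-validator-1-0 -- ls -l /data
```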
Appreciate the support - I actually got the network back online by recovering one of the nodes and have managed to remove the addresses from the validator pool. I will do some more testing and modifications to get the same behaviour. I would expect that, even without cloud services enabled, the validator keys should persist on restarts.
I was adding validators on a multi-cluster setup, with the cloud native storage setting disabled. The validators connected to the network and I voted them in (x2 nodes).
I restarted the pod, and it has now generated a new address (even though the secret with the keys has persisted).
Now I have faulty nodes, which has halted the network, and I don't have access to these private keys.
Is there a solution to either recover the keys, or roll back the chain to before voting the new unusable nodes?
This is on production