chainflip-io / support

Public ticketing system for tracking Perseverance issues

[REPORT] heavy slashing for migrated validators (p2p identity not changing?) #3

Open addiaddiaddi opened 1 year ago

addiaddiaddi commented 1 year ago

Description: We had two validators running on a DigitalOcean VPS. For the Smooth Operators program, we decided to migrate them to a bare metal machine, which runs five validators in total. The two migrated validators experienced significant slashing, which did not make sense since we turned off the old validators and followed the migration instructions. The three other validators on the exact same setup experienced no slashing and functioned perfectly.

While trying to debug the issue, we noticed an unusually high volume of p2p-related logs: several thousand p2p log lines per minute on the node experiencing slashing (see attached screenshot).

By contrast, the functional validator on the exact same setup produced only a fraction of the p2p logs (see attached screenshot).

This difference suggests the issue is p2p-related, since the keygen logs, chainflip-node logs, and everything else remained essentially identical between the two nodes.
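
For anyone trying to reproduce this comparison, a rough way to quantify the difference (the container name chainflip-engine is an assumption about how the Docker setup names things) is:

# count p2p-related log lines from the last hour
docker logs --since 1h chainflip-engine 2>&1 | grep -ci "p2p"

Running this against the slashed node and a healthy node on the same host makes the gap easy to see.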

While staking a new validator, I observed it submitting this extrinsic upon registering its role as a Validator (see attached screenshot).

This implies there is some sort of on-chain registry of p2p identity.

However, when running sudo chainflip-cli --config-root /etc/chainflip register-account-role Validator on a validator that has already set its role, the extrinsic fails and no p2p identity extrinsic is sent. The migrated validators did not submit this p2p extrinsic (I could have missed it though, as the frontend was having some issues).
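
One sanity check worth adding here (just a sketch, assuming the node key lives at the path shown in the [node_p2p] config further down this thread) is to confirm that the migrated validator is still using the same node key it registered on-chain:

# on the old machine
sha256sum /etc/chainflip/keys/node_key_file
# on the new machine, after migration
sha256sum /etc/chainflip/keys/node_key_file
# the two digests should match; a mismatch would mean the migrated engine is
# announcing a different p2p identity than the one registered on-chain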

Let me know if you have any questions.

The two addresses are:

cFKspn1yMpzSzJDNehutvNAFzyUJcBo1ANAUiHhxfBHyvdN7x
cFKWJX6PqAFS2b7Nd3ZxPoDNz3fQuCLiW2bpECdgThLyhF2Nk

Environment:
Operating System: Ubuntu 20.04
System Specification: OVH 16-core dedicated, 64 GB RAM, 1 Gbps up/down, Germany, 2×960 GB NVMe disks
Version: chainflip-latest, running in Docker (repo)

ofek-ts commented 1 year ago

I'd like to add that we verified the p2p port is indeed open: I ran netcat from a different machine and the port is open and listening.

[node_p2p]
node_key_file = "/etc/chainflip/keys/node_key_file"
ip_address = "[redacted]"
port = "8079"
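
For reference, the check was along these lines (IP redacted, port taken from the config above):

# from a different machine; a "succeeded"/"open" result means something is listening on the p2p port
nc -vz <redacted-ip> 8079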
nickeyinho commented 1 year ago

Sanity Check

Description:

I migrated to another machine (the previous one was also a Hetzner 4/8/160). Since I was selected for the Smooth Operators group, I decided to migrate using your official migration guide. I had no problems until the move. I now use Infura (HTTP) and Alchemy (WS) to be safe, but nothing helps: I'm getting slashed at times and want to investigate and fix the issue. I don't think the problem is with the RPC, since previously I used only Infura for both WS and HTTP. As far as I can tell, the migration succeeded, as my node works and is in the authority set, but I'm losing tFLIP and reputation.
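
If it helps, the HTTP side of the RPC can be ruled in or out with a direct query like the one below (the URL is a placeholder for the actual Infura project endpoint):

# a reply like {"jsonrpc":"2.0","id":1,"result":"0x..."} means the HTTP RPC endpoint is reachable and answering
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  https://<network>.infura.io/v3/<project-id>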

Latest slashes:

  1. https://blocks-perseverance.chainflip.io/block/0x39bb212a143c8c8eb9597bdfbcc254e648121a6c09b1ad8a5404bc8b0d3fcf83
  2. https://blocks-perseverance.chainflip.io/block/0x311222263532b5ad6d28d1c4de0b213ed185d112d86578f4d2dfaf6e0eb0df5e

ID: cFLTSe857hxiFmKdwz2CnNz8rmgGyzXgkatXzJF8W28qZGwUi

The last thing I can try is switching the WS endpoint to QuickNode, but I don't think that's the problem since, as I said, I previously used only Infura and it worked perfectly.

Environment:

Operating System: Ubuntu 20.04
System Specification: Hetzner CPX41, 8-core shared, 16 GB RAM, 1 Gbps up/down, Finland, 240 GB NVMe disk
Version: chainflip-engine 0.7.3, chainflip-node 0.7.2

Haafingar commented 1 year ago

Thanks for submitting this.

Definitely seems like there is a p2p authentication issue. You don't still have the old instance running on the old machine at the same time, do you?

We will investigate this, but as we stated in the docs, we recommend not migrating nodes, ever. Unstaking and then restaking is way cleaner, easier, and arguably faster.
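
A quick way to double-check that nothing is left running on the old machine (assuming the stock Docker setup) is:

# run on the OLD machine; no output means no chainflip containers are still up
docker ps --filter "name=chainflip" --format "{{.Names}}: {{.Status}}"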

ofek-ts commented 1 year ago

> Thanks for submitting this.

Ofc 😄

> Definitely seems like there is a p2p authentication issue. You don't still have the old instance running on the old machine at the same time, do you?

Correct, we don't have the previous one running.

> We will investigate this, but as we stated in the docs, we recommend not migrating nodes, ever. Unstaking and then restaking is way cleaner, easier, and arguably faster.

I understand that it is not recommended, but there are occasions where you don't really have a choice. What if the physical machine fails? In that scenario you have to bring your validator back to life, and unstaking isn't instant (during an auction period, for example, you can't unstake at all), so you're effectively forced to migrate to avoid getting slashed rather than unstake.