aws-samples / aws-blockchain-node-runners

Run blockchain nodes on cloud
https://aws-samples.github.io/aws-blockchain-node-runners/
MIT No Attribution
57 stars 47 forks source link

Remove dubious arguments from solana node configuration #28

Closed t-nelson closed 10 months ago

t-nelson commented 11 months ago

What does this PR do?

Removes configuration options from the solana node configuration that are likely to lead to poor performance and/or unnecessary support burden

Motivation

a flood of new operators with poorly performing nodes seeking support on the solana tech discord server

More

For Moderators

Additional Notes

please have some support staff keep an eye on https://solana.com/discord

bji commented 11 months ago

I'm not a reviewer but this looks like a great change to me.

bartenbach commented 11 months ago

I can only echo my approval as a participant in the Solana ecosystem

frbrkoala commented 10 months ago

Thank you very much for this PR. I'm running "smoke" tests to check if everything is working well before merging. Should wrap it up by the end of this week, so please stay tuned!

t-nelson commented 10 months ago

this is actually one of the worst flags to remove. we regularly get folks on discord with nodes not able to connect to the network. they have copypasta'd that flag from somewhere, misconfigured their firewall and don't get log messages letting them know what's wrong as a result.


which port(s) specifically are violating the restrictions? the tests that flag disables should not be accessing any ports that the validator doesn't need to operate

bartenbach commented 10 months ago

Yeah, that doesn't make sense to me. --no-port-check removes the sanity check that the ports you declare are actually open and can be used.

Skipping the port check and then trying to use RPC services on a closed port doesn't make a whole lot of sense.

frbrkoala commented 10 months ago

Thank you for clarifications. I can assure you that it will not be a problem in this setup because firewall rules are controlled in VPC security group by CDK and not the scripts within the EC2 instance (see details below). To limit the room for errors I explicitly specify the port numbers in the startup scripts too.

Here are the ports we control:

Link to the code.

The test initiates connection from the outside to all necessary ports, including 8899 and 8900, which can't be accessed by design. Here is the log message captured during startup: [2024-01-08T00:29:05.193062671Z INFO solana_net_utils] Checking that tcp ports [(8899, TcpListener { addr: 0.0.0.0:8899, fd: 87 }), (8900, TcpListener { addr: 0.0.0.0:8900, fd: 88 }), (8801, TcpListener { addr: 0.0.0.0:8801, fd: 89 })] are reachable from 34.83.231.102:8001

Opening these ports to public without extra protection exposes nodes to security risks caused by undesirable traffic to RPC. This will not pass our security review. What are your usual recommendations for protecting ports 8899 and 8900?

t-nelson commented 10 months ago

This will not pass our security review. What are your usual recommendations for protecting ports 8899 and 8900?

you can pass --private-rpc. this will affect a few things

  1. omits the rpc/ws ports from the startup port checks
  2. omits the rpc/ws ports from being published with gossip contact info
  3. changes the default rpc/ws bind address to localhost, rather than adopting the gossip bind address (override with --rpc-bind-address ADDRESS)

note that this does not necessarily restrict foreign addresses from connecting to the service! that remains the job of a firewall and/or reverse proxy

does this achieve an acceptable result?

frbrkoala commented 10 months ago

Yes, that should work in theory. Let me test it today and I'll get back to you.

frbrkoala commented 10 months ago

All right. The test is not successful so far. With --private-rpc and --rpc-bind-address ADDRESS but without --no-port-check the validator process starts syncing while also gradually increasing the amount of consumed RAM. Eventually RAM it runs out of memory and crashes without fully syncing. I tested on version v1.16.25. Will try switching to v1.16.26 tomorrow.

t-nelson commented 10 months ago

nothing between those two versions would affect memory usage. nor, removal of any of the proposed cli flags.

what are the specs of the test machine?

frbrkoala commented 10 months ago

I upgraded to version 1.17.16 and ran another test. The memory consumption problem has not repeated itself. I also made a few more changes and bug fixes, pushed to another branch and opened PR #32 . Closing this PR and let's move our conversation to PR #32 . @t-nelson will be great if you can have one more look and give your feedback.