nikki-quant opened this issue 1 month ago
@nikki-quant, have you checked the connection from the container itself?
Thanks for your response. Do you mean the connection to stats.xinfin.network?
If so, yes, I get a response from netcat and curl:
Confirming that the stats server corresponds to the IP address in the logs:
ubuntu@node:~$ host stats.xinfin.network
stats.xinfin.network has address 45.82.64.150
Netcat uses TCP unless we pass the -u flag, and I don't see any issue with TCP access to that host on that port:
ubuntu@node:~$ nc stats.xinfin.network -zv 3000
Connection to stats.xinfin.network 3000 port [tcp/*] succeeded!
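If the application's failures are intermittent, a single probe may not be representative; a repeated check along these lines could be left running to rule that out (standard nc flags; -w sets a connect timeout):

# Probe the stats port once a minute for an hour and log any failures,
# to rule out intermittent connectivity; -z only checks the port, -w 5 is a 5s timeout.
for i in $(seq 1 60); do
  nc -z -w 5 stats.xinfin.network 3000 || echo "$(date -u) probe $i failed"
  sleep 60
done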
Hitting that endpoint with curl, we receive an HTML document:
ubuntu@node:~$ curl http://stats.xinfin.network:3000
<!DOCTYPE html><html ng-app="netStatsApp"><head><meta name="viewport" content="width=device-width, initial-scale=1.0,
The node's security group has outbound access open on all ports but some restrictions on inbound connections. AWS security groups will allow a response to an outgoing connection, and since TCP keeps the connection open rather than the server initiating a new connection, I would not expect this to be an issue with network security.
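For completeness, the egress rules can be double-checked from the AWS CLI; a minimal sketch, assuming the AWS CLI is configured and sg-0123456789abcdef0 stands in for the instance's actual security group ID:

# List the outbound (egress) rules of the security group to confirm all ports are open.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissionsEgress'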
If you are running it in Docker, please enter the container environment. Run docker ps -a to find the container name (e.g. mainnet-xinfinnetwork-1) and enter the container:
docker exec -ti mainnet-xinfinnetwork-1 bash
curl 45.82.64.150:3000
What I see:
Yes, I'm running in docker. I receive the same response as on the underlying host.
ubuntu@node:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0cbe0caa505d xinfinorg/xdposchain:v2.2.4 "bash /work/entry.sh" 3 hours ago Up 3 hours 8555/tcp, 0.0.0.0:30303->30303/tcp, :::30303->30303/tcp, 0.0.0.0:8989->8545/tcp, [::]:8989->8545/tcp, 0.0.0.0:8888->8546/tcp, [::]:8888->8546/tcp mainnet-xinfinnetwork-1
ubuntu@node:~$ docker exec -it 0cbe0caa505d bash
0cbe0caa505d:/work# curl 45.82.64.150:3000 | head
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0<!DOCTYPE html><html ng-app="netStatsApp">
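As an aside, curl's progress meter can be suppressed while still surfacing errors, which keeps piped output like the above cleaner:

# -s silences the progress meter, -S still prints errors if the request fails.
curl -sS 45.82.64.150:3000 | head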
I'm using a recent copy of this repo:
commit 769d463c6dd272925b0ffad15db0c2341d8dfe81 (HEAD -> master, origin/master, origin/HEAD)
Author: Anil Chinchawale <anil24593@gmail.com>
Date: Tue Sep 17 20:03:54 2024 +0530
Update README.md
I have updated the docker-compose file to use the image xinfinorg/xdposchain:v2.2.4
but did not pull an updated copy of the master branch because I'm waiting for other people in my team to review the 2.2.5 changes and confirm that they'll work with our application stack.
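If it helps, the image the running container was actually created from can be confirmed with docker inspect (container name as in the docker ps output above):

# Print the image the running container was created from.
docker inspect -f '{{.Config.Image}}' mainnet-xinfinnetwork-1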
Have you enabled a firewall?
Looks like you need to allow connections between 172.18.0.2 and 45.82.64.150.
172.18.0.2 is an address on the Docker subnet on your server, and it needs to have access to 45.82.64.150:3000.
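You can check from the host whether anything is filtering that traffic; a rough sketch (assumes iptables is in use, and the network name mainnet_default is a guess based on the compose project name):

# Show the Docker networks and the subnet the container is attached to.
docker network ls
docker network inspect mainnet_default   # network name is a guess based on the compose project
# Check the host firewall's FORWARD chain for DROP/REJECT rules affecting the bridge.
sudo iptables -L FORWARD -n -v | head -n 20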
@pro100skm as the log above shows, when I exec into the Docker container I can connect to 45.82.64.150:3000 (in the output above I receive the HTML document starting <!DOCTYPE html><html ng-app="netStatsApp"> from that server).
Could you explain why you think there is currently a firewall issue? From what I can see, the container is successfully reaching that endpoint.
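For what it's worth, a plain GET does not exercise the WebSocket upgrade that the stats client presumably uses; below is a rough probe that could be run from inside the container (both the / path and the usefulness of this check are assumptions on my part):

# Ask the stats server to upgrade the connection to a WebSocket.
# A 101 Switching Protocols response would show the handshake is reachable;
# the key below is the RFC 6455 sample value and the path is a guess.
curl -i -s \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  http://45.82.64.150:3000/ | head -n 5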
Yeah, now I see it. Let's wait for a response from the core team's devs.
Hi, this seems very slow to me too. Could you try restarting and see if you have the same issue?
the same after a while
the same
Restarting the process did not help, and I could not see any evidence that the node was resource-constrained using top, lsof, vmstat or similar tools, but our node eventually began syncing somewhat faster (~1k blocks/minute) and caught up without us making any changes.
I'm still not sure whether this was due to a lack of peers on the newer version or some other factor. No changes to the underlying infrastructure that I made while debugging seemed to have an effect. This gives me some concern about continuing to maintain a node, since it's not clear to me how to effectively debug and resolve issues with the XDC software stack.
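For reference, the sync rates mentioned above can be measured by polling the node's RPC; a minimal sketch, assuming the HTTP RPC is reachable on localhost:8989 as in the docker ps output earlier and that the standard eth_blockNumber JSON-RPC method is enabled:

# Report how many blocks the node imports per minute.
prev=""
while true; do
  hex=$(curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    http://localhost:8989 | sed -n 's/.*"result":"0x\([0-9a-fA-F]*\)".*/\1/p')
  [ -n "$hex" ] || { sleep 60; continue; }   # skip a round if the RPC call failed
  cur=$((16#$hex))
  [ -n "$prev" ] && echo "$(date -u +%T) block $cur (+$((cur - prev)) in the last minute)"
  prev=$cur
  sleep 60
done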
Dear XinFin team,
We run an XDC Mainnet node in-house, and around 3 weeks ago we provisioned a new v2.2.4 node on an AWS EC2 r5a.xlarge instance. We had used the same instance type previously and found the performance good enough for our purposes, syncing 1.5 million blocks in 15 minutes.
On this occasion our new node is syncing very slowly - an average of 300 blocks per minute. It has ~17 peers, and does not seem CPU, disk, or memory constrained according to top, vmstat, or CloudWatch metrics. In terms of application logs, the only errors I see are related to the stats endpoint, which seems to be incorrect:
I can reach the stats server with netcat or curl, so I'm unsure why the application is getting TCP errors:
Does the node name passed as $INSTANCE_NAME need to be unique?
Are there any configuration options we can change in the node settings, or recommendations that you would make for system configuration to increase performance?
Is there any documentation on troubleshooting this kind of issue out there we should take a look at?
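In case it is useful for reproducing, the stats-related errors can be pulled out of the container logs with something along these lines (container name as in the docker ps output above; the grep pattern is just a rough filter):

# Show the most recent log lines that mention the stats endpoint or TCP dial errors.
docker logs mainnet-xinfinnetwork-1 2>&1 | grep -i -E 'stats|dial' | tail -n 20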