Closed sj602 closed 1 year ago
See https://xrpl.org/server-doesnt-sync.html
What hardware are you running on?
@intelliot I don't think my hardware is the problem, because I'm running rippled on other machines with the same hardware. The problem occurred when I tried to install a new rippled (1.12.0); the other machines running 1.11.0 are working fine.
CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz Memory: 32GB Disk: 2TB
I guess the node can't get any peer information.
$ rippled peers
Loading: "/etc/opt/ripple/rippled.cfg"
2023-Oct-24 02:34:30.852130053 UTC HTTPClient:NFO Connecting to 127.0.0.1:5005
{
"result" : {
"cluster" : {},
"peers" : null,
"status" : "success"
}
}
Sounds like a connectivity issue. Can you try downgrading the rippled binary on the problematic machine to 1.11.0, and see if it syncs successfully? If it does not, then there is a problem with that machine, not rippled 1.12.0.
I tried downgrading to the original version (1.11.0) but it didn't work out, either. And then I rebooted the machine and tried 1.12.0/1.11.0 but all failed.
/var/log/rippled/debug.log is not showing any additional or helpful logs to track this issue. Is there a way to log which step rippled is stopping on? Like modifying some files in rippled?
@sj602
First, I noticed that you have duplicate entries for validator_list_keys and validator_list_sites. I don't think it matters, but I would use the validators_file and remove those from your config, or you're going to forget which one is the correct setting.
Second, you can use debug trace to get better output.
@dangell7
I modified it. Thanks! But it didn't solve it.
How can I debug trace? I'd appreciate it if you could give me any documents for it.
Oh, I'm not sure this is related to the issue, but one thing I tried is to capture all network packets transmitted to port 5005, like below.
[irteamsu@LNELIBNODE1509 ripple]$ sudo /sbin/tcpdump -i any -s 0 'host 127.0.0.1 and port 5005'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
But I guess not a single packet is being transmitted.
Just to confirm, you're running the same config on all the servers right?
I don't have any documents on it, but if you could put it into a gist and paste it here I can try to help.
Also make sure you remove the db when restarting.
Just to confirm, you're running the same config on all the servers right?
Yes.
I dont have any documents on it but if you could put it into a gist and paste it here I can try to help.
If I understand correctly, do you want me to upload the whole /var/log/rippled/debug.log file in a gist? You can check part of it above, though.
Also make sure you remove the db when restarting
Yes. My install script removes the db directory.
Where is your ips or ips_fixed stanza?
I don't know what they mean, but I guess they are only for testnet. I can't find any settings for ips and ips_fixed in my rippled.cfg.
I guess my node can not get any peer information
[irteamsu@LNELIBNODE1509 ~]$ rippled peers
Loading: "/etc/opt/ripple/rippled.cfg"
2023-Oct-26 03:59:27.456110923 UTC HTTPClient:NFO Connecting to 127.0.0.1:5005
{
"result" : {
"cluster" : {},
"peers" : null,
"status" : "success"
}
}
I changed the log_level from warning to trace in rippled.cfg:
[rpc_startup]
{ "command": "log_level", "severity": "trace" }
and then I restarted rippled and found this in /var/log/rippled/debug.log. 3.122.10.154:51235 seems like a peer address, and it throws Connection reset by peer.
...
2023-Oct-26 06:24:28.557590197 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:29.557643361 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:29.557676760 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:30.557727672 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:30.557762832 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:30.558912876 UTC PeerFinder:DBG Logic connect 1 boot address
2023-Oct-26 06:24:30.558934113 UTC Resource:DBG New outbound endpoint 3.122.10.154:51235
2023-Oct-26 06:24:30.558946476 UTC PeerFinder:DBG Logic connect 3.122.10.154:51235
2023-Oct-26 06:24:30.558986105 UTC Peer:DBG [183] Connect 3.122.10.154:51235
2023-Oct-26 06:24:30.804778831 UTC Peer:TRC [183] onConnect
2023-Oct-26 06:24:31.052363809 UTC Peer:DBG [183] onHandshake: Connection reset by peer
2023-Oct-26 06:24:31.052388839 UTC Peer:DBG [183] Closed
2023-Oct-26 06:24:31.052410712 UTC PeerFinder:DBG Bootcache failed 3.122.10.154:51235 with 6 attempts
2023-Oct-26 06:24:31.052421933 UTC Peer:TRC [183] ~ConnectAttempt
2023-Oct-26 06:24:31.052444542 UTC Resource:DBG Inactive 3.122.10.154:51235
2023-Oct-26 06:24:31.557800320 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:31.557829841 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:32.557862709 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:32.557892630 UTC JobQueue:TRC Doing heartbeatjob
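With trace logging on, the heartbeat lines drown out the few lines that matter. A minimal sketch for pulling out only the failed bootstrap attempts, assuming the log line layout shown in the excerpt above (the function name failed_peers is made up for this example):

```shell
# Summarize failed bootstrap attempts from a rippled debug.log read on stdin.
# Prints "address attempts" for each "PeerFinder:DBG Bootcache failed" line.
# Field positions assume the log format shown in this thread.
failed_peers() {
    awk '$4 == "PeerFinder:DBG" && $5 == "Bootcache" && $6 == "failed" { print $7, $9 }'
}

# Example (log path taken from this thread):
#   failed_peers < /var/log/rippled/debug.log | sort -u
```

If every bootstrap address shows up here with a growing attempt count, the node is reaching out but never completing a handshake, which points at the network path rather than at rippled itself.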
3.122.10.154 is one of the IPs in the r.ripple.com default bootstrap pool.
If you're not able to connect to it, there are only three causes I can think of:
Fortunately, the peer port health method is available on that server. See if you can connect to it from the command line
curl -k https://3.122.10.154:51235/health
You should get back {"info":{}}. If not, post the result here.
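When the health check fails, curl's exit code gives a rough hint about which layer failed. The sketch below maps a few of curl's documented exit codes to a short description. The function names and the wording of the descriptions are assumptions made for this example, not anything rippled provides.

```shell
# Map a curl exit status to a rough description of where the connection failed.
# The numbers are curl's documented exit codes; the wording is this sketch's own.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok: peer port answered" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "tcp: failed to connect" ;;
        28) echo "timeout: operation timed out" ;;
        35) echo "tls: handshake failed or was reset" ;;
        *)  echo "other: curl exit code $1" ;;
    esac
}

# Hypothetical helper: probe the peer port health endpoint of one address.
probe_peer() {
    curl -k -s -o /dev/null --max-time 5 "https://$1:51235/health"
    describe_curl_exit "$?"
}

# Example: probe_peer 3.122.10.154
```

A code 35 means the TCP connection opened but was cut during the TLS handshake, which is the kind of failure a firewall or ACL in the path can produce.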
@ximinez both curl and ping failed.
[irteamsu@LNELIBNODE1509 ~]$ curl -k https://3.122.10.154:51235/health
curl: (35) TCP connection reset by peer
[irteamsu@LNELIBNODE1509 ~]$ ping 3.122.10.154
PING 3.122.10.154 (3.122.10.154) 56(84) bytes of data.
^C
--- 3.122.10.154 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9007ms
But these also failed on the other servers, which are currently working OK.
If you can't reach r.ripple.com then you can add some other reachable nodes in [ips] or [ips_fixed]. This can even include other servers that you are running, as long as they are on the same network (Mainnet). They do not need to be running the same version of rippled.
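For reference, each entry in such a stanza is a host and a port separated by a space, one per line. A small sketch that prints an [ips_fixed] stanza from a list of pairs follows. The function name is made up, and the addresses in the example are placeholders from the documentation range 192.0.2.0/24, not real peers.

```shell
# Print an [ips_fixed] stanza from "host port" pairs read on stdin.
# Blank lines and lines starting with '#' are skipped.
ips_fixed_stanza() {
    echo "[ips_fixed]"
    awk 'NF >= 2 && $1 !~ /^#/ { print $1, $2 }'
}

# Example with placeholder addresses (replace with nodes you can actually reach):
#   printf '%s\n' '192.0.2.10 51235' '192.0.2.11 51235' | ips_fixed_stanza
```

The output can be pasted into rippled.cfg. The server has to be restarted before it picks up the change.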
@intelliot I guess it's related to our ACL. I expect it to work if I add the peer nodes' IPs and ports, but there is some possibility that their IPs and ports will change in the future. I'll report the result soon.
After adding a new outbound ACL (my node -> *:51235), it's working fine. As far as I know, there were no ACL-related changes in the v1.12.0 update, so I don't know why it caused a problem all of a sudden.
Were any new peers added while the network was upgrading from v1.11.0 to v1.12.0?
Not that I know of.
Issue Description
rippled never finishes syncing.
Steps to Reproduce
sudo systemctl start rippled.service
Expected Result
rippled synchronizes properly.
Actual Result
When I ran sudo rippled server_info, I found server_state: disconnected, and no additional logs were being added in real time to /var/log/rippled/debug.log.
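To watch for this state from a script instead of reading the JSON by hand, something like the following sketch could work. It assumes the server_info output contains a line of the form "server_state" : "..." as rippled prints it. The function names are made up for this example.

```shell
# Extract the server_state value from `rippled server_info` output read on stdin.
# The closing quote after the key keeps server_state_duration_us from matching.
server_state() {
    sed -n 's/.*"server_state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeeds when the reported state is full, validating, or proposing.
is_synced() {
    case "$(server_state)" in
        full|validating|proposing) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: rippled server_info | is_synced && echo synced
```

A server that stays in disconnected has no peers at all, so the next thing to check is the output of rippled peers.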
Environment
OS: CentOS Linux release 7.9.2009 (Core)
rippled version: v1.12.0
Supporting Files
sudo rippled server_info
/var/log/rippled/debug.log
/opt/ripple/etc/rippled.cfg
/opt/ripple/etc/validators.txt