XRPLF / rippled

Decentralized cryptocurrency blockchain daemon implementing the XRP Ledger protocol in C++
https://xrpl.org
ISC License
4.52k stars 1.47k forks source link

[rippled keeps server_status:disconnected] (Version: [1.12.0]) #4777

Closed sj602 closed 1 year ago

sj602 commented 1 year ago

Issue Description

rippled is not synching forever.

Steps to Reproduce

sudo systemctl start rippled.service

Expected Result

rippled synchronizes properly.

Actual Result

when I typed sudo rippled server_info, I found server_state: disconnected and there's no additional logs that's being added in real-time in /var/log/rippled/debug.log

Environment

OS: CentOS Linux release 7.9.2009 (Core) rippled version: v1.12.0

Supporting Files

sudo rippled server_info

[irteamsu@LNELIBNODE1509 ripple]$ sudo rippled server_info
Loading: "/etc/opt/ripple/rippled.cfg"
2023-Oct-19 07:58:20.351256291 UTC HTTPClient:NFO Connecting to 127.0.0.1:5005

{
   "result" : {
      "info" : {
         "build_version" : "1.12.0",
         "closed_ledger" : {
            "age" : 430,
            "base_fee_xrp" : 1e-05,
            "hash" : "D6E5C532F475DFDCFE4FA7E93AAE8D5A961A8A8C6ECB13E5F784332A28724907",
            "reserve_base_xrp" : 10,
            "reserve_inc_xrp" : 2,
            "seq" : 2
         },
         "complete_ledgers" : "empty",
         "hostid" : "LNELIBNODE1509",
         "io_latency_ms" : 1,
         "jq_trans_overflow" : "0",
         "last_close" : {
            "converge_time_s" : 0,
            "proposers" : 0
         },
         "load" : {
            "job_types" : [
               {
                  "in_progress" : 1,
                  "job_type" : "clientRPC"
               }
            ],
            "threads" : 14
         },
         "load_factor" : 1,
         "network_ledger" : "waiting",
         "node_size" : "huge",
         "peer_disconnects" : "0",
         "peer_disconnects_resources" : "0",
         "peers" : 0,
         "ports" : [
            {
               "port" : "5005",
               "protocol" : [ "http" ]
            },
            {
               "port" : "51235",
               "protocol" : [ "peer" ]
            },
            {
               "port" : "6006",
               "protocol" : [ "ws" ]
            }
         ],
         "pubkey_node" : "n9JbZCeCwMinZkuKFTWgXFaEmMcT12Wdtb1jsLru3WumwHQSUnht",
         "pubkey_validator" : "none",
         "published_ledger" : "none",
         "server_state" : "disconnected",
         "server_state_duration_us" : "431857943",
         "state_accounting" : {
            "connected" : {
               "duration_us" : "1000825",
               "transitions" : "1"
            },
            "disconnected" : {
               "duration_us" : "431952490",
               "transitions" : "2"
            },
            "full" : {
               "duration_us" : "0",
               "transitions" : "0"
            },
            "syncing" : {
               "duration_us" : "0",
               "transitions" : "0"
            },
            "tracking" : {
               "duration_us" : "0",
               "transitions" : "0"
            }
         },
         "time" : "2023-Oct-19 07:58:20.351744 UTC",
         "uptime" : 432,
         "validation_quorum" : 4294967295,
         "validator_list" : {
            "count" : 2,
            "expiration" : "2023-Oct-24 00:00:00.000000000 UTC",
            "status" : "active"
         }
      },
      "status" : "success"
   }
}

/var/log/rippled/debug.log

2023-Oct-19 07:51:07.409987623 UTC Application:DBG startUp: 5
2023-Oct-19 07:51:07.440685883 UTC OrderBookDB:WRN Eliding full order book update: no ledger
2023-Oct-19 07:51:07.441025559 UTC ValidatorList:DBG Loading configured trusted validator list publisher keys
2023-Oct-19 07:51:07.441043647 UTC ValidatorList:WRN Duplicate validator list publisher key: ED2677ABFFD1B33AC6FBC3062B71F1E8397C1505E1C42C64D11AD1B28FF73F4734
2023-Oct-19 07:51:07.441052804 UTC ValidatorList:DBG Loaded 2 keys
2023-Oct-19 07:51:07.441059832 UTC ValidatorList:DBG Loading configured validator keys
2023-Oct-19 07:51:07.441066643 UTC ValidatorList:DBG Loaded 0 entries
2023-Oct-19 07:51:07.441074293 UTC ValidatorSite:DBG Loading configured validator list sites
2023-Oct-19 07:51:07.441197501 UTC ValidatorSite:DBG Loaded 3 sites
2023-Oct-19 07:51:07.492984113 UTC NetworkOPs:NFO Consensus time for #3 with LCL D6E5C532F475DFDCFE4FA7E93AAE8D5A961A8A8C6ECB13E5F784332A28724907
2023-Oct-19 07:51:07.493014093 UTC ValidatorList:DBG 0  of 0 listed validators eligible for inclusion in the trusted set
2023-Oct-19 07:51:07.493024284 UTC ValidatorList:DBG Using quorum of 18446744073709551615 for new set of 0 trusted validators (0 added, 0 removed)
2023-Oct-19 07:51:07.493031432 UTC ValidatorList:WRN New quorum of 18446744073709551615 exceeds the number of trusted validators (0)
2023-Oct-19 07:51:07.493040309 UTC LedgerMaster:DBG No validated ledger
2023-Oct-19 07:51:07.493047877 UTC NetworkOPs:NFO STATE->connected
2023-Oct-19 07:51:07.493058846 UTC LedgerConsensus:NFO Entering consensus process, watching, synced=no
2023-Oct-19 07:51:07.493069813 UTC LedgerConsensus:DBG transitioned to ConsensusPhase::open
2023-Oct-19 07:51:07.493077910 UTC LedgerConsensus:NFO Consensus mode change before=observing, after=observing
2023-Oct-19 07:51:07.493086072 UTC NetworkOPs:DBG Initiating consensus engine
2023-Oct-19 07:51:07.493448928 UTC Server:NFO Opened 'port_rpc_admin_local' (ip=0.0.0.0:5005, admin nets:0.0.0.0/0, ::/0, http)
2023-Oct-19 07:51:07.493497357 UTC Server:NFO Opened 'port_peer' (ip=0.0.0.0:51235, peer)
2023-Oct-19 07:51:07.493543234 UTC Server:NFO Opened 'port_ws_admin_local' (ip=127.0.0.1:6006, admin nets:127.0.0.1/32, ws)
2023-Oct-19 07:51:08.493879093 UTC NetworkOPs:WRN Node count (0) has fallen below required minimum (1).

/opt/ripple/etc/rippled.cfg

[server]
port_rpc_admin_local
port_peer
port_ws_admin_local

[port_rpc_admin_local]
port = 5005
ip = 0.0.0.0
admin = 0.0.0.0
protocol = http

[port_peer]
port = 51235
ip = 0.0.0.0
protocol = peer

[port_ws_admin_local]
port = 6006
ip = 127.0.0.1
admin = 127.0.0.1
protocol = ws

[node_size]
huge

[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=8
file_size_mult=2
online_delete=51200
advisory_delete=0

[ssl_verify]
0
[validator_list_sites]
https://vl.ripple.com

[validator_list_keys]
ED2677ABFFD1B33AC6FBC3062B71F1E8397C1505E1C42C64D11AD1B28FF73F4734

[sntp_servers]
time.windows.com
time.apple.com
time.nist.gov
pool.ntp.org

[validators_file]
validators.txt

[rpc_startup]
{ "command": "log_level", "severity": "warning" }

[database_path]
/var/lib/rippled/db

[debug_logfile]
/var/log/rippled/debug.log

/opt/ripple/etc/validators.txt

[validator_list_sites]
https://vl.ripple.com
https://vl.xrplf.org

[validator_list_keys]
ED2677ABFFD1B33AC6FBC3062B71F1E8397C1505E1C42C64D11AD1B28FF73F4734
ED45D1840EE724BE327ABE9146503D5848EFD5F38B6D5FEDE71E80ACCE5E6E738B
intelliot commented 1 year ago

See https://xrpl.org/server-doesnt-sync.html

What hardware are you running on?

sj602 commented 1 year ago

@intelliot I don't think my hardware is the problem because I'm running rippled in different machines with the same hardware. It occurred when I tried to install a new rippled (1.12.0), other different machines running 1.11.0 are working fine.

CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz Memory: 32GB Disk: 2TB

sj602 commented 1 year ago

I guess the node can't get any peer information.

$ rippled peers
Loading: "/etc/opt/ripple/rippled.cfg"
2023-Oct-24 02:34:30.852130053 UTC HTTPClient:NFO Connecting to 127.0.0.1:5005

{
   "result" : {
      "cluster" : {},
      "peers" : null,
      "status" : "success"
   }
}
intelliot commented 1 year ago

Sounds like a connectivity issue. Can you try downgrading the rippled binary on the problematic machine to 1.11.0, and see if it syncs successfully? If it does not, then there is a problem with that machine, not rippled 1.12.0.

sj602 commented 1 year ago

I tried downgrading to the original version (1.11.0) but it didn't work out, either. And then I rebooted the machine and tried 1.12.0/1.11.0 but all failed.

/var/log/rippled/debug.log is not showing any additional/helpful logs to track this issue, So is there a way to leave a log what step the rippled is stopping on? Like modifying some files in rippled?

dangell7 commented 1 year ago

@sj602

First I noticed that you have duplicate entries for validator_list_keys and validator_list_sites. I don't think it matters but I would use the validators_file and remove those from your config or you're going to forget which one is the correct setting.

Second you can use debug trace to get a better output.

sj602 commented 1 year ago

@dangell7

I modified it. Thanks! but it didn't solve it.

How can I debug trace? I'd appreciate it if you give me any documents for it.

sj602 commented 1 year ago

Oh, I'm not sure this is related to the issue but one thing I tried is to capture all network packets transmitted to the port 5005 like below.

[irteamsu@LNELIBNODE1509 ripple]$ sudo /sbin/tcpdump -i any -s 0 'host 127.0.0.1 and port 5005'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

but I guess there's no single packet that's being transmitted

dangell7 commented 1 year ago

Just to confirm, you're running the same config on all the servers right?

I dont have any documents on it but if you could put it into a gist and paste it here I can try to help.

Also make sure you remove the db when restarting

sj602 commented 1 year ago

Just to confirm, you're running the same config on all the servers right?

Yes.

I dont have any documents on it but if you could put it into a gist and paste it here I can try to help.

If I understand correctly, do you want me to upload the whole /var/log/rippled/debug.log file in gist? You can check the part of it in the above, though.

Also make sure you remove the db when restarting

Yes. my install script contains the removal of db directory.

where is your ips or ips_fixed stanza?

I don't know what they mean but I guess they are only for testnet. I can't find any settings for ips and ips_fixed for my rippled.cfg

sj602 commented 1 year ago

FYI. my install script: here my config: here

sj602 commented 1 year ago

I guess my node can not get any peer information

[irteamsu@LNELIBNODE1509 ~]$ rippled peers
Loading: "/etc/opt/ripple/rippled.cfg"
2023-Oct-26 03:59:27.456110923 UTC HTTPClient:NFO Connecting to 127.0.0.1:5005

{
   "result" : {
      "cluster" : {},
      "peers" : null,
      "status" : "success"
   }
}
sj602 commented 1 year ago

I changed the log_level from warning to trace in rippled.cfg

[rpc_startup]
{ "command": "log_level", "severity": "trace" }

and then I restarted rippled, found this in /var/log/rippled/debug.log. 3.122.10.154:51235 seems like a peer address and it throws Connection reset by peer.

...
2023-Oct-26 06:24:28.557590197 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:29.557643361 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:29.557676760 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:30.557727672 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:30.557762832 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:30.558912876 UTC PeerFinder:DBG Logic connect     1 boot address
2023-Oct-26 06:24:30.558934113 UTC Resource:DBG New outbound endpoint 3.122.10.154:51235
2023-Oct-26 06:24:30.558946476 UTC PeerFinder:DBG Logic connect     3.122.10.154:51235
2023-Oct-26 06:24:30.558986105 UTC Peer:DBG [183] Connect 3.122.10.154:51235
2023-Oct-26 06:24:30.804778831 UTC Peer:TRC [183] onConnect
2023-Oct-26 06:24:31.052363809 UTC Peer:DBG [183] onHandshake: Connection reset by peer
2023-Oct-26 06:24:31.052388839 UTC Peer:DBG [183] Closed
2023-Oct-26 06:24:31.052410712 UTC PeerFinder:DBG Bootcache failed  3.122.10.154:51235 with 6 attempts
2023-Oct-26 06:24:31.052421933 UTC Peer:TRC [183] ~ConnectAttempt
2023-Oct-26 06:24:31.052444542 UTC Resource:DBG Inactive 3.122.10.154:51235
2023-Oct-26 06:24:31.557800320 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:31.557829841 UTC JobQueue:TRC Doing heartbeatjob
2023-Oct-26 06:24:32.557862709 UTC JobQueue:DBG addRefCountedJob : Adding job : NetOPs.heartbeat : 34
2023-Oct-26 06:24:32.557892630 UTC JobQueue:TRC Doing heartbeatjob
ximinez commented 1 year ago

3.122.10.154 is one of the IPs in the r.ripple.com default bootstrap pool.

If you're not able to connect to it, there are only three causes I can think of:

  1. A networking problem.
  2. A configuration error on your server that is causing the remote server to reject the connection.
  3. A proxy server or something similar that is dropping the connection.

Fortunately, the peer port health method is available on that server. See if you can connect to it from the command line

curl -k https://3.122.10.154:51235/health

You should get back {"info":{}}. If not, post the result here.

sj602 commented 1 year ago

@ximinez both curl and ping failed.

[irteamsu@LNELIBNODE1509 ~]$ curl -k https://3.122.10.154:51235/health
curl: (35) TCP connection reset by peer

[irteamsu@LNELIBNODE1509 ~]$ ping 3.122.10.154
PING 3.122.10.154 (3.122.10.154) 56(84) bytes of data.
^C
--- 3.122.10.154 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9007ms

but these also failed on the other servers which are currently working OK.

intelliot commented 1 year ago

If you can't reach r.ripple.com then you can add some other reachable nodes in [ips] or [ips_fixed]. This can even include other servers that you are running, as long as they are on the same network (Mainnet). They do not need to be running the same version of rippled.

sj602 commented 1 year ago

@intelliot I guess it's related to our ACL. I expect it to be working OK if I add peer node's ips and ports but I think there is some possibility that their IPs and ports will change in the future. I'll report the result sometime soon.

sj602 commented 1 year ago

After adding a new outbound ACL(my node -> *:51235), it's working fine. As I know, there were no updates on any ACL things in this v1.12.0 update so I don't why it made a problem all of a sudden.

sj602 commented 1 year ago

Is there any newly added peer while the network is upgrading v1.11.0 -> v1.12.0?

intelliot commented 1 year ago

Not that I know of.