XRPLF / rippled

Decentralized cryptocurrency blockchain daemon implementing the XRP Ledger protocol in C++
https://xrpl.org
ISC License

> We've made several changes to this code with `rippled 1.1.1` which I believe should address this issue. I'm closing it, but if you continue having problems, please feel free to reopen. #2835

Closed hippo-dalaoshe closed 5 years ago

hippo-dalaoshe commented 5 years ago

> We've made several changes to this code with rippled 1.1.1 which I believe should address this issue. I'm closing it, but if you continue having problems, please feel free to reopen.

Excuse me, I have hit the same problem: `New quorum of 18446744073709551615 exceeds the number of trusted validators`.

The detailed info in debug.log is as follows:

2019-Jan-14 23:37:36.676261277 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 16698ms
2019-Jan-14 23:37:36.676282620 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 19580ms
2019-Jan-14 23:37:36.676443416 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 13696ms
2019-Jan-14 23:37:36.676523252 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 7694ms
2019-Jan-14 23:37:36.676569343 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 10696ms
2019-Jan-14 23:37:36.676671027 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 1693ms
2019-Jan-14 23:37:36.676684432 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 4692ms
2019-Jan-14 23:45:31.170064459 Peer:WRN [325056] onReadMessage: Connection reset by peer
2019-Jan-14 23:47:25.541382117 LoadMonitor:WRN Job: processLedgerData run: 10764ms wait: 0ms
2019-Jan-14 23:47:25.541496212 LoadMonitor:WRN Job: InboundLedger run: 0ms wait: 8367ms
2019-Jan-14 23:47:25.541525938 LoadMonitor:WRN Job: processLedgerData run: 10360ms wait: 0ms
2019-Jan-14 23:47:25.573421814 LoadMonitor:WRN Job: InboundLedger run: 32ms wait: 7866ms
2019-Jan-14 23:47:25.573432698 LoadMonitor:WRN Job: InboundLedger run: 32ms wait: 4829ms
2019-Jan-14 23:51:13.438445475 LoadMonitor:WRN Job: processLedgerData run: 1389ms wait: 0ms
2019-Jan-14 23:51:13.438548683 LoadMonitor:WRN Job: processLedgerData run: 0ms wait: 1229ms
2019-Jan-14 23:51:13.438597187 LoadMonitor:WRN Job: processLedgerData run: 1448ms wait: 0ms
2019-Jan-14 23:51:27.901913617 Peer:WRN [325098] onReadMessage: Connection reset by peer
2019-Jan-15 00:00:00.235820248 ValidatorList:WRN New quorum of 18446744073709551615 exceeds the number of trusted validators (0)
2019-Jan-15 00:00:22.243129908 NetworkOPs:WRN We are not running on the consensus ledger
2019-Jan-15 00:00:22.243207871 ValidatorList:WRN New quorum of 18446744073709551615 exceeds the number of trusted validators (0)
2019-Jan-15 00:00:22.243263837 LedgerConsensus:WRN Need consensus ledger 4059DF5BA121A44C3F19AD366E9B77A375385D8AA0892CE255E8AACDCFAAB123
2019-Jan-15 00:00:23.247307864 LedgerConsensus:WRN View of consensus changed during open status=open,  mode=wrongLedger
2019-Jan-15 00:00:23.247352930 LedgerConsensus:WRN 4059DF5BA121A44C3F19AD366E9B77A375385D8AA0892CE255E8AACDCFAAB123 to DDFBFD8C6E372C52A1787557E5236EE0444A25255D16CF900B5F4EC4B1D890FC
2019-Jan-15 00:00:23.247404692 LedgerConsensus:WRN {"accepted":true,"account_hash":"B13D1AD764F1DD3253127A3DA02C6141B9093A2A7986B8E9940CE9143C995767","close_flags":0,"close_time":600825620,"close_time_human":"2019-Jan-15 00:00:20.000000000","close_time_resolution":10,"closed":true,"hash":"DDFBFD8C6E372C52A1787557E5236EE0444A25255D16CF900B5F4EC4B1D890FC","ledger_hash":"DDFBFD8C6E372C52A1787557E5236EE0444A25255D16CF900B5F4EC4B1D890FC","ledger_index":"16090546","parent_close_time":600825601,"parent_hash":"E7FA10FF2E422A41A5B4437422029B9265C6B60491A3C6043F182CCB32D50160","seqNum":"16090546","totalCoins":"99997075910029940","total_coins":"99997075910029940","transaction_hash":"0000000000000000000000000000000000000000000000000000000000000000"}

You can see that before 2019-Jan-14 23:51:13 everything is fine, and my node can receive and persist closed ledgers normally. But at 2019-Jan-14 23:51:27 the `Peer:WRN [325098] onReadMessage: Connection reset by peer` message appears, then at 2019-Jan-15 00:00:00 it reports `New quorum of 18446744073709551615 exceeds the number of trusted validators`, and at 2019-Jan-15 00:00:23 my node hits `View of consensus changed during open status=open, mode=wrongLedger`. After that, my node can still receive closed ledgers but can no longer sync to them.
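As an aside, the suspicious quorum figure in the log is not random garbage: 18446744073709551615 is the maximum unsigned 64-bit integer, which rippled appears to report as a sentinel when the trusted-validator count is zero (note the "(0)" at the end of the warning) and no real quorum can be computed. A quick check:

```python
# That quorum value is exactly UINT64_MAX, i.e. 2**64 - 1 -- the
# sentinel rippled appears to report when it has zero trusted
# validators and therefore cannot compute a real quorum.
quorum = 18446744073709551615
assert quorum == 2**64 - 1
print(quorum == 2**64 - 1)  # True
```

So the warning is really saying "no trusted validators at all", not that an absurd quorum was genuinely negotiated.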

My node runs on Ubuntu 18.04 on the testnet with rippled 1.1.1. I have restarted the node and lost all the ledgers I had synced. Can you tell me how to avoid this problem?

Originally posted by @hippo-dalaoshe in https://github.com/ripple/rippled/issues/2611#issuecomment-456638545

nbougalis commented 5 years ago

Can you please show us the output of the server_info command (ref)?
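(For anyone landing here later: `server_info` can be queried with the rippled commandline, `rippled server_info`, or over the JSON-RPC port, 5005 by default. A minimal Python sketch of the JSON-RPC call, assuming a locally running rippled on the stock port:)

```python
import json
import urllib.request

def build_server_info_request() -> dict:
    """JSON-RPC payload for rippled's server_info command."""
    return {"method": "server_info", "params": [{}]}

def fetch_server_info(url: str = "http://127.0.0.1:5005/") -> dict:
    """POST the request to a local rippled (default JSON-RPC port 5005)."""
    data = json.dumps(build_server_info_request()).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server:
#   fetch_server_info()["result"]["info"]["server_state"]
```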

hippo-dalaoshe commented 5 years ago

> Can you please show us the output of the server_info command (ref)?

I have restarted, and things seem to be normal... I only remember that when this problem happened, server_info showed the server_state as connected, and complete_ledgers was at 14160000-16090545 while the actual latest ledger was around 16150000. I have also seen the same `New quorum of 18446744073709551615 exceeds the number of trusted validators` problem on my mainnet node, where complete_ledgers shows some ledgers missing. I think there is a consensus problem, or ledgers are being dropped, but I cannot reproduce it right now.

hippo-dalaoshe commented 5 years ago

> Can you please show us the output of the server_info command (ref)?

I think my explanation has confused you.
I have run into two errors when running the rippled server: one happens on the testnet, the other on the mainnet. Both result in missing ledgers, so my application, which depends on a contiguous ledger sequence, cannot work normally. I believe both my testnet and mainnet servers are configured correctly; the problem is in ledger processing during consensus or persistence.

I checked the debug.log of both servers; both contain the error `New quorum of 18446744073709551615 exceeds the number of trusted validators`. On the testnet, the rippled server can still receive new ledgers from peers but cannot reach consensus and persist them to the database; server_info shows the state as connected and complete_ledgers in the range 14160000-16090545, far behind the latest ledger seq of about 16150000.

On the mainnet, the rippled server seems to keep running in the full state, but complete_ledgers looks like 43185555-43301323, 43301327-4340****: there is more than one interval, and between the intervals some ledgers are missing.
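A complete_ledgers string like that can be parsed mechanically to see exactly which ledgers were lost. A small helper sketch (the 43400000 upper bound below is a stand-in for the truncated value above):

```python
def ledger_gaps(complete_ledgers: str):
    """Find missing ledger ranges in a rippled complete_ledgers string.

    Input looks like "43185555-43301323,43301327-43400000"; the gaps
    between the intervals are the ledgers the node failed to keep.
    """
    if complete_ledgers in ("empty", ""):
        return []
    ranges = []
    for part in complete_ledgers.split(","):
        lo, _, hi = part.strip().partition("-")
        ranges.append((int(lo), int(hi or lo)))
    ranges.sort()
    # A gap exists wherever one interval ends more than 1 below the next.
    return [(a_hi + 1, b_lo - 1)
            for (_, a_hi), (b_lo, _) in zip(ranges, ranges[1:])
            if b_lo > a_hi + 1]

print(ledger_gaps("43185555-43301323,43301327-43400000"))
# [(43301324, 43301326)]
```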

Now I have restarted both the testnet and mainnet servers. On the testnet, the rippled server recovered immediately, and complete_ledgers restarted from almost the latest ledger seq; it is now at 16166627-16377448.

Unfortunately, on the mainnet it has taken almost two days to restart, and it is still at connected. Its detailed server_info is below:

{
   "result" : {
      "info" : {
         "build_version" : "1.1.1",
         "closed_ledger" : {
            "age" : 5,
            "base_fee_xrp" : 1e-05,
            "hash" : "9B651C8AB97DA84D38C942E25F080B9258BCECF2675F07FCE3A0B97012C84525",
            "reserve_base_xrp" : 200,
            "reserve_inc_xrp" : 50,
            "seq" : 17635
         },
         "complete_ledgers" : "empty",
         "hostid" : "77b7488105af",
         "io_latency_ms" : 1,
         "jq_trans_overflow" : "0",
         "last_close" : {
            "converge_time_s" : 5.007,
            "proposers" : 26
         },
         "load" : {
            "job_types" : [
               {
                  "job_type" : "untrustedProposal",
                  "peak_time" : 10,
                  "per_second" : 46
               },
               {
                  "in_progress" : 2,
                  "job_type" : "ledgerData",
                  "waiting" : 65
               },
               {
                  "in_progress" : 1,
                  "job_type" : "clientCommand"
               },
               {
                  "job_type" : "transaction",
                  "peak_time" : 5,
                  "per_second" : 15
               },
               {
                  "job_type" : "batch",
                  "per_second" : 6
               },
               {
                  "job_type" : "advanceLedger",
                  "peak_time" : 12,
                  "per_second" : 11
               },
               {
                  "job_type" : "fetchTxnData",
                  "peak_time" : 2,
                  "per_second" : 8
               },
               {
                  "job_type" : "trustedValidation",
                  "peak_time" : 13,
                  "per_second" : 4
               },
               {
                  "job_type" : "writeObjects",
                  "peak_time" : 6,
                  "per_second" : 4
               },
               {
                  "job_type" : "trustedProposal",
                  "peak_time" : 2,
                  "per_second" : 11
               },
               {
                  "avg_time" : 1,
                  "job_type" : "heartbeat",
                  "peak_time" : 2
               },
               {
                  "job_type" : "peerCommand",
                  "peak_time" : 1,
                  "per_second" : 693
               },
               {
                  "job_type" : "diskAccess",
                  "peak_time" : 5,
                  "per_second" : 4
               },
               {
                  "job_type" : "processTransaction",
                  "per_second" : 7
               },
               {
                  "job_type" : "AsyncReadNode",
                  "peak_time" : 93,
                  "per_second" : 1851
               }
            ],
            "threads" : 4
         },
         "load_factor" : 1,
         "peer_disconnects" : "18",
         "peer_disconnects_resources" : "0",
         "peers" : 10,
         "pubkey_node" : "n9J5DucjxQqSJaRWFPJcP7FqTfW8jiiJoQgbQ7nCert2HUrSHwr3",
         "pubkey_validator" : "none",
         "published_ledger" : "none",
         "server_state" : "connected",
         "state_accounting" : {
            "connected" : {
               "duration_us" : "74699859850",
               "transitions" : 1
            },
            "disconnected" : {
               "duration_us" : "1312716",
               "transitions" : 1
            },
            "full" : {
               "duration_us" : "0",
               "transitions" : 0
            },
            "syncing" : {
               "duration_us" : "0",
               "transitions" : 0
            },
            "tracking" : {
               "duration_us" : "0",
               "transitions" : 0
            }
         },
         "time" : "2019-Jan-25 01:57:26.578305",
         "uptime" : 74701,
         "validation_quorum" : 21,
         "validator_list" : {
            "count" : 1,
            "expiration" : "2019-Jan-31 00:00:00.000000000",
            "status" : "active"
         }
      },
      "status" : "success"
   }
}

I think this is because the local database is dirty, and the node cannot acquire the missing ledgers from peers. I can only recover it by deleting the local database manually. My db config is below:

[node_db]
type=RocksDB
path=/var/rippled/lib/rippled/db/rocksdb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=8
file_size_mult=2
online_delete=200000
advisory_delete=1

[ledger_history]
150000

I want to keep roughly the most recent two weeks of ledgers to support my application. But these problems cause important ledger data to go missing and interrupt my application. Can you give me some help solving this?
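On sizing `[ledger_history]` for that goal: XRP Ledger close times average roughly 4 seconds (an approximation; it varies with network conditions), so two weeks of history works out to about double the 150000 configured above:

```python
# Back-of-the-envelope: how many ledgers close in two weeks,
# assuming an average close time of ~4 s (an approximation).
SECONDS_PER_TWO_WEEKS = 14 * 24 * 60 * 60  # 1,209,600 s
AVG_CLOSE_TIME_S = 4                       # assumed average

ledgers_needed = SECONDS_PER_TWO_WEEKS // AVG_CLOSE_TIME_S
print(ledgers_needed)  # 302400 -- well above ledger_history=150000
```

So if two weeks is the real requirement, a `ledger_history` (and `online_delete`) value closer to 300000 would be a better fit than 150000.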

mDuo13 commented 5 years ago

I suspect this was caused by a lack of peer connections on the test net. If you don't have enough peer connections that are on the same net as you (i.e. not "insane"), it can be hard for your server to stay synced.

If you are behind a firewall and you don't open the peer protocol port (51235 by default), then you can only rely on outbound peers, who tend to be busier and may drop you as a peer. It's also possible you may end up connected to test net peers only when you want to be on the main net, or vice versa (look for "insane" in the peers response).
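(For reference, "opening the peer protocol port" usually means both forwarding 51235 through your firewall and having it declared in rippled.cfg. A sketch of the stock stanza, assuming the default port; adjust the `ip` binding for your own setup:)

```ini
# Declare the peer port and make rippled listen on it.
[server]
port_peer

[port_peer]
port = 51235
ip = 0.0.0.0
protocol = peer
```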

So two actions you can take to reduce connectivity-related problems are:

1. Open the peer protocol port (51235 by default) through your firewall so your server can accept inbound peers.
2. Check the peers response for "insane" peers and confirm you are connected to peers on the network you intend (test net vs. main net).

Anyway, this issue hasn't been updated in a while, so I'm closing it as stale, but feel free to reopen if you are still having problems.