One in ten peer TMGetLedger requests do not return a response

donovanhide commented 10 years ago

Hi,

I'm pulling the ledger and associated transactions via the peer network. I'm issuing TMGetLedger requests with a specified LedgerSeq field and the TMLedgerInfoType set to liBase. These requests are sent to, on average, 18 peers with qualifying ledger ranges, at a combined rate of 20 requests/second. That is, roughly one ledger request per peer per second.

I'm finding that approximately 1 in 10 of these requests never receives a response. The failures are not specific to any one peer. Is this by design, with the node dropping requests that it does not want to handle, or is it a bug that I need to trace? I'm currently patching the gaps by reissuing the requests later, which works fine.
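
A minimal sketch of the "reissue later" workaround, assuming each request is keyed by its ledger sequence and that a fixed timeout decides when a request counts as unanswered. The class and method names are hypothetical and are not part of rippled or of any client library.

```python
import time


class PendingRequests:
    """Tracks outstanding ledger requests and finds the ones that timed out."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.sent_at = {}  # ledger sequence -> time the request was sent

    def mark_sent(self, ledger_seq, now=None):
        self.sent_at[ledger_seq] = time.monotonic() if now is None else now

    def mark_received(self, ledger_seq):
        # A late or duplicate response is harmless, so ignore unknown keys.
        self.sent_at.pop(ledger_seq, None)

    def take_expired(self, now=None):
        """Remove and return the sequences whose requests went unanswered."""
        now = time.monotonic() if now is None else now
        expired = sorted(seq for seq, sent in self.sent_at.items()
                         if now - sent >= self.timeout_seconds)
        for seq in expired:
            del self.sent_at[seq]
        return expired
```

Sequences returned by `take_expired` can be put back on the request queue, so a dropped request costs only the timeout rather than a permanent gap.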

Cheers!

donovanhide commented 10 years ago

Guess these lines answer my question:

https://github.com/ripple/rippled/blob/8daecb543066040f27b2868e117a4f64677aaff0/src/ripple_overlay/impl/PeerImp.h#L2263-L2267

which seems to trace back to one or more of the items in the job queue satisfying this algorithm:

https://github.com/ripple/rippled/blob/a865149c6551b198ecded8d794df5fe908942bbe/src/ripple_core/functional/LoadMonitor.cpp#L193

Not sure what the magic constant 4 represents, other than 4 times over the desired limit. Would be interesting to trace these overload moments to see if the server was definitely overloaded... Just seems a bit weird that the failure rate is constant at roughly 10%.

JoelKatz commented 10 years ago

What are you trying to accomplish exactly? Why are you making so many peers fetch so many ledgers? Whatever information you're trying to get, there's probably a sensible way to get it, and that isn't it. You're lucky any servers are replying to you at all.

donovanhide commented 10 years ago

> What are you trying to accomplish exactly?

I'm writing a thin client that operates on the peer network and listens to the activity of nodes acting on it and records the ledger, transactions and account states in an efficient manner. The peer protocol is suggested as a means of achieving this type of goal:

https://ripple.com/wiki/API_Overview

I'm doing this to open up the ledger for detailed analysis and browse-ability by the Ripple community. Currently the rippled implementation takes too long to acquire the full ledger locally (many months). The data is key to understanding trends, trading activity, payment paths and effectively deciding on market-making and arbitrage strategies. The ripplecharts data isn't granular enough for some, if not most, use cases.

> Why are you making so many peers fetch so many ledgers?

The advantage of a peer to peer network is surely to distribute the load of an operation over as many computers as possible (cf. bitcoin, bittorrent, gnutella, et al) and ensure resilience of the payloads. My code distributes the cost of downloading the ledger over as many machines as possible. It requests any single ledger from a single node. If, after a period of time, the ledger is not received it tries another node. It does not make all peers return all ledgers, if that was your assumption. It only requests ledgers that the node advertises as holding in its TMHello message and subsequent TMStatusChange messages.
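
The routing rule described here (one peer per ledger, only peers that advertise the ledger, a different peer after a timeout) could be sketched as follows. The `Peer` type, its fields, and `pick_peer` are hypothetical names used for illustration; the advertised range is assumed to have already been extracted from the peer's TMHello and TMStatusChange messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    first_seq: int  # lowest ledger sequence the peer advertises
    last_seq: int   # highest ledger sequence the peer advertises

    def holds(self, ledger_seq):
        return self.first_seq <= ledger_seq <= self.last_seq


def pick_peer(peers, ledger_seq, already_tried):
    """Return one untried peer advertising ledger_seq, or None if none is left."""
    candidates = [p for p in peers
                  if p.holds(ledger_seq) and p.name not in already_tried]
    if not candidates:
        return None
    # Spread load deterministically: different ledgers map to different peers.
    return candidates[ledger_seq % len(candidates)]
```

After a timeout, the caller adds the unresponsive peer's name to `already_tried` and calls `pick_peer` again, so no single ledger is ever requested from more than one peer at a time.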

> Whatever information you're trying to get, there's probably a sensible way to get it, and that isn't it.

Well, the ledger_data command didn't exist when I became frustrated at the time required to acquire the full ledger and the associated storage requirements of rippled, so I chose this path. Sockets are much more efficient than HTTP header parsing for every request and websockets are less amenable to connecting to many clients. Also there is no API for acquiring the full peer network to connect to, other than through the peer protocol.

> You're lucky any servers are replying to you at all.

Why? I've closely followed the rippled code and am not sending any spurious data to the network. The fetch pack method of synchronising appears inefficient to me, sending the account state deltas for every ledger. As far as I can see, all the information required to reconstruct the ledger is contained in the metadata section of the transaction leaf nodes and I'm not requesting any of the account state nodes, which is a much, much larger tree. The amount of work that a rippled node has to do to respond to my code's requests is less than another rippled node making a fetch pack request.

It could be more efficient if there were TMGetLedgerRange and TMGetTransactions messages...

I strongly believe that Ripple will benefit from alternative implementations making use of the peer protocol and those alternatives being open source as well. I fully intend to open source the majority of my code once it is functional and I am confident that it is not a toolkit for others to DDoS the network. If this means that you have to formalise the peer/sockets API, is that a bad thing? I've submitted bug reports and reported errant node behaviour as a result of my work; surely this is useful too.

I hope that explains the context and motivation of what I'm doing.

All the best!

As an addendum, to be crystal clear about the code's process:

A bitset is created of the ledger range (currently 32,570-5,6XX,XXX). All ledgers not present in the database are requested in batches of 1,000. Each request is routed to a single node with a qualifying advertised ledger range. The requests are rate-limited to n/second. If the node responds with the ledger, the transaction tree is walked using TMGetObjectsByHash for each level of the tree (not very deep compared to the account state tree, which is not requested). These requests are also rate-limited. Each request is batched to contain all the hashes from the previous inner node. For each ledger and associated transaction set successfully retrieved, the bit in the bitset is set, and the ledger is not requested again.
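
The bookkeeping part of this process (the bitset and the batching of missing ledgers) can be sketched roughly like this. It is an illustration under assumptions, not the actual client code: `LedgerBitset` and `batches` are hypothetical names, and the range in the usage note is a small made-up slice rather than the real one.

```python
class LedgerBitset:
    """One bit per ledger sequence in [first_seq, last_seq]; set means stored."""

    def __init__(self, first_seq, last_seq):
        self.first_seq = first_seq
        self.last_seq = last_seq
        size = last_seq - first_seq + 1
        self.bits = bytearray((size + 7) // 8)

    def set(self, seq):
        i = seq - self.first_seq
        self.bits[i // 8] |= 1 << (i % 8)

    def is_set(self, seq):
        i = seq - self.first_seq
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def missing(self):
        """Yield every sequence in range whose bit is still clear."""
        for seq in range(self.first_seq, self.last_seq + 1):
            if not self.is_set(seq):
                yield seq


def batches(seqs, batch_size=1000):
    """Group an iterable of sequences into lists of at most batch_size."""
    batch = []
    for seq in seqs:
        batch.append(seq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

For example, `batches(LedgerBitset(32570, 35069).missing())` yields the 2,500 missing sequences as two batches of 1,000 and one of 500. Calling `set(seq)` once a ledger and its transaction set are stored keeps that sequence out of every later batch.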

JoelKatz commented 7 years ago

If the server is not going to reply to you, it simply doesn't reply. The query logic rippled uses handles this case correctly. The peer ledger exchange logic is built on the assumption that servers are peers and are sharing work equally. It's not designed to tolerate a peer that makes other peers do all the work. You should issue such queries to your own rippled that's participating in the network, not to other people's rippleds that have no obligation to do extra work for you.

There are really only two alternatives. A server could never be allowed to refuse a request. That's an obvious non-starter. Alternatively, a server could be required to always send an active rejection of a request if it wasn't going to reply to it. That would add overhead in precisely the case where the server is trying to minimize overhead. And the intention of delaying the response is to force the querier to back off (by timing out while waiting for it). It doesn't seem to make sense to shift this work to the server that receives the query when the whole intention of the design is to put as much work on the querier as possible.
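
The thread does not say how a querier should back off once a request times out. One common approach, shown here purely as an assumption and not as rippled's behaviour, is exponential backoff per peer: each consecutive timeout doubles the wait before the next request to that peer, up to a ceiling. The function name and default values are hypothetical.

```python
def backoff_delay(timeouts_in_a_row, base_seconds=1.0, max_seconds=60.0):
    """Seconds to wait before the next request to a peer that keeps timing out."""
    if timeouts_in_a_row <= 0:
        return 0.0
    # Double the wait for each consecutive timeout, up to a ceiling.
    return min(max_seconds, base_seconds * 2 ** (timeouts_in_a_row - 1))
```

Resetting the counter to zero after any successful response lets a peer that was briefly overloaded return to the normal request rate.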

donovanhide commented 7 years ago

The peer network protocol is undocumented, so I gave up on this some years ago.

XRPLF / rippled

One in ten peer TMGetLedger requests do not return a response #311