bcoin-org / bcoin

JavaScript bitcoin library for node.js and browsers
https://bcoin.io

CPU exhaustion with address index HTTP API queries #589

Closed braydonf closed 5 years ago

braydonf commented 6 years ago

Versions

bcoin v1.0.2

Description

HTTP Endpoints:

Using an address with thousands or millions of coins or transactions will overload these APIs: the request never responds with a result (I didn't wait to see how long it would take), and the CPU is maxed out for quite a while.

Details

Taking a look at the address index layout:

 *  T[addr-hash][hash] -> dummy (tx by address)
 *  C[addr-hash][hash][index] -> dummy (coin by address)

The txids are sorted lexicographically by txid. For single-address queries, it should be possible to introduce ?after=<txid>&limit=<number> query arguments to jump forward through the results in subsets. Note: sorting by hash may not be desirable; however, that could be handled with additional indexes, or by adding [count] or [height][index] (the block height and transaction index) to the key so results sort in that order.
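
The ?after=<txid>&limit=<number> idea can be sketched as cursor pagination over the sorted txid set. A minimal sketch, using a plain sorted array in place of the database iterator (the function name and shapes are illustrative, not bcoin internals):

```javascript
// Page through lexicographically sorted txids: seek just past the
// `after` anchor, then read at most `limit` entries. Work per request
// no longer scales with the address's total transaction count.
function pageTxids(sortedHashes, after, limit) {
  let start = 0;
  if (after) {
    const i = sortedHashes.indexOf(after);
    if (i === -1)
      throw new Error('Unknown `after` hash.');
    start = i + 1;
  }
  return sortedHashes.slice(start, start + limit);
}
```

With a real database iterator the seek would be a key range lookup rather than an indexOf, but the anchoring logic is the same.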

The exposed API only uses a single address, but the code is written to handle multiple addresses, and it is problematic to optimize subset queries for multiple addresses. A query against the address index finds matching txids, but the final order isn't known without retrieving all of the results for every address. To page through subsets, that order must be known in advance, with the ability to skip to arbitrary positions quickly. As it stands, the entire set of txids is queried each time and only a portion of them is filled in. Since this query is repeated often, it would be best if the data were already organized on disk and in memory in that order. That optimization can't be made when there is a near endless number of address combinations that isn't known ahead of time; a wallet, by contrast, is capable of making these optimizations.

A solution could be to rewrite the associated methods to handle only a single address and support subset queries for larger results, and to document that the wallet API should be used for any multiple-address query. There may also be ways to use an address index internally to speed up wallet rescans.

bucko13 commented 6 years ago

Nice! This seems useful and would also go well with the #559 tx pagination issue/PR. Ideally we can do this without breaking the existing API, but if the exposed API isn't making use of the multiple-address handling, maybe it's ok, and this fix would mostly be around how these are stored in the db. Seems related to the updates proposed in #585.

braydonf commented 6 years ago

For changes to how data is stored in the DB, it would be better to organize and sort it the way it will most often be queried, which is generally chronological order. So including a count, or height|index, as mentioned, would be an improvement there. It would be good to have a use case for it, and wallet rescans could be one.

Another option would be to deprecate these endpoints in favor of the equivalent wallet API with imported addresses. A wallet with a single address would cover that use case, the API would be the same between single- and multiple-address wallets, and it would also calculate balances and so on.

Having an address index sorted chronologically would support using the index for wallet rescans, as mentioned earlier, avoiding the need to read thousands of blocks during a rescan. This may also relate to BIP157/BIP158 and the use of block filters there, as I think they serve a similar function.

braydonf commented 5 years ago

Something also worth considering: BIP157 filters may be very useful for aiding wallet rescans, even without an address index.

This could be useful for performing a wallet rescan on a pruned node, where the filters are checked for which blocks are relevant and only those are downloaded and then eventually pruned again.

nodech commented 5 years ago

Will be nice to have general pagination for indexers. Currently indexers are similar to what TXDB was doing, but if we could provide some basic structure for pagination in indexers, I believe many other indexers could easily add endpoints that don't cause CPU exhaustion.

Like: #605 @tuxcanfly

braydonf commented 5 years ago

Yep, with a change to add a count to the db keys here, transactions will sort in the order they are queried, making it possible to jump to arbitrary positions within the larger set quickly.

Example:

 *  T[addr-hash][tx-count][hash] -> dummy (tx by address)
 *  C[addr-hash][coin-count][hash][index] -> dummy (coin by address)
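
The count-keyed layout above can be sketched as a key encoder. Writing the count big-endian makes the database's bytewise key comparison match numeric order, so entries iterate chronologically (field sizes and the helper name are illustrative, not bcoin's actual serialization):

```javascript
// Encode T[addr-hash][tx-count][hash]: 1-byte prefix, 32-byte address
// hash, 4-byte big-endian count, 32-byte txid.
function txKey(addrHash, txCount, txHash) {
  const key = Buffer.alloc(1 + 32 + 4 + 32);
  key.write('T', 0, 'ascii');
  addrHash.copy(key, 1);
  key.writeUInt32BE(txCount, 33); // big-endian: byte order == numeric order
  txHash.copy(key, 37);
  return key;
}
```

Because the count precedes the hash, a range iteration over one address's keys yields transactions in insertion order regardless of their txids.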

To support querying using an after parameter (e.g. after=txhash), an additional index can be added to find the count for a transaction for an address.

Example:

 *  X[addr-hash][hash] -> tx-count
 *  Y[addr-hash][hash][index] -> coin-count
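
Putting the two indexes together, resolving after=txhash becomes a count lookup followed by a forward walk. A minimal sketch with in-memory maps standing in for the database tables (key strings and names are illustrative):

```javascript
// countIndex: `${addr}/${hash}` -> count   (the X index above)
// txIndex:    `${addr}/${count}` -> hash   (the count-keyed T index)
function pageAfter(countIndex, txIndex, addr, afterHash, limit) {
  // Resume just past the anchor tx, or from the start of the address.
  const start = afterHash != null
    ? countIndex.get(`${addr}/${afterHash}`) + 1
    : 0;
  const out = [];
  for (let c = start; out.length < limit; c++) {
    const hash = txIndex.get(`${addr}/${c}`);
    if (hash == null)
      break; // ran off the end of this address's transactions
    out.push(hash);
  }
  return out;
}
```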

However, calculating the balance for the address would require summing all of the coins and then staying synced with all changes. It would be inefficient to periodically scan all coins to find those that have been spent in order to calculate the balance. Importing a single address (or multiple) as a watch-only wallet would provide the balance and pagination of history, which is another option.

It may also be possible to add a balance to the index specifically for single-address use, so that all coins don't need to be queried in that case.

Example:

 *  B[addr-hash] -> balance
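
Maintaining B[addr-hash] incrementally would mean crediting it when a coin is indexed and debiting it when the coin is spent, instead of ever summing the coin set. A minimal sketch with a Map standing in for the balance table (names are illustrative):

```javascript
// Credit the address balance when a new coin for it is connected.
function addCoin(balances, addrHash, value) {
  balances.set(addrHash, (balances.get(addrHash) || 0) + value);
}

// Debit when one of its coins is spent; a reorg would apply the
// inverse operations when disconnecting a block.
function spendCoin(balances, addrHash, value) {
  balances.set(addrHash, (balances.get(addrHash) || 0) - value);
}
```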

Another option is to query for deltas of coins across ranges of blocks, so that syncing clients would only ever need to query for the history they don't already have. The query would include a summary of coins spent and coins newly available. However, there may be downsides to how reorgs are handled on the client in this case.

Would need to dig into the code more to see what would be the best option.

nodech commented 5 years ago

Is after=txhash useful? I would argue block height or time is a better indexing option than txhash. Yeah, we can have a balance (or an indexer can implement it that way) -- we would just wrap this in two methods that handle the balance calculation and the additional database writes that are necessary.

braydonf commented 5 years ago

Using block height can be problematic because a single block can contain more matching transactions than the page size. It could serve as an entry point though, similar to how time is handled in #605, where it points to the count and uses that to iterate, with the after parameter (a txid) anchoring each page. However, that would be provided in #605 with a watch-only wallet (I still need to look into that code more to verify).

nodech commented 5 years ago

I think there's value in having tx and address indexers, but they should provide the same pagination API as the wallet does, like you described (at least for the address indexer). Having to import all the coins I'm interested in watching sometimes doesn't make sense, and you also have to reindex if you want to recover all the history.

braydonf commented 5 years ago

Wallet rescanning/resetting/reindexing to recover the history could be accelerated using an address index (so there may still be a use case there), so as far as the speed of importing goes, there wouldn't be a difference. So I think it comes down to use cases for the API; here are some examples:

Creating and getting history for a watch-only wallet for a single address with many txs:

PUT /wallet/:id
POST /wallet/:id/import
GET /wallet/:id/balance
GET /wallet/:id/tx/history

For multiple addresses, all with many txs:

PUT /wallet/:id
POST /wallet/:id/import
POST /wallet/:id/import
POST /wallet/:id/import
GET /wallet/:id/balance
GET /wallet/:id/tx/history

An HD wallet with many txs:

PUT /wallet/:id
GET /wallet/:id/balance
GET /wallet/:id/tx/history

All of these examples could work for full, pruned and SPV nodes. Applications can then query the API the same way regardless of how much historical data is stored. The speed at which history is recovered would vary:

For full archival nodes, there could be automation that would automatically add all addresses to a wallet, for example the API could be:

GET /wallet/*/tx/address/:address

That would be the equivalent of querying a full archive node with:

GET /tx/address/:address

nodech commented 5 years ago

Sure, it can work :) I was talking about the addr indexer -- instead of removing it, we can fix the CPU exhaustion.

braydonf commented 5 years ago

Yeah, I agree. There is a use case, even without the /tx/address/:address endpoint, and CPU exhaustion would still be an issue in those cases.

pinheadmz commented 5 years ago

@braydonf is this addressed in #605 ?

braydonf commented 5 years ago

@pinheadmz It's not addressed in #605, this is specifically for the node API w/ address index, unrelated to the wallet.

braydonf commented 5 years ago

So I think there is a case for the address index for transactions; however, having looked into this more, I think it is also worth considering removing the address index of coins.

For a wallet building on the address index, all that is necessary are the transactions; those can then be used to generate a coins database. For example, this is how bwallet functions. This is preferable because only the latest transactions are necessary. With coins, a coin can be removed from the result set at any point, so a wallet needs to request all coins every time to know which coins are no longer available.
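
Generating a coins database from the transactions alone can be sketched as a fold over the tx history: spends delete entries, outputs create them. A minimal sketch with illustrative tx object shapes (not bcoin's TX class):

```javascript
// Rebuild the UTXO set for an address from its ordered tx history.
// Each tx: { txid, inputs: [{prevTxid, prevIndex}], outputs: [...] }.
function coinsFromTxs(txs) {
  const coins = new Map(); // 'txid:index' -> output
  for (const tx of txs) {
    // Spends remove previously created coins...
    for (const input of tx.inputs)
      coins.delete(`${input.prevTxid}:${input.prevIndex}`);
    // ...and outputs create new ones.
    tx.outputs.forEach((output, i) => {
      coins.set(`${tx.txid}:${i}`, output);
    });
  }
  return coins;
}
```

This is why a tx-only index suffices: new transactions only append to the history, whereas a coin index must be re-read in full to discover deletions.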