ava-labs / hypersdk

Opinionated Framework for Building Hyper-Scalable Blockchains on Avalanche
https://hypersdk.xyz/

Support External Indexer via gRPC AcceptedSubscriber + Optional Internal Indexing #1225

Open aaronbuchwald opened 1 month ago

aaronbuchwald commented 1 month ago

The HyperSDK should enable VMs to index relevant data either in the node (for a simple setup and quickly launching a network with everything they need) or in an external indexer that subscribes to the node in order to reliably process every block.

https://github.com/ava-labs/hypersdk/pull/1143 introduces a simple interface to subscribe to all blocks accepted by the HyperSDK, and https://github.com/ava-labs/hypersdk/issues/1145 fixes a previous bug where the HyperSDK would not guarantee at-least-once delivery of accepted blocks.

The HyperSDK should make it as easy as possible for a basic data ingestion pipeline to process accepted blocks and emit the relevant static data to index (and display in an explorer).

External Subscriber

We should implement a gRPC service that provides the AcceptedSubscriber logic. This would be a sidecar that runs a server listening on a given port; the user then configures the HyperSDK with a gRPC client that dials the sidecar and pushes all accepted blocks to it.

The sidecar would export a listening address, which could then be passed into the HyperSDK APIs via config:

{
    "exportedBlockSubscribers": "localhost:9001"
}
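On the node side, a minimal sketch of what that client could look like, assuming a generated gRPC client with a ProcessBlock RPC (the service definition, message types, and field names here are hypothetical, not an existing API):

var _ AcceptedSubscriber = (*grpcSubscriber)(nil)

// grpcSubscriber forwards each accepted block to the external sidecar and
// treats a successful RPC response as the sidecar's acknowledgement.
type grpcSubscriber struct {
    client pb.ExternalSubscriberClient // hypothetical generated gRPC client
}

func (g *grpcSubscriber) Accepted(ctx context.Context, blk *chain.StatelessBlock) error {
    // Block until the sidecar responds, so the HyperSDK only advances the
    // accepted queue once the block has been delivered and acknowledged.
    _, err := g.client.ProcessBlock(ctx, &pb.BlockRequest{
        BlockData: blk.Bytes(), // serialized block
    })
    return err
}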

The external subscriber then needs to guarantee that:

1) It sends an acknowledgement back to the HyperSDK after it has processed each block (allowing the HyperSDK to clean up the block and continue processing the accepted queue)
2) Its block processing is idempotent, so that repeated deliveries of the same block are handled correctly (see the sketch below)
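A minimal sketch of the idempotency requirement, assuming the subscriber writes to a key/value store (the store interface here is hypothetical): keying writes by block height means a redelivered block overwrites an identical record instead of producing a duplicate.

type idempotentIndexer struct {
    // store is a hypothetical key/value API; any store with put-by-key
    // semantics gives the same idempotency property.
    store interface {
        Put(ctx context.Context, key string, value []byte) error
    }
}

func (i *idempotentIndexer) processBlock(ctx context.Context, blk *chain.StatelessBlock) error {
    blockJSON, err := json.Marshal(blk)
    if err != nil {
        return err
    }
    // Height-keyed writes: processing the same block twice rewrites the
    // same record, so multiple deliveries are safe.
    return i.store.Put(ctx, fmt.Sprintf("block/%d", blk.Height()), blockJSON)
}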

Standard Indexer to Optionally Serve APIs within HyperSDK

This is a small departure from the original ethos of the HyperSDK of doing the absolute minimum inside the node: instead, it proposes supporting an optional set of APIs so developers can get started as quickly as possible.

This should include at least block and transaction indexing, which can then support the basic APIs: GetTransaction, GetBlockByHeight, and GetBlock.
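A minimal sketch of that read API in Go (the interface name and exact signatures are assumptions, not a committed design):

// Indexer is a hypothetical read-only view over indexed blocks and txs.
type Indexer interface {
    GetBlock(ctx context.Context, blkID ids.ID) (*chain.StatelessBlock, error)
    GetBlockByHeight(ctx context.Context, height uint64) (*chain.StatelessBlock, error)
    GetTransaction(ctx context.Context, txID ids.ID) (*chain.Transaction, error)
}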

Transform Block to Static Data

Export a function that converts a block to the relevant JSON, including all of its transactions.

This can be as simple as:

import (
    "context"
    "encoding/json"

    "github.com/ava-labs/hypersdk/chain"
)

var _ AcceptedSubscriber = (*pipeline)(nil)

type pipeline struct {
    blockProcessor BlockProcessor
    txProcessor    TxProcessor
}

func (p *pipeline) Accepted(ctx context.Context, blk *chain.StatelessBlock) error {
    // Marshal the block to JSON and pull out the raw txs so the block and
    // its transactions can be indexed independently.
    blockJSON, err := json.Marshal(blk)
    if err != nil {
        return err
    }
    var fields map[string]json.RawMessage
    if err := json.Unmarshal(blockJSON, &fields); err != nil {
        return err
    }
    txs := fields["txs"]

    if err := p.blockProcessor.ProcessBlock(ctx, blk); err != nil {
        return err
    }

    return p.txProcessor.ProcessTxs(ctx, txs)
}

For an external indexer, this would then be wrapped with the gRPC AcceptedSubscriber server, which sends an ACK back to the HyperSDK once it has successfully indexed the block and transactions.
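A sketch of that wrapping, reusing the hypothetical ProcessBlock RPC from above (decoding the wire bytes back into a block is elided behind a hypothetical parseBlock helper):

// server exposes the pipeline as the hypothetical gRPC service; returning
// a successful response is the ACK that lets the HyperSDK advance.
type server struct {
    pb.UnimplementedExternalSubscriberServer // hypothetical generated base type
    pipeline *pipeline
}

func (s *server) ProcessBlock(ctx context.Context, req *pb.BlockRequest) (*pb.BlockResponse, error) {
    blk, err := parseBlock(req.BlockData) // hypothetical decoder for the wire format
    if err != nil {
        return nil, err
    }
    // Only ACK after the block and its txs are durably indexed; an error
    // response causes the HyperSDK to retry delivery.
    if err := s.pipeline.Accepted(ctx, blk); err != nil {
        return nil, err
    }
    return &pb.BlockResponse{}, nil
}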

aaronbuchwald commented 1 month ago

Linking a few relevant issues here:

Export Block/Tx/State Diffs to External Store

https://github.com/ava-labs/hypersdk/issues/961

I think the best way to support block/tx indexing is with this accepted subscriber pattern. Exporting state diffs would be a change from the current interface that we could optionally support if needed. Depending on the VM and use case, this may push a lot of data, so I'd prefer to export state diffs to an external store if the need arises rather than prioritizing it and changing the code to support it now.

If this is completed without exporting state diffs to an external store, we should open a new GitHub issue for state diffs as a potential future improvement.

Support S3 Archiver

https://github.com/ava-labs/hypersdk/issues/531 https://github.com/ava-labs/hypersdk/pull/697

This would be great to support. To avoid feature bloat in the HyperSDK, I'd prefer the S3 archiver to be implemented as a service external to the HyperSDK, built on the gRPC AcceptedSubscriber.
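A sketch of what that external service might look like, assuming aws-sdk-go-v2 and a height-based key scheme chosen purely for illustration:

// s3Archiver is an AcceptedSubscriber that archives each accepted block to
// S3 as JSON; height-based keys make redeliveries overwrite the same object.
type s3Archiver struct {
    client *s3.Client // github.com/aws/aws-sdk-go-v2/service/s3
    bucket string
}

func (a *s3Archiver) Accepted(ctx context.Context, blk *chain.StatelessBlock) error {
    blockJSON, err := json.Marshal(blk)
    if err != nil {
        return err
    }
    _, err = a.client.PutObject(ctx, &s3.PutObjectInput{
        Bucket: aws.String(a.bucket),
        Key:    aws.String(fmt.Sprintf("blocks/%d.json", blk.Height())),
        Body:   bytes.NewReader(blockJSON),
    })
    return err
}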

gbartolome-avax commented 1 month ago

Per a Slack conversation with @aaronbuchwald: Data Platform would archive HyperSDK payloads using a pattern similar to the one we use today when ingesting our EVM-based subnets into our Data Lake.

Chain Ingestion - On-Chain Producer

It will depend on a subscriber-based producer/consumer push pattern:

  1. A parent consumer will subscribe to HyperSDK messages and push them to a messaging stream, in this case Kafka (see the sketch below).
  2. Child consumers will subscribe to the HyperSDK Kafka topic for all payloads.
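A sketch of the parent producer side, assuming github.com/segmentio/kafka-go and an illustrative topic name:

// kafkaProducer is an AcceptedSubscriber that publishes each accepted block
// to a Kafka topic for downstream (child) consumers.
type kafkaProducer struct {
    writer *kafka.Writer
}

func newKafkaProducer(brokerAddr, topic string) *kafkaProducer {
    return &kafkaProducer{
        writer: &kafka.Writer{
            Addr:  kafka.TCP(brokerAddr),
            Topic: topic, // e.g. "hypersdk.accepted-blocks" (illustrative)
        },
    }
}

func (k *kafkaProducer) Accepted(ctx context.Context, blk *chain.StatelessBlock) error {
    blockJSON, err := json.Marshal(blk)
    if err != nil {
        return err
    }
    // Keying by height routes all deliveries of a block to one partition,
    // preserving per-height ordering for consumers.
    return k.writer.WriteMessages(ctx, kafka.Message{
        Key:   []byte(fmt.Sprintf("%d", blk.Height())),
        Value: blockJSON,
    })
}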