apibara / dna

Apibara is the fastest platform to build production-grade indexers that connect onchain data to web2 services.
https://www.apibara.com/
Apache License 2.0

Introduce runners #230

Open fracek opened 1 year ago

fracek commented 1 year ago

Is your feature request related to a problem? Please describe.

As seen in #226 and #227, we need a way to dynamically spin up indexers. This system needs to be robust enough for production usage and support different deployment models (Kubernetes, Docker, bare metal).

Describe the solution you'd like

The idea is to introduce schedulers. A scheduler is a server (gRPC for now, + REST later) that exposes a CRUD interface for indexers, and schedules them to run in the background. The scheduler is also used to show indexing status (see #229) and logs.

Notice that the CreateIndexer operation is idempotent: if an indexer with the same indexer definition.id already exists, it will return the existing indexer without creating a new one.

See the message below for a first draft of the implementation.

The idea is to include at least two implementations of the scheduler:

After we have this, implementing #226 and #227 becomes easy:

Additional context

The idea to use an API to schedule factory indexers comes from a Telegram chat with @bigherc18.

bigherc18 commented 1 year ago

Overall it looks good to me, I have some remarks though

fracek commented 1 year ago

Agree, I think "runner" is a better name for this component. Let's use that.

I like gRPC because it's self documenting, it forces us to version the API following best practices, and we can generate clients automatically.

fracek commented 1 year ago

I'm going to sketch out the runner gRPC Service so that we can start working on an implementation. We follow Google's AIP guidelines when possible since they're well thought out, but we're not too strict.

Indexer resource

Terminology:

message Indexer {
    message Status {
    }

    // Resource name, e.g. `indexers/my-indexer`.
    string name = 1;

    // Additional labels attached to the indexer.
    // Useful to attach application-specific metadata.
    map<string, string> labels = 2;
}

Open Questions: will edit later.

Operations

Create

This method creates a new indexer if one with the same indexer_id doesn't exist. If it already exists, it simply returns the existing indexer.

Notice that if the client provides a value for fields that are set server-side (like name or status), they are simply ignored.

service IndexerRunner {
    rpc CreateIndexer(CreateIndexerRequest) returns (Indexer);
}

message CreateIndexerRequest {
    // Indexer id, e.g. `my-indexer`.
    string indexer_id = 1;
    Indexer indexer = 2;
}
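The idempotent behaviour described above can be sketched in a few lines. This is only an illustration, not the runner implementation: `InMemoryRunner` and its dict-based storage are hypothetical, and the `Indexer` dataclass mirrors just the `name` and `labels` fields from the draft message.

```python
from dataclasses import dataclass, field


@dataclass
class Indexer:
    # Mirrors the draft `Indexer` message; `name` is set server-side.
    name: str = ""
    labels: dict[str, str] = field(default_factory=dict)


class InMemoryRunner:
    """Hypothetical runner that keeps indexers in a dict keyed by resource name."""

    def __init__(self) -> None:
        self._indexers: dict[str, Indexer] = {}

    def create_indexer(self, indexer_id: str, indexer: Indexer) -> Indexer:
        name = f"indexers/{indexer_id}"
        existing = self._indexers.get(name)
        if existing is not None:
            # Idempotent: return the stored indexer unchanged.
            return existing
        # Ignore any client-provided `name`; the server derives it from the id.
        created = Indexer(name=name, labels=dict(indexer.labels))
        self._indexers[name] = created
        return created
```

Calling `create_indexer` twice with the same `indexer_id` returns the indexer stored by the first call, even if the second request carries different labels.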

Delete

This method deletes the indexer. If persistence for the indexers is configured, this method must also clear the indexer state from it.

service IndexerRunner {
    rpc DeleteIndexer(DeleteIndexerRequest) returns (google.protobuf.Empty);
}

message DeleteIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

Get

This method simply gets an indexer by its name.

service IndexerRunner {
    rpc GetIndexer(GetIndexerRequest) returns (Indexer);
}

message GetIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

List

List all indexers according to some criteria. Returns a paginated list of indexers.

We implement filtering based on AIP 160. The filter is a plain string, so we can change how filtering works later without making breaking changes to the API.

Since filtering is complex, we skip it at first.

service IndexerRunner {
    rpc ListIndexers(ListIndexersRequest) returns (ListIndexersResponse);
}

message ListIndexersRequest {
    // Number of indexers per page.
    int32 page_size = 1;
    // Continuation token.
    string page_token = 2;
    // Filter indexers.
    string filter = 3;
}

message ListIndexersResponse {
    repeated Indexer indexers = 1;
    string next_page_token = 2;
}
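To make the `page_size`/`page_token` contract concrete, here is a minimal sketch of one way a runner could paginate. The token encoding (a stringified offset), the default page size of 50, and the `list_indexers` helper itself are assumptions for illustration; the draft does not specify any of them, and filtering is left out as noted above.

```python
def list_indexers(names, page_size=0, page_token=""):
    """Return one page of indexer names plus the token for the next page."""
    ordered = sorted(names)
    # Assumption: the token is the offset of the next item, encoded as a string.
    start = int(page_token) if page_token else 0
    # Assumption: a non-positive page_size falls back to a default of 50.
    size = page_size if page_size > 0 else 50
    page = ordered[start:start + size]
    end = start + len(page)
    # An empty token tells the client there are no more pages.
    next_page_token = str(end) if end < len(ordered) else ""
    return page, next_page_token
```

A client keeps calling with the returned `next_page_token` until it comes back empty.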

Stream Logs

This method returns a stream of logs for the indexer. It is an infinite stream of data since we expect the indexer to keep producing logs.

service IndexerRunner {
    rpc StreamLogs(StreamLogsRequest) returns (stream StreamLogsResponse);
}

enum LogLevel {
    LOG_LEVEL_UNKNOWN = 0;
    LOG_LEVEL_TRACE = 1;
    LOG_LEVEL_DEBUG = 2;
    LOG_LEVEL_INFO = 3;
    LOG_LEVEL_WARNING = 4;
    LOG_LEVEL_ERROR = 5;
}

message StreamLogsRequest {
    // The name of the indexer, e.g. `indexers/my-indexer`.
    string parent = 1;
    LogLevel level = 2;
}

message StreamLogsResponse {
    LogLevel level = 1;
    string content = 2;
}
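The draft does not say how `level` in `StreamLogsRequest` is interpreted. One plausible reading, sketched below, is that it is a minimum severity: the stream only carries entries at or above that level. The `stream_logs` generator is hypothetical and takes an in-memory iterable in place of a real log source.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    # Same numbering as the draft `LogLevel` enum.
    UNKNOWN = 0
    TRACE = 1
    DEBUG = 2
    INFO = 3
    WARNING = 4
    ERROR = 5


def stream_logs(entries, level=LogLevel.UNKNOWN):
    """Yield (level, content) pairs whose level is at or above `level`.

    Assumption: `level` is a minimum severity; UNKNOWN (0) lets everything through.
    """
    for entry_level, content in entries:
        if entry_level >= level:
            yield entry_level, content
```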

Indexers persistence

The runner is responsible for setting up the indexers' persistence. This is for several reasons:

Runner persistence

In some cases, the runner needs to keep track of the indexers it created. I believe it would be easier if it can work with the same persistence as the sinks.

Some runners (like the one based on Kubernetes) can use other persistence mechanisms.

Other considerations

This service does not deal with authentication or authorization. Deployments that want to deal with it must create a facade service that adds authentication/authorization to this service.

fracek commented 1 year ago

Re: how to specify indexer script.

We add two new properties to the indexer:

The indexer path is then computed as ${project_dir}/${script}, relative to the root of the project source.
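As a rough sketch of that path computation, assuming `project_source` is a local directory (how a remote source such as a repository would be resolved is not covered here) and using a hypothetical helper name:

```python
from pathlib import PurePosixPath


def indexer_script_path(project_source, project_dir, script):
    """Resolve ${project_dir}/${script} against the root of the project source."""
    # An empty project_dir contributes no path segment.
    return PurePosixPath(project_source) / project_dir / script
```

With an empty `project_dir`, the script is looked up directly under the project source root.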

In practice

apibara up

Creates new indexers as defined in the configuration. project_source is set to the path of the folder containing the config file, and project_dir is empty.

indexer factory

By default, project_source is the current directory and project_dir is empty.

In both cases

When they call the API to create an indexer, they forward the current project_source/project_dir. Ideally, we let users override source and dir for any indexer (so that they can deploy from a third-party repository).

fracek commented 1 year ago

re: delete an indexer and its children

The easiest solution is to add a spawned_by property to the indexer, with the name/id of the indexer that spawned the current indexer.

On delete, the runner goes through all indexers where spawned_by is the current indexer and deletes them (recursively).

Note that we cannot use the name parent because, according to AIP, it means something different.
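The recursive delete could look roughly like the sketch below. It is only an illustration: the `indexers` dict (resource name to `spawned_by`) stands in for whatever storage the runner uses, and it assumes there are no cycles in the `spawned_by` relation.

```python
def delete_indexer(indexers, name):
    """Delete `name` and, recursively, every indexer it spawned.

    `indexers` maps resource name -> `spawned_by` (None for root indexers).
    Returns the deleted names, children before their parent.
    """
    deleted = []
    # Collect children first so we never mutate the dict while iterating it.
    children = [n for n, spawned_by in indexers.items() if spawned_by == name]
    for child in children:
        deleted.extend(delete_indexer(indexers, child))
    if name in indexers:
        del indexers[name]
        deleted.append(name)
    return deleted
```

Deleting children before the parent means a failure halfway through leaves the parent in place, so the delete can be retried.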

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the six months. It will be closed in 2 weeks if no further activity occurs. Please feel free to leave a comment if you believe the issue is still relevant.