Introduce runners - Githubissues

fracek commented 1 year ago

Is your feature request related to a problem? Please describe.

As seen in #226 and #227, we need a way to dynamically spin up indexers. This system needs to be robust enough for production usage, and support different deployment models (Kubernetes, Docker, bare metal).

Describe the solution you'd like

The idea is to introduce schedulers. A scheduler is a server (gRPC for now, + REST later) that exposes a CRUD interface for indexers, and schedules them to run in the background. The scheduler is also used to show indexing status (see #229) and logs.

Notice that the CreateIndexer operation is idempotent: if an indexer with the same indexer definition.id already exists, it will return the existing indexer without creating a new one.

See the message below for a first draft of the implementation.

The idea is to include at least two implementations of the scheduler:

One with no external dependencies, used for development. It will be less robust, but good enough for development.
One based on Kubernetes that, combined with the operator, gives us a production-ready setup.

After we have this, implementing #226 and #227 becomes easy:

#226: for each indexer in the config, call CreateIndexer. No need to check if the indexer exists since operations are idempotent. apibara down will call DeleteIndexer.
#227: the factory simply calls CreateIndexer with the returned value. This becomes a sink like any other.
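
The `apibara up` / `apibara down` flow described above can be sketched roughly as follows. This is only an illustration: `FakeRunner` is a hypothetical in-memory stand-in for the runner API, and the config shape (a mapping from indexer id to indexer definition) is an assumption, not the real configuration format.

```python
class FakeRunner:
    """Hypothetical in-memory stand-in for the runner API."""

    def __init__(self):
        self.indexers = {}

    def create_indexer(self, indexer_id, indexer):
        # Idempotent: an existing indexer is returned unchanged.
        name = f"indexers/{indexer_id}"
        if name not in self.indexers:
            self.indexers[name] = {**indexer, "name": name}
        return self.indexers[name]

    def delete_indexer(self, name):
        # Deleting an unknown indexer is treated as a no-op in this sketch.
        self.indexers.pop(name, None)


def up(runner, config):
    # No existence check: create_indexer is idempotent, so `up` can be re-run.
    return [runner.create_indexer(indexer_id, spec) for indexer_id, spec in config.items()]


def down(runner, config):
    for indexer_id in config:
        runner.delete_indexer(f"indexers/{indexer_id}")
```

Running `up` twice with the same config leaves the runner with the same set of indexers, which is what makes the "no need to check" shortcut safe.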

Additional context

The idea to use an API to schedule factory indexers comes from a telegram chat with @bigherc18

bigherc18 commented 1 year ago

Overall it looks good to me, I have some remarks though

I wouldn't call it scheduler; I'd prefer something like manager. A scheduler, as its name indicates, is expected to schedule tasks/processes in the future, like cron jobs, on a schedule ...
Why would you go with RPC first? This is a basic CRUD server, so I'd say it's better to use REST in this case. It'd be easier for us and other people to write plugins, tests ...

fracek commented 1 year ago

Agree, I think "runner" is a better name for this component. Let's use that.

I like gRPC because it's self documenting, it forces us to version the API following best practices, and we can generate clients automatically.

fracek commented 1 year ago

I'm going to sketch out the runner gRPC Service so that we can start working on an implementation. We follow Google's AIP guidelines when possible since they're well thought out, but we're not too strict.

Indexer resource

Terminology:

Resource Type: Indexer.
Collection Identifier: indexers.
Resource Id: the identifier specified by the user when creating the resource, e.g. my-indexer.
Resource Name: the combination of the collection identifier and resource id, e.g. indexers/my-indexer.
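
To make the terminology concrete, here is a small sketch of converting between resource ids and resource names. The helper names are hypothetical, and the validation rule (exactly one `/`, non-empty id) is an assumption rather than something the proposal specifies.

```python
COLLECTION = "indexers"


def resource_name(indexer_id: str) -> str:
    """Build the resource name from a resource id."""
    return f"{COLLECTION}/{indexer_id}"


def resource_id(name: str) -> str:
    """Extract the resource id, rejecting names outside the collection."""
    collection, sep, indexer_id = name.partition("/")
    if collection != COLLECTION or not sep or not indexer_id or "/" in indexer_id:
        raise ValueError(f"invalid indexer name: {name!r}")
    return indexer_id
```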

message Indexer {
    message Status {
    }

    // Resource name, e.g. `indexers/my-indexer`.
    string name = 1;

    // Additional labels attached to the indexer.
    // Useful to attach application-specific metadata.
    map<string, string> labels = 2;
}

Open Questions: will edit later.

How to specify the indexer script? In a cloud environment, the indexer must be downloaded in the container before running.
How to model parent-child indexers? When we delete an indexer, we need to delete all of its children as well.

Operations

Create

This method creates a new indexer if one with the same indexer_id doesn't exist. If it already exists, it simply returns the existing indexer.

Notice that if the client provides a value for fields that are set server-side (like name or status), they are simply ignored.

service IndexerRunner {
    rpc CreateIndexer(CreateIndexerRequest) returns (Indexer);
}

message CreateIndexerRequest {
    // Indexer id, e.g. `my-indexer`.
    string indexer_id = 1;
    Indexer indexer = 2;
}
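
The Create semantics above (idempotent on `indexer_id`, server-set fields ignored) could look roughly like this on the server side. This is a sketch under assumptions: `InMemoryRunner` and the Python `Indexer` dataclass are hypothetical stand-ins for the generated gRPC types, and only `name` is modeled as a server-set field.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Indexer:
    name: str = ""
    labels: dict = field(default_factory=dict)


class InMemoryRunner:
    def __init__(self):
        self._indexers = {}

    def create_indexer(self, indexer_id: str, indexer: Indexer) -> Indexer:
        name = f"indexers/{indexer_id}"
        existing = self._indexers.get(name)
        if existing is not None:
            # Idempotent: return the existing indexer without creating a new one.
            return existing
        # `name` is set server-side, so any client-provided value is overwritten.
        created = replace(indexer, name=name)
        self._indexers[name] = created
        return created
```

Note that in this sketch a second call with a different definition still returns the first indexer unchanged; whether that should instead be an error is left open by the proposal.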

Delete

This method deletes the indexer. If persistence for the indexers is configured, this method must also clear the indexer state from it.

service IndexerRunner {
    rpc DeleteIndexer(DeleteIndexerRequest) returns (google.protobuf.Empty);
}

message DeleteIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

Get

This method simply gets an indexer by its name.

service IndexerRunner {
    rpc GetIndexer(GetIndexerRequest) returns (Indexer);
}

message GetIndexerRequest {
    // Indexer name, e.g. `indexers/my-indexer`.
    string name = 1;
}

List

List all indexers according to some criteria. Returns a paginated list of indexers.

We implement filtering based on AIP 160. The idea is to use a string as filter to allow us to change filtering easily without breaking changes.

Since filtering is complex, we skip it at first.

service IndexerRunner {
    rpc ListIndexers(ListIndexersRequest) returns (ListIndexersResponse);
}

message ListIndexersRequest {
    // Number of indexers per page.
    int32 page_size = 1;
    // Continuation token.
    string page_token = 2;
    // Filter indexers.
    string filter = 3;
}

message ListIndexersResponse {
    repeated Indexer indexers = 1;
    string next_page_token = 2;
}
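
A minimal sketch of how `page_size`, `page_token`, and `next_page_token` could interact, with filtering skipped as proposed. The token encoding (a plain offset as a string) and the default page size are assumptions made for illustration; a real implementation would likely use an opaque token.

```python
DEFAULT_PAGE_SIZE = 50  # assumed default when page_size is unset (0)


def list_indexers(indexers, page_size=0, page_token=""):
    """Return (page, next_page_token) over indexers sorted by name."""
    ordered = sorted(indexers, key=lambda i: i["name"])
    size = page_size if page_size > 0 else DEFAULT_PAGE_SIZE
    start = int(page_token) if page_token else 0
    page = ordered[start:start + size]
    end = start + len(page)
    # An empty next_page_token signals that there are no more pages.
    next_token = str(end) if end < len(ordered) else ""
    return page, next_token
```

A client keeps calling with the returned token until it comes back empty.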

Stream Logs

This method returns a stream of logs for the indexer. It is an infinite stream of data since we expect the indexer to keep producing logs.

service IndexerRunner {
    rpc StreamLogs(StreamLogsRequest) returns (stream StreamLogsResponse);
}

enum LogLevel {
    LOG_LEVEL_UNKNOWN = 0;
    LOG_LEVEL_TRACE = 1;
    LOG_LEVEL_DEBUG = 2;
    LOG_LEVEL_INFO = 3;
    LOG_LEVEL_WARNING = 4;
    LOG_LEVEL_ERROR = 5;
}

message StreamLogsRequest {
    // The name of the indexer, e.g. `indexers/my-indexer`.
    string parent = 1;
    LogLevel level = 2;
}

message StreamLogsResponse {
    LogLevel level = 1;
    string content = 2;
}
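
The proposal does not say how `level` in `StreamLogsRequest` is interpreted. The sketch below assumes it is a minimum severity, and that `LOG_LEVEL_UNKNOWN` (the default) means no filtering. The Python enum mirrors the proto enum; `stream_logs` is a hypothetical helper, not part of the service definition.

```python
from enum import IntEnum
from typing import Iterable, Iterator, Tuple


class LogLevel(IntEnum):
    UNKNOWN = 0
    TRACE = 1
    DEBUG = 2
    INFO = 3
    WARNING = 4
    ERROR = 5


def stream_logs(
    entries: Iterable[Tuple[LogLevel, str]],
    level: LogLevel = LogLevel.UNKNOWN,
) -> Iterator[Tuple[LogLevel, str]]:
    """Lazily yield log entries at or above the requested level."""
    for entry_level, content in entries:
        if entry_level >= level:
            yield entry_level, content
```

Because the function is a generator, it works unchanged when `entries` is an unbounded source, which matches the infinite-stream behavior described above.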

Indexers persistence

The runner is responsible for setting up the indexers persistence. This is for several reasons:

developers want to configure this once and forget about it.
delete operations must clear the indexer state, so the runner must know about persistence anyway.

Runner persistence

In some cases, the runner needs to keep track of the indexers it created. I believe it would be easier if it can work with the same persistence as the sinks.

--persist-to-fs: stores data in the same folder as indexers. To keep it simple, it dumps the Indexer object sent by CreateIndexer as json to a file named <indexer-id>.indexer.
--persist-to-etcd: stores the content of the Indexer object sent by CreateIndexer to the database. The key should be something like indexers:<indexer-id> so that ListIndexers simply scans through this key prefix.

Some runners (like the one based on Kubernetes) can use other persistence mechanisms.
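
As an illustration of the `--persist-to-fs` option, the following sketch dumps each indexer as JSON to `<indexer-id>.indexer` and lists indexers by scanning the folder. The function names and the dict-based indexer representation are assumptions; error handling and atomic writes are left out.

```python
import json
from pathlib import Path


def save_indexer(directory: Path, indexer_id: str, indexer: dict) -> Path:
    """Write the indexer definition to <indexer-id>.indexer as JSON."""
    path = directory / f"{indexer_id}.indexer"
    path.write_text(json.dumps(indexer, sort_keys=True))
    return path


def list_saved_indexers(directory: Path) -> list:
    """Load every persisted indexer, ordered by file name."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.indexer"))]


def delete_saved_indexer(directory: Path, indexer_id: str) -> None:
    """Remove the persisted definition; a missing file is not an error."""
    (directory / f"{indexer_id}.indexer").unlink(missing_ok=True)
```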

Other considerations

This service does not deal with authentication or authorization. Deployments that want to deal with it must create a facade service that adds authentication/authorization to this service.

fracek commented 1 year ago

Re: how to specify indexer script.

We add two new properties to the indexer:

project_source: this is the location of the project. Can be a directory (file:///path/to/dir) or a github url (github:fracek/my-indexer).
project_dir: the subfolder (if any) that contains the indexer script.

The indexer path is then computed as ${project_dir}/${script}, relative to the root of the project source.
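
A small sketch of that path computation, assuming POSIX-style separators. The check that the result stays inside the project source is an extra safeguard added here as an assumption, not something the proposal requires.

```python
import posixpath


def indexer_path(project_dir: str, script: str) -> str:
    """Path of the script relative to the root of project_source."""
    path = posixpath.normpath(posixpath.join(project_dir, script))
    # Assumed safeguard: reject paths that escape the project source.
    if posixpath.isabs(path) or path == ".." or path.startswith("../"):
        raise ValueError("script must stay inside the project source")
    return path
```

With an empty `project_dir` (the default for both `apibara up` and the factory), the result is just the script path.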

In practice

apibara up

Creates new indexers as defined in the configuration. project_source is set to the path of the folder containing the config file, and project_dir is empty.

indexer factory

By default, project_source is the current directory and project_dir is empty.

in both cases

When they call the API to create an indexer, they forward the current project_source/project_dir. Ideally, we let users override source and dir for any indexer (so that they can deploy from a third-party repository).

fracek commented 1 year ago

re: delete an indexer and its children

The easiest solution is to add a spawned_by property to the indexer, with the name/id of the indexer that spawned the current indexer.

On delete, the runner goes through all indexers where spawned_by is the current indexer and deletes them (recursively).

Note that we cannot use the name parent because, according to AIP, it means something different.
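
The cascade on delete could be sketched as below. Indexers are modeled as a plain dict keyed by name, which is an assumption for illustration; the traversal uses an explicit stack instead of recursion but visits the same set of descendants.

```python
def delete_indexer(indexers: dict, name: str) -> list:
    """Delete `name` and everything it spawned; return the deleted names."""
    deleted = []
    stack = [name]
    while stack:
        current = stack.pop()
        if current not in indexers:
            continue
        del indexers[current]
        deleted.append(current)
        # Queue every indexer spawned by the one just deleted.
        stack.extend(n for n, ix in indexers.items() if ix.get("spawned_by") == current)
    return deleted
```

This scans all indexers once per deleted node, which is fine for small deployments; a runner with many indexers would probably want an index on `spawned_by`.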

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the past six months. It will be closed in 2 weeks if no further activity occurs. Please feel free to leave a comment if you believe the issue is still relevant.

apibara / dna

Introduce runners #230