dotnet / aspire

Tools, templates, and packages to accelerate building observable, production-ready apps
https://learn.microsoft.com/dotnet/aspire
MIT License

Decouple the dashboard from the Aspire.Hosting during development #1003

Closed davidfowl closed 9 months ago

davidfowl commented 11 months ago

We want to treat the dashboard like it's just another orchestrated process in your application.

We want to do 2 things to move this forward:

  1. Make the dashboard run as an orchestrated resource. It's no longer special code running in process in the app host, it's out of process.
  2. Define an HTTP API that the dashboard talks to in order to get information. Aspire.Hosting will expose this API in development.

To make 1. work, we would make a dotnet tools package for the dashboard. The app model would define a new "tools package resource" that would install the dotnet tool and then run it as an orchestrated process. This would mean the dashboard and its dependencies would be fully isolated and it would run like a tool (this has other benefits as well). It would be configured with the HTTP endpoint to get information from (the running app host itself). It also means we would update the templates to have an AddAspireDashboard call on the IDistributedApplicationBuilder. No call, no dashboard.

For the second, we need to define an HTTP API to enable the dashboard to get data about the model. We also need to take versioning into consideration. Right now, we're using the k8s API talking to DCP directly, and augmenting it with additional data from the DistributedApplicationModel. The API is a contract for the dashboard, not DCP or Aspire.Hosting, and it should be modeled as such.

niltor commented 11 months ago

I'm using otel/loki/tempo/grafana to summarize and view telemetry information.

After the dashboard is detached, can I remove my existing scheme and send the telemetry data directly to the dashboard through some extension method?

In addition, if it exists as a dotnet tool and is defined as a tools package resource, then when publishing, you need to consider how to control the publishing behavior of this type of resource.

davidfowl commented 11 months ago

> I'm using otel/loki/tempo/grafana to summarize and view telemetry information.

I think you want to keep doing that unless there's a reason not to.

> After the dashboard is detached, can I remove my existing scheme and send the telemetry data directly to the dashboard through some extension method?

I'd suggest sending it to both places.

> In addition, if it exists as a dotnet tool and is defined as a tools package resource, then when publishing, you need to consider how to control the publishing behavior of this type of resource.

Yes, I have another issue in progress about deploying the dashboard.

josephaw1022 commented 11 months ago

Are there Playwright tests in place for the dashboard UI?

davidfowl commented 11 months ago

No, there aren't. Test coverage for the dashboard is manual at this point.

smitpatel commented 11 months ago

In the decoupled world, the dashboard will run out of process from the AppHost. The dashboard will act as a client/receiver which will receive data about the resources in the Aspire application and display them in the dashboard UI (including OTel data). Since the dashboard also updates resource data in real time, it will need resource data to be pushed to it once the connection is established, rather than relying on a polling mechanism. The initial plan is to explore gRPC as the option through which the dashboard talks to the "server" and receives data about resources. The "server" is supposed to provide data to the dashboard about all the resources. In a local development environment it will be an endpoint exposed by the Hosting project. This will also allow the dashboard to be integrated with anything that can provide this information, e.g. in a deployment environment. The communication contract between the dashboard and the server will be responsible for providing all basic details about resources (including custom resources), their current execution status, and the logs associated with them. While the server may not provide data regarding OTel configuration, it will provide endpoints to get that data from the individual apps (similar to the current setup).
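
The decoupling described here amounts to the dashboard depending only on a contract, not on whoever implements it. A minimal Python sketch of that idea follows. Every name in it (`ResourceServer`, `AppHostServer`, `ClusterServer`, `render`) is hypothetical and is not part of Aspire; the updates are plain dicts rather than real gRPC messages.

```python
from typing import Dict, Iterator, Protocol


class ResourceServer(Protocol):
    """Hypothetical contract: anything that can push resource updates."""

    def watch_resources(self) -> Iterator[dict]:
        ...


class AppHostServer:
    """Stand-in for the endpoint the hosting project exposes in development."""

    def watch_resources(self) -> Iterator[dict]:
        yield {"name": "api", "state": "Starting"}
        yield {"name": "api", "state": "Running"}


class ClusterServer:
    """Stand-in for some other provider, e.g. in a deployment environment."""

    def watch_resources(self) -> Iterator[dict]:
        yield {"name": "api", "state": "Running"}
        yield {"name": "redis", "state": "Running"}


def render(server: ResourceServer) -> Dict[str, str]:
    """Dashboard side: consume pushed updates, keep the latest state per resource."""
    latest: Dict[str, str] = {}
    for update in server.watch_resources():
        latest[update["name"]] = update["state"]
    return latest
```

Because `render` only relies on `watch_resources`, swapping `AppHostServer` for `ClusterServer` needs no change on the dashboard side, which is the property the comment is after.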

niltor commented 11 months ago

> In the decoupled world, the dashboard will run out of process from the AppHost. The dashboard will act as a client/receiver which will receive data about the resources in the Aspire application and display them in the dashboard UI (including OTel data). Since the dashboard also updates resource data in real time, it will need resource data to be pushed to it once the connection is established, rather than relying on a polling mechanism. The initial plan is to explore gRPC as the option through which the dashboard talks to the "server" and receives data about resources. The "server" is supposed to provide data to the dashboard about all the resources. In a local development environment it will be an endpoint exposed by the Hosting project. This will also allow the dashboard to be integrated with anything that can provide this information, e.g. in a deployment environment. The communication contract between the dashboard and the server will be responsible for providing all basic details about resources (including custom resources), their current execution status, and the logs associated with them. While the server may not provide data regarding OTel configuration, it will provide endpoints to get that data from the individual apps (similar to the current setup).

It sounds good! I hope I can use it freely, because monitoring is a common requirement for programs. Currently, solutions like OTLP/loki/tempo/grafana or ELK can be relatively cumbersome to use and deploy.

When OpenTelemetry is used as the infrastructure, it should make everything easier. Many times, I don't need Grafana's ability to connect to various data sources. I just need the ability to monitor my applications and services.

With this approach, even if I'm not using the Aspire workload, I can still integrate into the Dashboard in my existing programs (through a library, a line of code, and configuring the connection string). I will have no burden and will be more willing to use it.

For production environments, additional requirements are basic role authorization verification and user management capabilities.

davidfowl commented 11 months ago

> For production environments, additional requirements are basic role authorization verification and user management capabilities.

This is one thing we don’t have plans to solve yet, nor do we plan long-term storage of telemetry information.

smitpatel commented 11 months ago

Given that the dashboard is just a display rather than a store of any information, adding account management just to control access would be unnecessary. For local development, which is the scope of this issue, it doesn't matter, since the same user account that is running the app host can see the details. For deployment scenarios, we should probably just utilize the access control over other resources: if you have access to see your deployed services, then your dashboard will have data about them. Perhaps the API contract would take an extra item regarding authorization, though it would still be up to the server what mechanism to use and how to provide access.

DamianEdwards commented 11 months ago

I think for now, authorization of access to the dashboard is out of scope of the dashboard itself, and instead should be delegated to the environment the dashboard is running in, e.g.:

smitpatel commented 11 months ago

Draft of the gRPC API spec between the dashboard and AppHost. It is missing a log streaming API, which requires some more thought. I've started a discussion on that.

syntax = "proto3";

option csharp_namespace = "Aspire.Dashboard";

package Aspire.Dashboard;

// If optional doesn't work then we need to use wrappers for nullable

import "google/protobuf/timestamp.proto";

message ApplicationInformationRequest {
}

message ApplicationInformationResponse {
    string application_name = 1;
    string application_version = 2;
}

message GetResourcesRequest {
}

message EnvironmentVariableViewModel {
    string name = 1;
    optional string value = 2;
    bool is_value_masked = 3;
    bool from_spec = 4;
}

message ResourceService {
    string name = 1;
    optional string allocated_address = 2;
    optional int32 allocated_port = 3;
}

message ContainerData {
    string image = 1;
    optional string container_id = 2;
    repeated int32 ports = 3;
}

message ExecutableData {
    optional int32 process_id = 1;
    optional string executable_path = 2;
    optional string working_directory = 3;
    repeated string arguments = 4;
}

message ProjectData {
    optional int32 process_id = 1;
    string project_path = 2;
}

message Endpoint {
    string address = 1;
}

enum ResourceType {
    custom = 0;
    container = 1;
    executable = 2;
    project = 3;
}

message ResourceViewModel {
    string name = 1;
    ResourceType resource_type = 2;
    string uid = 3;
    optional string state = 4;
    optional google.protobuf.Timestamp creation_time_stamp = 5;
    repeated EnvironmentVariableViewModel environment_view_model = 6;
    optional int32 expected_endpoints_count = 7;
    repeated Endpoint endpoints = 8;
    repeated ResourceService services = 9;
    oneof additional_data {
        ContainerData container_data = 10;
        ExecutableData executable_data = 11;
        ProjectData project_data = 12;
    }
}

message ResourcesList {
    int32 sequence_num = 1;
    repeated ResourceViewModel items = 2;
}

message WatchResourcesRequest {
    int32 sequence_num = 1;
}

enum ChangeKind {
    other = 0;
    added = 1;
    modified = 2;
    deleted = 3;
}

message ResourceChangeResult {
    ChangeKind change_kind = 1;
    ResourceViewModel resource = 2;
}

service IDashboardViewModelService {
    rpc GetApplicationInformation(ApplicationInformationRequest) returns (ApplicationInformationResponse);
    rpc GetResources(GetResourcesRequest) returns (ResourcesList);
    rpc WatchResources(WatchResourcesRequest) returns (stream ResourceChangeResult);
    // API for log streaming is not here yet
}

drewnoakes commented 11 months ago

What's the significance of sequence_num in this API?

drewnoakes commented 11 months ago

There's a data race in that service definition.

service IDashboardViewModelService {
    rpc GetResources(GetResourcesRequest) returns (ResourcesList);
    rpc WatchResources(WatchResourcesRequest) returns (stream ResourceChangeResult);
}

GetResources and WatchResources cannot be called at exactly the same instant, so an update that occurs between the two calls might be missed.

We can fix with a slightly different API:

// Models a snapshot of resource state
message WatchResourcesSnapshot {
    repeated ResourceViewModel items = 1;
}

message WatchResourcesChange {
    ChangeKind kind = 1;
    ResourceViewModel value = 2;
}

message WatchResourcesUpdate {
    // The first update contains the full initial snapshot.
    // All future updates are changes relative to that snapshot.
    oneof value {
        WatchResourcesSnapshot initial_snapshot = 1;
        WatchResourcesChange change = 2;
    }
}

service IDashboardViewModelService {
    rpc WatchResources(WatchResourcesRequest) returns (stream WatchResourcesUpdate);
}

In this way, the server implementation can ensure the client always has correct state.
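
A minimal Python sketch of how a server might guarantee this, assuming an in-memory store guarded by a single lock. `ResourceStore` and its methods are hypothetical and are not Aspire's implementation; the point is only that the snapshot is taken and the subscriber registered in the same critical section, so no change can fall between them.

```python
import threading
from queue import Queue
from typing import Dict, List


class ResourceStore:
    """Hypothetical server-side store of resource state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._resources: Dict[str, str] = {}
        self._subscribers: List[Queue] = []

    def upsert(self, name: str, state: str) -> None:
        # Update the state and notify every watcher under the same lock.
        with self._lock:
            self._resources[name] = state
            for subscriber in self._subscribers:
                subscriber.put(("change", name, state))

    def watch(self) -> Queue:
        # Snapshot and subscription happen atomically: the first message on
        # the stream is the full state, and every later change follows it.
        with self._lock:
            stream: Queue = Queue()
            stream.put(("initial_snapshot", dict(self._resources)))
            self._subscribers.append(stream)
            return stream
```

With two separate calls, an `upsert` landing after the snapshot but before the subscription would be lost; here it either is in the snapshot or arrives as a change.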

smitpatel commented 11 months ago

sequence_num is the coordination piece that avoids the race condition. When you start a watch, you provide the sequence number from the snapshot you received in the first call. If you pass 0, you get all the changes from the start. This also allows a watch to reconnect if the connection drops.
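
The mechanism can be sketched in a few lines of Python. This is an illustration under assumptions, not the draft's actual server: `ChangeLog` and its methods are hypothetical names, and changes are simply numbered from 1 in the order they occur.

```python
from typing import List, Tuple

Change = Tuple[int, str, str]  # (sequence_num, resource name, state)


class ChangeLog:
    """Hypothetical server-side log of resource changes, numbered from 1."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def record(self, name: str, state: str) -> int:
        sequence_num = len(self._changes) + 1
        self._changes.append((sequence_num, name, state))
        return sequence_num

    def current_sequence_num(self) -> int:
        # The value a snapshot taken right now would carry.
        return len(self._changes)

    def watch(self, sequence_num: int) -> List[Change]:
        # Replay every change made after the given sequence number.
        # Passing 0 replays everything from the start.
        return [change for change in self._changes if change[0] > sequence_num]
```

A client that took a snapshot at sequence number 2 and then calls `watch(2)` receives change 3 onwards, even if that change happened between its two calls.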

drewnoakes commented 11 months ago

Ok I see.

The upside of a sequence number is that reconnects don't have to sync everything. The downside is that the server needs to remember deltas, and it's not clear when it's safe to discard them.
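
To make the downside concrete, here is a Python sketch of one possible retention policy. It is an assumption for illustration only and was not proposed in this thread: the server keeps just the most recent deltas and reports a gap when a client asks for ones already discarded. `BoundedChangeLog` and `capacity` are hypothetical names.

```python
from collections import deque
from typing import List, Optional, Tuple

Change = Tuple[int, str, str]  # (sequence_num, resource name, state)


class BoundedChangeLog:
    """Hypothetical log that retains only the most recent `capacity` deltas."""

    def __init__(self, capacity: int) -> None:
        self._changes: deque = deque(maxlen=capacity)
        self._next_sequence_num = 1

    def record(self, name: str, state: str) -> None:
        # Appending to a full deque silently drops the oldest delta.
        self._changes.append((self._next_sequence_num, name, state))
        self._next_sequence_num += 1

    def watch(self, sequence_num: int) -> Optional[List[Change]]:
        """Return the deltas after sequence_num, or None if some were discarded."""
        if self._changes:
            oldest_retained = self._changes[0][0]
        else:
            oldest_retained = self._next_sequence_num
        if sequence_num + 1 < oldest_retained:
            return None  # gap: the client would need a full resync
        return [change for change in self._changes if change[0] > sequence_num]
```

Any fixed `capacity` is a guess about how long clients stay disconnected, which is exactly the "when is it safe to discard" problem.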

JamesNK commented 11 months ago

Endpoint vs ResourceService

Can these be merged into one collection? I added the services collection because the tracing wants to know about all exposed ports. But it feels like there is a lot of overlap with endpoints.

Merging them:

JamesNK commented 11 months ago

EnvironmentVariableViewModel -> EnvironmentVariable

Remove the term view model from the API. It's a UI thing. Data from the API may be used in the UI, but don't bake that into the service and message names.

JamesNK commented 11 months ago

IDashboardViewModelService -> DashboardService

I'm sure this name will be bikeshedded a lot.

JamesNK commented 11 months ago

package Aspire.Dashboard;

I propose package aspire.v1;

drewnoakes commented 11 months ago

With feedback:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

syntax = "proto3";

package aspire.v1;

// TODO get optional -> nullable mapping working during C# codegen

import "google/protobuf/timestamp.proto";

////////////////////////////////////////////

message ApplicationInformationRequest {
}

message ApplicationInformationResponse {
    string application_name = 1;
    string application_version = 2;
}

////////////////////////////////////////////

message ResourceCommandRequest {
    string command_type = 1;
    string resource_name = 2;
    string resource_type = 3;
    optional string confirmation_message = 4;
}

enum ResourceCommandResponseKind {
    none = 0;
    succeeded = 1;
    failed = 2;
    cancelled = 3;
}

message ResourceCommandResponse {
    ResourceCommandResponseKind kind = 1;
    optional string error_message = 2;
}

////////////////////////////////////////////

message ResourceType {
    // Unique name for the resource type. Not for display.
    string unique_name = 1;

    // Localised name for the resource type, for display use only.
    optional string display_name = 2;

    // Any commands that may be executed against resources of this type, avoiding
    // the need to copy the value to every Resource instance.
    //
    // Note that these commands must apply to matching resources at any time.
    //
    // If the set of commands changes over time, use the "commands" property
    // of the Resource itself.
    repeated ResourceCommandRequest commands = 3;
}

////////////////////////////////////////////

message EnvironmentVariable {
    string name = 1;
    optional string value = 2;
    bool is_value_masked = 3;
    bool is_from_spec = 4;
}

message Endpoint {
    string name = 1;
    optional string allocated_address = 2;
    optional int32 allocated_port = 3;
    optional string http_address = 4;
}

message StringArray {
    repeated string values = 1;
}

message AdditionalData {
    // TODO do we need separate display value(s)?
    string name = 1;
    // Optional namespace, e.g. "container", "executable", "project", ...
    optional string namespace = 2;
    // A single value will be most common. Also support lists, to avoid escaping.
    oneof kind {
        string value = 3;
        StringArray values = 4;
    }
}

message ResourceId {
    string uid = 1;
    // TODO do we need resource_type to make unique names? if not, inline ResourceId type as string.
    string resource_type = 2;
}

// Models the full state of a resource (container, executable, project, etc) at a particular point in time.
message ResourceSnapshot {
    ResourceId resource_id = 1;
    string display_name = 2;
    optional string state = 3;
    optional google.protobuf.Timestamp created_at = 4;
    repeated EnvironmentVariable environment = 5;
    optional int32 expected_endpoints_count = 6;
    repeated Endpoint endpoints = 7;
    repeated ResourceCommandRequest commands = 8;

    // List of additional data, as name/value pairs.
    // For:
    // - Containers: image, container_id, ports
    // - Executables: process_id, executable_path, working_directory, arguments
    // - Projects: process_id, project_path
    repeated AdditionalData additional_data = 9;
}

////////////////////////////////////////////

// Models a snapshot of resource state
message WatchResourcesSnapshot {
    repeated ResourceSnapshot resources = 1;
    repeated ResourceType types = 2;
}

////////////////////////////////////////////

message ResourceDeletion {
    ResourceId resource_id = 1;
}

message WatchResourcesChange {
    oneof kind {
        ResourceDeletion delete = 1;
        ResourceSnapshot upsert = 2;
    }
}

message WatchResourcesChanges {
    repeated WatchResourcesChange value = 1;
}

////////////////////////////////////////////

// Sent periodically from the server to prevent proxies closing the connection due to inactivity,
// and to allow the client to detect when the connection drops, notify the user and attempt to reconnect.
message Heartbeat {
    // Time until the next heartbeat.
    int32 interval_milliseconds = 1;
}

////////////////////////////////////////////

// Initiates a subscription for data about resources.
message WatchResourcesRequest {
    // True if the client is establishing this connection because a prior one closed unexpectedly.
    optional bool is_reconnect = 1;
}

// A message received from the server when watching resources. Has multiple types of payload.
message WatchResourcesUpdate {
    oneof kind {
        // Snapshot of current resource state. Received once upon connection, before any "changes".
        WatchResourcesSnapshot initial_snapshot = 1;
        // One or more deltas to apply.
        WatchResourcesChanges changes = 2;
        // A sign of life from the server. Can arrive at any time.
        Heartbeat heartbeat = 3;
    }
}

////////////////////////////////////////////

service DashboardService {
    rpc GetApplicationInformation(ApplicationInformationRequest) returns (ApplicationInformationResponse);
    rpc WatchResources(WatchResourcesRequest) returns (stream WatchResourcesUpdate);
    rpc ExecuteResourceCommand(ResourceCommandRequest) returns (ResourceCommandResponse);
}
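
On the client side, consuming this stream means replacing local state on the initial snapshot and then applying each upsert or deletion in order. The Python sketch below illustrates that. It models messages as plain dicts rather than classes generated from the proto, keys resources by `uid` alone, and uses the hypothetical name `DashboardState`; none of this is taken from the dashboard's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DashboardState:
    """Hypothetical client-side view built from WatchResourcesUpdate messages."""

    resources: Dict[str, dict] = field(default_factory=dict)  # uid -> snapshot
    heartbeat_interval_ms: Optional[int] = None

    def apply(self, update: dict) -> None:
        # Exactly one key is expected, mirroring `oneof kind` in the proto.
        if "initial_snapshot" in update:
            # Replace everything: the snapshot is the full current state.
            self.resources = {r["uid"]: r for r in update["initial_snapshot"]}
        elif "changes" in update:
            for change in update["changes"]:
                if "delete" in change:
                    self.resources.pop(change["delete"], None)
                elif "upsert" in change:
                    snapshot = change["upsert"]
                    self.resources[snapshot["uid"]] = snapshot
        elif "heartbeat" in update:
            # Remember how long to wait before treating the server as gone.
            self.heartbeat_interval_ms = update["heartbeat"]
```

Because an upsert carries a whole `ResourceSnapshot`, the client never merges fields; it overwrites the entry for that resource.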

smitpatel commented 11 months ago

We need more structured messages for known things.

drewnoakes commented 11 months ago

Could you be more specific @smitpatel?

drewnoakes commented 11 months ago

Moving the design of this API into PR now: https://github.com/dotnet/aspire/pull/1274

drewnoakes commented 10 months ago

#1476 introduces the use of gRPC in-proc between the app host and the dashboard.

davidfowl commented 10 months ago

@prom3theu5 We'll need a resource server for aspir8 for k8s as well if we want the dashboard to work there. We're going to be producing a container image for the dashboard, but it needs to connect to another resource server that can show the k8s resources that are relevant to Aspire in the dashboard using the gRPC contract above. I'll open an issue on aspir8 for this.

prom3theu5 commented 10 months ago

> @prom3theu5 We'll need a resource server for aspir8 for k8s as well if we want the dashboard to work there. We're going to be producing a container image for the dashboard, but it needs to connect to another resource server that can show the k8s resources that are relevant to Aspire in the dashboard using the gRPC contract above. I'll open an issue on aspir8 for this.

Makes sense - cheers David. I think this will tick a lot of boxes; there seems to be a lot of call in the community to have the dashboard beyond dev. Why not - it's good :)