coder / internal

Non-community issues related to coder/coder
2 stars 0 forks source link

Add agent registration timings #116

Open dannykopping opened 1 week ago

dannykopping commented 1 week ago

As part of https://github.com/coder/internal/issues/44, we should show when the agent registration occurred and its start/end time in this timeline. This gives users the ability to understand the delta between workspace provisioning completion and the agent startup, which is really an approximate measure of how quickly the compute instance spins up and executes the agent process and how long it takes to call back to the control-plane.

In https://codercom.slack.com/archives/CJURPL8DN/p1729196074736939, Spike and I discussed what would the definition for "agent registration" be. We settled upon: The time between an agent process starting and the successful completion of the call to /api/v2/workspaceagents/me/rpc. See here: https://github.com/coder/coder/blob/7da231bc9286060e7ad499fa38cb829c4b055a7d/agent/agent.go#L750-L759

The workspace_agents.first_connected_at column seems to satisfy this definition fairly well, and is set here: https://github.com/coder/coder/blob/5366f2576f690a3f7d3ac1a4efb8dd49cc2e9bd1/coderd/workspaceagentsrpc.go#L330-L338

workspace_agent.created_at is set here which is invoked once a provisioner job has been completed: https://github.com/coder/coder/blob/fac77f956ec28ad04c0c90750e381e69465d2ae1/coderd/provisionerdserver/provisionerdserver.go#L1782-L1784

Based on this, I think we can simply return the value of created_at (start) to first_connected_at (end) as a reasonable approximation of agent registration time.

Nit: do we want to persist with this "registration time" terminology or is "connection time" more clear?

BrunoQuaresma commented 6 days ago

Ok, so from what I understand, all the data we need is already there. We just need to return this data in the timing endpoint. Is that correct? For example:

return sdk.AgentTiming{
    StartedAt: a. CreatedAt,
    EndedAt: a.FirstConnectedAt,
    Agent: a.Name
}
johnstcn commented 6 days ago

@BrunoQuaresma that sounds about right to me based on the issue description! I would double-check that the time returned roughly matches with the time it takes to download and start the agent.

dannykopping commented 6 days ago

Hhmm maybe we need the agent download time as well, that would be very useful

BrunoQuaresma commented 6 days ago

@dannykopping so we want to display two timings for each agent, connection, and download times?

dannykopping commented 6 days ago

@dannykopping so we want to display two timings for each agent, connection, and download times?

I think so, WDYT?

BrunoQuaresma commented 4 days ago

I'm trying to figure out what is the best way to display the agent times based on this issue's discussion. I recorded a video explaining my doubts and thoughts a bit more.

https://github.com/user-attachments/assets/1a25c55f-a0f8-447e-9bb6-268ee413ec9e

chrifro commented 4 days ago

I'm for option two, then we are consistent with the patterns. The agent times are part of dev. So dev should have one bar including two smaller boxes. When you click on them, you can see the timings for connection and download times. For the colors, if there is no existing state like success or failed (as we have for start) we can use grey bars without a legend.

Hope that makes sense. Let me know if you prefer a mockup but I think we're aligned @BrunoQuaresma :)

dannykopping commented 4 days ago

OK, I think we need to take a step back here.

In the design you've got "start" and "dev"; I think I see why you're doing this (because one workspace can have many agents). Even so, the "start" section would have to be under "dev" since each agent has its own startup scripts.

Each agent will have (in order) timings for downloading, connecting, and executing startup scripts. Each should be individual lanes of the timing chart like we have already for "init", "plan", "graph", etc.

I think we should assume that 99% of the time workspaces will have a single agent, so let's keep the design simple by excluding the agent name in that case.

If there are multiple agents, we should repeat the "workspace boot" section and in that we could include the agent name.


On a related note, the main panel label "Provisioning time" is too specific, since it does not include the agent timings. I think we should rename this to "Build timeline" as I suggested here.

I also think we should be more clear and rename "workspace boot" to "agent initialization". The whole timeline is effectively the timings for workspace boot, so this will confuse folks.

BrunoQuaresma commented 17 hours ago

@dannykopping it makes sense. It looks like we don't have the download agent timing for now, so I will open a PR to include the connection timing and group it together with the start stage in the same section for each agent. It would look like the following:

Image

Wdyt?

dannykopping commented 12 hours ago

Perfect 👍

I think we should name the stages "run startup scripts" and "connect", to remain consist with the other stages (present tense verbs). The help text for the registration step could be "Time taken to establish an RPC connection with the control plane".