dotnet / aspire

Tools, templates, and packages to accelerate building observable, production-ready apps
https://learn.microsoft.com/dotnet/aspire
MIT License
3.89k stars 469 forks source link

Trace timeline seems wrong #841

Closed Kralizek closed 11 months ago

Kralizek commented 1 year ago

A page in a Razor Web app calls different endpoints of the same API.

The trace timeline seems wrong.

image
davidfowl commented 1 year ago

What does "seems wrong" mean? Can you provide more details?

Kralizek commented 1 year ago

My bad. I was into the flow and the wrongness seemed clear to me.

This is the OnGetAsync method of the razor web app

    public async Task<ActionResult> OnGetAsync(CancellationToken cancellationToken)
    {
        var something = await adminApi.Http.GetFromJsonAsync<Something>($"endpoint1", cancellationToken);

        if (provider is null)
        {
            return NotFound();
        }

        Something = something;

        Something2s = await adminApi.Http.GetFromJsonAsync<Something2[]>($"endpoint2", cancellationToken);

        Something3s = await adminApi.Http.GetFromJsonAsync<Something3[]>($"endpoint3", cancellationToken);

        Something4s = await adminApi.Http.GetFromJsonAsync<Something4[]>($"endpoint4", cancellationToken);

        return Page();
    }

Given this snippet, I had the assumption that the trace timeline had the following characteristics:

Here is my wonderful ASCII representation.

1st xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2nd      xxxxxxxxxxxx
3rd           xxxxx
4th                                 xxxxxxxxxxxxxxxx
5th                                      xxxxxxxxxx
6th                                                                   xxxxxxxxxxxxxxxxx
7th                                                                       xxxxxxxxxxx

I hope this is clearer.

davidfowl commented 1 year ago

No the trace looks correct to me, you have one parent, the incoming request, and 4 sibling requests (endpoints 1 to 4). Each of those have one child.

Edit: The first request does look strange.

cc @JamesNK

Kralizek commented 1 year ago

Edit: The first request does look strange.

To me also the right part of the time diagram (the bars). the name column looks correct.

danmoseley commented 1 year ago

@JamesNK thoughts?

JamesNK commented 1 year ago

I agree it's odd.

One possibility: You can have a situation where the root span for a trace isn't available yet, which means results can be unexpected.

For example:

  1. The server receives a HTTP request and starts a trace and a span.
  2. The server makes HTTP calls to other servers that creates new spans in the trace.
  3. The HTTP calls complete, the HTTP call spans complete, and they're sent via OTLP.
  4. The first HTTP request completes, and its span is sent via OTLP.

In the time period after 3, but before 4, there are child spans in the trace without a root span. It can look like a trace has multiple root spans.

However, I don't think that is the issue because the first two HTTP call spans are nested under the providers/{providerId:guid} span. Also, if there are multiple root spans, they're ordered by time to not appear out of order.

Another possibility is different apps have different clocks. I haven't seen that issue ever, and I don't think it is the case here because the providers/{providerId:guid} span and HTTP call spans are from the same app.

TLDR; I don't know. I'd be interested if there is a reproduction that I could look at.

Kralizek commented 1 year ago

I was hoping to get out a full representation of the whole trace but couldn't find how. (maybe a feature?)

Anyway, clicking on each row on the name column I got the following information (which confirms my hypothesis that the timeline graph is wrong)

# Name Service Duration Start time
1 providers/{providerId:guid} admin 18,35ms 0μs
2 HTTP GET admin 4ms 1,76ms
3 /providers/f0bc5c95-2d24-45fe-8d7e-9ca5bb724f57 api 3,55ms 2,12ms
4 HTTP GET admin 3,07ms 6,41ms
5 /providers/f0bc5c95-2d24-45fe-8d7e-9ca5bb724f57/courses api 2,64ms 6,73ms
6 HTTP GET admin 3,1ms 10,08ms
7 /providers/f0bc5c95-2d24-45fe-8d7e-9ca5bb724f57/sites api 2,78ms 10,33ms
8 HTTP GET admin 3,31ms 13,63ms
9 /providers/f0bc5c95-2d24-45fe-8d7e-9ca5bb724f57/users api 2,95ms 13,89ms

This is the timeline I got passing this data to GPT 4

image

Kralizek commented 12 months ago

@JamesNK @danmoseley @davidfowl Is there anything else other than a repro repo I can provide to help triage the issue?

DamianEdwards commented 12 months ago

I tried to reproduce this based on your description and got the result you expected:

image

Here's my Razor Page's model class:

public class IndexModel(ILogger<IndexModel> logger, AdminApiClient adminApi) : PageModel
{
    private string? provider = "something";

    public int Something { get; set; }
    public int Something2s { get; set; }
    public int Something3s { get; set; }
    public int Something4s { get; set; }

    public async Task OnGetAsync()
    {
        var something = await adminApi.GetApi1();

        if (provider is null)
        {
            return;
        }

        Something = something;

        Something2s = await adminApi.GetApi2();

        Something3s = await adminApi.GetApi3();

        Something4s = await adminApi.GetApi4();
    }
}

And the apis:

app.MapGet("/api1", async () =>
{
    await Task.Delay(100);
    return 1;
});

app.MapGet("/api2", async () =>
{
    await Task.Delay(100);
    return 2;
});

app.MapGet("/api3", async () =>
{
    await Task.Delay(100);
    return 3;
});

app.MapGet("/api4", async () =>
{
    await Task.Delay(100);
    return 4;
});

At this point I think we'd need a repro repo.

Kralizek commented 12 months ago

I'll put something together!

Kralizek commented 12 months ago

@DamianEdwards could you share your repro?

I tried running a "dotnet new aspire-starter" application and I got this timeline. See the first line.

image

repro:

$ md Aspire2
$ cd Aspire2
$ dotnet new aspire-starter
$ dotnet run --project .\Aspire2.AppHost\
DamianEdwards commented 12 months ago

https://github.com/DamianEdwards/AspireIssue841Repro

DamianEdwards commented 12 months ago

Is it possible this is a localization issue with the trace store sorting? I note that your culture is using , for thousands separators.

Kralizek commented 12 months ago

You might be onto something. This is what I got when I ran your repro fresh of checkout

image

That being said, the 0 μs should be 0 regardless of the digital separator.

JamesNK commented 12 months ago

I'm guessing the problem is here: https://github.com/dotnet/aspire/blob/eceed4d2bd168d5a969d90dd5608e213edd345d2/src/Aspire.Dashboard/Components/Pages/TraceDetail.razor#L95

The ToStrings should use invariant culture. The long-term solution is making sure culture code analysis is enabled (or maybe it is but doesn't run on Razor pages? IDK).

DamianEdwards commented 12 months ago

@kvenkatrajan would be good to get this one fixed in preview 2