coollabsio / sentinel

An experimental API for your Linux server
https://coolify.io
Apache License 2.0

`Error getting container metrics: json: unsupported value: NaN` and high CPU usage spikes with Coolify #4

Open MatteoGauthier opened 3 months ago

MatteoGauthier commented 3 months ago

On my Coolify instance, I've enabled Sentinel and I noticed high CPU usage spikes and lots of error logs from the Sentinel container.

[Screenshot, 2024-06-27 at 01:03:59]

livghit commented 3 months ago

Hey, I've inspected the code a bit and noticed that this error happens in the getOneContainerMetrics func:

func getOneContainerMetrics(containerID string, csv bool) (string, error) {
    ctx := context.Background()
    apiClient, err := client.NewClientWithOpts()
    if err != nil {
        return "", err
    }
    apiClient.NegotiateAPIVersion(ctx)
    defer apiClient.Close()
    metrics := ContainerMetrics{
        CPUUsagePercentage:    0,
        MemoryUsagePercentage: 0,
        MemoryUsed:            0,
        MemoryAvailable:       0,
        NetworkUsage:          NetworkDevice{},
    }
    container, err := apiClient.ContainerInspect(ctx, containerID)
    if err != nil {
        return "", err
    }
    stats, err := apiClient.ContainerStats(ctx, container.ID, false)
    if err != nil {
        return "", err
    }
    var v types.StatsJSON
    dec := json.NewDecoder(stats.Body)
    if err := dec.Decode(&v); err != nil {
        if err != io.EOF {
            fmt.Printf("Error decoding container stats: %v\n", err)
        }
    }
    defer stats.Body.Close()
    network_devices := v.Networks
    for _, device := range network_devices {
        metrics.NetworkUsage = NetworkDevice{
            Name:    device.InstanceID,
            RxBytes: device.RxBytes,
            TxBytes: device.TxBytes,
        }
    }

    metrics = ContainerMetrics{
        Time:                  getUnixTimeInMilliUTC(),
        CPUUsagePercentage:    calculateCPUPercent(v),
        MemoryUsagePercentage: calculateMemoryPercent(v),
        MemoryUsed:            calculateMemoryUsed(v),
        MemoryAvailable:       v.MemoryStats.Limit,
        NetworkUsage:          metrics.NetworkUsage,
    }
    jsonData, err := json.MarshalIndent(metrics, "", "    ")
    if err != nil {
        return "", err
    }
    if csv {
        return fmt.Sprintf("%s,%f,%d,%f\n", metrics.Time, metrics.CPUUsagePercentage, metrics.MemoryUsed, metrics.MemoryUsagePercentage), nil
    }
    return string(jsonData), nil
}

I think one of the calculations may be the reason this happens, but I am not sure:

metrics = ContainerMetrics{
    Time:                  getUnixTimeInMilliUTC(),
    CPUUsagePercentage:    calculateCPUPercent(v),
    MemoryUsagePercentage: calculateMemoryPercent(v),
    MemoryUsed:            calculateMemoryUsed(v),
    MemoryAvailable:       v.MemoryStats.Limit,
    NetworkUsage:          metrics.NetworkUsage,
}

livghit commented 3 months ago

I tested the whole thing locally and inside Docker. I wasn't able to reproduce your error... 🥲

mutonby commented 3 months ago

I have the same error. Can it be deactivated or something? [Screenshot, 2024-07-08 at 14:15:04]

Rhiz3K commented 3 months ago

Same here today after the 307 update. Per a message on Discord, disabling metrics helped. [image]

andrasbacsai commented 3 months ago

In Coolify v312, I disabled Sentinel on all servers until this bug (and a few others) are fixed.