elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Stack Monitoring] CPU usage rule should handle usage limit changes #160905

Open miltonhultgren opened 1 year ago

miltonhultgren commented 1 year ago

Following up on https://github.com/elastic/kibana/pull/159351

The CPU usage rule as it looks today is not able to accurately calculate the CPU usage in the case where a resource usage limit has changed within the rule's look back window (either a limit has been added or removed, or the set limit was raised or lowered). The current rule simply alerts when it detects this change, but we would ideally extend the rule to handle this case. This means that the rule needs to be able to easily swap between the containerized and non-containerized calculation for the same node.
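
For reference, the two calculations differ roughly as follows (a simplified sketch: the field names mirror the elasticsearch.node_stats monitoring documents, and the exact formula used by the shipped rule may differ):

// Simplified sketch of the two calculation paths the rule needs to swap between.
interface NodeStatsSample {
  processCpuPercent: number;     // node_stats.process.cpu.percent
  cgroupUsageNanos?: number;     // node_stats.os.cgroup.cpuacct.usage_nanos
  cgroupQuotaMicros?: number;    // node_stats.os.cgroup.cpu.cfs_quota_micros (-1 = no limit)
  cgroupElapsedPeriods?: number; // node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods
}

// Non-containerized: average the reported process CPU percent over the window.
function nonContainerizedCpuUsage(samples: NodeStatsSample[]): number {
  const sum = samples.reduce((acc, s) => acc + s.processCpuPercent, 0);
  return samples.length > 0 ? sum / samples.length : NaN;
}

// Containerized: derive usage from the cgroup counters, normalized by the
// CFS quota and the number of elapsed periods across the window.
function containerizedCpuUsage(first: NodeStatsSample, last: NodeStatsSample): number {
  const usageDeltaNanos = (last.cgroupUsageNanos ?? 0) - (first.cgroupUsageNanos ?? 0);
  const periods = (last.cgroupElapsedPeriods ?? 0) - (first.cgroupElapsedPeriods ?? 0);
  const quotaNanos = (last.cgroupQuotaMicros ?? -1) * 1000;
  if (quotaNanos <= 0 || periods <= 0) {
    return NaN; // no limit set or not enough data: this calculation does not apply
  }
  return (usageDeltaNanos / (quotaNanos * periods)) * 100;
}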

Handling the change is non-trivial, but here are three options we can think of right now:

1. Split the look back window into two or more spans when a change is detected. The rule already detects the change and could respond to this situation by determining the time ranges in which each setting applied (there could be many), making a follow-up query per time range, calculating the usage in each time range (using the appropriate calculation) and then taking the average of those. This could be costly in processing time within the rule if there are more than two spans.

2. Use a date histogram to always get smaller time spans (see the sketch after this list). This offers a few sub-options; we could for example drop the exact buckets where the change happened, but that requires having enough buckets that dropping a few would not greatly affect the average. Then for each remaining bucket we apply the appropriate calculation and take the average of the buckets. It's possible this could be done in part by Elasticsearch, but most likely it will have to be done in Kibana. This path exposes us to scalability risks by asking Elasticsearch to do more work, potentially hitting the bucket limit and timing out the rule execution due to the extra processing. The current rule scales per cluster per node, which can partially be worked around by creating multiple instances of the rule that each filter for a specific cluster, for example.

3. The long shot: use Elasticsearch transforms to create data that is easy to alert on. Underlying the problems the rule faces is a data format that is not easy to alert on. We could try to leverage a transform to change the data into something that is easier to make a yes/no decision on. The transform would do (roughly) the work outlined in option 2 and put the result into a document which the rule can consume, leaving the rule quick to execute since the hard work is amortized by Elasticsearch. This is somewhat uncharted territory: we don't know if a transform can keep up so that the rule doesn't lag, it introduces more complexity in the setup, and there is currently no way to install transforms as part of the alerting framework. So the Stack Monitoring plugin would have to own setting up and cleaning up such a transform and making sure the right permissions are available. Further, there are some doubts about the scalability of transforms as well, especially for non-aggregated data.
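
As a rough illustration of option 2 (all names below are hypothetical), the Kibana-side aggregation could look like this once each date histogram bucket has been resolved into per-bucket usage values plus a flag for whether the limit changed inside it:

// Hypothetical sketch of option 2: drop the buckets where the limit changed,
// then average the per-bucket usage computed with whichever calculation
// applies to each remaining bucket.
interface Bucket {
  limitChangedWithinBucket: boolean;
  hasCgroupLimit: boolean;       // whether a CFS quota was in effect during the bucket
  containerizedUsage: number;    // usage derived from the cgroup counters
  nonContainerizedUsage: number; // usage derived from process.cpu.percent
}

function averageUsageAcrossBuckets(buckets: Bucket[]): number | undefined {
  const usable = buckets.filter((b) => !b.limitChangedWithinBucket);
  if (usable.length === 0) {
    return undefined; // not enough clean buckets to produce a trustworthy average
  }
  const total = usable.reduce(
    (acc, b) => acc + (b.hasCgroupLimit ? b.containerizedUsage : b.nonContainerizedUsage),
    0
  );
  return total / usable.length;
}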

AC

elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

miltonhultgren commented 1 year ago

Thinking about it, my vote would be for option 2.

Especially if we can push a majority of the work into Elasticsearch, that path has benefits since it's only one query, and the scaling issue can be addressed by having multiple instances of the rule with varying filters, leaving the overall flow simpler to trace.

bck01215 commented 1 year ago

The CPU usage rule seems to break on non-containerized environments.

[screenshot of the alert]

We get this on all our nodes now

miltonhultgren commented 1 year ago

@bck01215 Can you explain more about your setup? Are you using containers? Are you using cgroups without containers? Do you have limits set? Have you configured monitoring.ui.container.elasticsearch.enabled? Are you monitoring both containerized and non-containerized workloads with the same Kibana instance? Can you share the result of GET /_nodes/_local/stats from the node in question?

That error means that Kibana is configured with monitoring.ui.container.elasticsearch.enabled set to false (which is the default) but that the nodes are reporting monitoring data which includes cgroup metrics for usage limits (which Elasticsearch only reports if it's running in a cgroup). In that case, the basic CPU metric is likely inaccurate since it doesn't account for the cgroup limits, which is why Kibana should be configured with monitoring.ui.container.elasticsearch.enabled set to true instead (since that changes how the rule computes the usage).
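
In other words, the check that produces this alert behaves roughly like the following sketch (an assumed shape for illustration, not the actual rule code; the flag and field names are taken from the messages above):

// Sketch of the configuration mismatch check described above. A node
// "reports a limit" when the cgroup CPU quota is present and not -1.
interface CgroupCpuStats {
  cfs_quota_micros?: number | null;
}

function detectConfigMismatch(
  containerConfigEnabled: boolean, // monitoring.ui.container.elasticsearch.enabled
  cgroupCpu: CgroupCpuStats | undefined
): string | undefined {
  const quota = cgroupCpu?.cfs_quota_micros;
  const reportsLimit = quota != null && quota !== -1;

  if (!containerConfigEnabled && reportsLimit) {
    return 'Kibana is configured for non-containerized workloads but the node has resource limits configured.';
  }
  if (containerConfigEnabled && !reportsLimit) {
    return 'Kibana is configured for containerized workloads but the node has no resource limits configured.';
  }
  return undefined; // configuration and reported metrics agree
}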

bck01215 commented 1 year ago

We did not have containers. This error came from updating from 8.4 to 8.10. It seemed that deleting and recreating the rule fixed it.

miltonhultgren commented 1 year ago

Interesting, perhaps there is/was something stored in the rule state that would affect the flow.

Anyway, I'm glad it was solved by re-creating the rule, don't hesitate to reach out again if any issues come up!

msafdal commented 1 year ago

Getting the same alert as @bck01215 triggered after the last few upgrades. Currently running 8.10.2 on both Elasticsearch and Kibana.

Kibana is configured for non-containerized workloads but node xxx has resource limits configured.


In my case removing and re-adding the rule did not resolve the issue.

I'm not running any containers, however I noticed that the systemd service mentions "CGroup". I've not touched any cgroup limits. It's installed "out-of-the-box" via apt on a fully patched Ubuntu 20.04 system. It might be a false positive, depending on how the containerization detection is done.

Operating System: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-163-generic
Architecture: x86-64

# systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
     Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/elasticsearch.service.d
             └─override.conf
     Active: active (running) since Thu 2023-09-28 10:45:02 CEST; 17min ago
       Docs: https://www.elastic.co
   Main PID: 2465 (java)
      Tasks: 159 (limit: 9425)
     Memory: 6.1G
     CGroup: /system.slice/elasticsearch.service
             ├─2465 /usr/share/elasticsearch/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=/usr/share/elasticsearch/bin/elasticsearch -Dcli.libs=lib/tools>
             ├─2539 /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch ->
             └─2563 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
# cat /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity
# systemctl show elasticsearch.service

Type=notify
Restart=no
NotifyAccess=all
RestartUSec=100ms
TimeoutStartUSec=15min
TimeoutStopUSec=infinity
TimeoutAbortUSec=infinity
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=2465
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success
UID=111
GID=113
NRestarts=0
OOMPolicy=stop
ExecMainStartTimestamp=Thu 2023-09-28 10:44:12 CEST
ExecMainStartTimestampMonotonic=91909234
ExecMainExitTimestampMonotonic=0
ExecMainPID=2465
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/usr/share/elasticsearch/bin/systemd-entrypoint ; argv[]=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
ExecStartEx={ path=/usr/share/elasticsearch/bin/systemd-entrypoint ; argv[]=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet ; flags= ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
ControlGroup=/system.slice/elasticsearch.service
MemoryCurrent=6604414976
CPUUsageNSec=[not set]
EffectiveCPUs=
EffectiveMemoryNodes=
TasksCurrent=165
IPIngressBytes=[no data]
IPIngressPackets=[no data]
IPEgressBytes=[no data]
IPEgressPackets=[no data]
IOReadBytes=18446744073709551615
IOReadOperations=18446744073709551615
IOWriteBytes=18446744073709551615
IOWriteOperations=18446744073709551615
Delegate=no
CPUAccounting=no
CPUWeight=[not set]
StartupCPUWeight=[not set]
CPUShares=[not set]
StartupCPUShares=[not set]
CPUQuotaPerSecUSec=infinity
CPUQuotaPeriodUSec=infinity
AllowedCPUs=
AllowedMemoryNodes=
IOAccounting=no
IOWeight=[not set]
StartupIOWeight=[not set]
BlockIOAccounting=no
BlockIOWeight=[not set]
StartupBlockIOWeight=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
DevicePolicy=auto
TasksAccounting=yes
TasksMax=9425
IPAccounting=no
Environment=ES_HOME=/usr/share/elasticsearch ES_PATH_CONF=/etc/elasticsearch PID_DIR=/var/run/elasticsearch ES_SD_NOTIFY=true
EnvironmentFiles=/etc/default/elasticsearch (ignore_errors=yes)
UMask=0022
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=65535
LimitNOFILESoft=65535
LimitAS=infinity
LimitASSoft=infinity
LimitNPROC=4096
LimitNPROCSoft=4096
LimitMEMLOCK=infinity
LimitMEMLOCKSoft=infinity
LimitLOCKS=infinity
LimitLOCKSSoft=infinity
LimitSIGPENDING=31419
LimitSIGPENDINGSoft=31419
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=infinity
LimitRTTIMESoft=infinity
WorkingDirectory=/usr/share/elasticsearch
OOMScoreAdjust=0
Nice=0
IOSchedulingClass=0
IOSchedulingPriority=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
CPUAffinity=
CPUAffinityFromNUMA=no
NUMAPolicy=n/a
NUMAMask=
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardInputData=
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
LogLevelMax=-1
LogRateLimitIntervalUSec=0
LogRateLimitBurst=0
SecureBits=0
CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read
AmbientCapabilities=
User=elasticsearch
Group=elasticsearch
DynamicUser=no
RemoveIPC=no
MountFlags=
PrivateTmp=yes
PrivateDevices=no
ProtectKernelTunables=no
ProtectKernelModules=no
ProtectKernelLogs=no
ProtectControlGroups=no
PrivateNetwork=no
PrivateUsers=no
PrivateMounts=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
UtmpMode=init
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
LockPersonality=no
RuntimeDirectoryPreserve=no
RuntimeDirectoryMode=0755
RuntimeDirectory=elasticsearch
StateDirectoryMode=0755
CacheDirectoryMode=0755
LogsDirectoryMode=0755
ConfigurationDirectoryMode=0755
TimeoutCleanUSec=infinity
MemoryDenyWriteExecute=no
RestrictRealtime=no
RestrictSUIDSGID=no
RestrictNamespaces=no
MountAPIVFS=no
KeyringMode=private
ProtectHostname=no
KillMode=process
KillSignal=15
RestartKillSignal=15
FinalKillSignal=9
SendSIGKILL=no
SendSIGHUP=no
WatchdogSignal=6
Id=elasticsearch.service
Names=elasticsearch.service
Requires=sysinit.target system.slice -.mount
Wants=network-online.target
WantedBy=multi-user.target
Conflicts=shutdown.target
Before=multi-user.target shutdown.target
After=systemd-journald.socket -.mount system.slice basic.target sysinit.target network-online.target systemd-tmpfiles-setup.service
RequiresMountsFor=/tmp /var/tmp /run/elasticsearch /usr/share/elasticsearch
Documentation=https://www.elastic.co
Description=Elasticsearch
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/usr/lib/systemd/system/elasticsearch.service
DropInPaths=/etc/systemd/system/elasticsearch.service.d/override.conf
UnitFileState=enabled
UnitFilePreset=enabled
StateChangeTimestamp=Thu 2023-09-28 10:45:02 CEST
StateChangeTimestampMonotonic=141699634
InactiveExitTimestamp=Thu 2023-09-28 10:44:12 CEST
InactiveExitTimestampMonotonic=91909577
ActiveEnterTimestamp=Thu 2023-09-28 10:45:02 CEST
ActiveEnterTimestampMonotonic=141699634
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
CanClean=runtime
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
NeedDaemonReload=no
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Thu 2023-09-28 10:44:12 CEST
ConditionTimestampMonotonic=91906857
AssertTimestamp=Thu 2023-09-28 10:44:12 CEST
AssertTimestampMonotonic=91906857
Transient=no
Perpetual=no
StartLimitIntervalUSec=10s
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
SuccessAction=none
InvocationID=4f28f67c4fda4d70bd994ce417ef1f03
CollectMode=inactive
GET /

{
  "name": "redacted",
  "cluster_name": "redacted",
  "cluster_uuid": "MO0h-6amRzmUrahDwUyd4Q",
  "version": {
    "number": "8.10.2",
    "build_flavor": "default",
    "build_type": "deb",
    "build_hash": "6d20dd8ce62365be9b1aca96427de4622e970e9e",
    "build_date": "2023-09-19T08:16:24.564900370Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "You Know, for Search"
}
GET _nodes/stats/process

{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "redacted",
  "nodes": {
    "LcJxYbj5QPiJoKGxGMVTRw": {
      "timestamp": 1695891961430,
      "name": "redacted",
      "transport_address": "10.35.6.16:9300",
      "host": "10.35.6.16",
      "ip": "10.35.6.16:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331182080",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891961096,
        "open_file_descriptors": 4900,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 34,
          "total_in_millis": 586047080
        },
        "mem": {
          "total_virtual_in_bytes": 398558937088
        }
      }
    },
    "fpHmBFy1QLSoIbxE5DEBsQ": {
      "timestamp": 1695891961442,
      "name": "redacted",
      "transport_address": "10.35.6.17:9300",
      "host": "10.35.6.17",
      "ip": "10.35.6.17:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331186176",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891961110,
        "open_file_descriptors": 2776,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 53,
          "total_in_millis": 5383620
        },
        "mem": {
          "total_virtual_in_bytes": 134904147968
        }
      }
    },
    "gZNoi6l8RrCyH4uD7BpYTg": {
      "timestamp": 1695891961420,
      "name": "redacted",
      "transport_address": "10.35.6.18:9300",
      "host": "10.35.6.18",
      "ip": "10.35.6.18:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331423744",
        "xpack.installed": "true",
        "transform.config_version": "10.0.0",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891960798,
        "open_file_descriptors": 4962,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 57,
          "total_in_millis": 1993530
        },
        "mem": {
          "total_virtual_in_bytes": 377759141888
        }
      }
    }
  }
}

Let me know if you need any further information.

tonyghiani commented 1 year ago

Hey @msafdal, as you correctly mentioned, it does depend on how the containerization detection is done. We noticed this from another report and we are updating the way this flow is detected in this PR.

Regarding the limits, the default values are unset or infinity, which is equivalent to not having them set. You can check each property's meaning and set it accordingly in your override file to set the limit that the control group should use.
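
For example, a hypothetical addition to the existing override file that declares an explicit CPU quota for the service's control group could look like this (CPUQuota is the relevant systemd directive; the 400% value is an assumption for a 4-CPU host, i.e. a limit equal to the whole machine):

# /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity
# Hypothetical addition: give the service's control group an explicit CPU quota.
CPUQuota=400%

After changing the override, a systemctl daemon-reload and a service restart are needed for the new quota to apply.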

k4z4n0v4 commented 1 year ago

Hijacking the thread to ask what "Kibana is configured for non-containerized workloads" means. I'm running the stack on Docker Swarm and started receiving the same alert after 8.10.2. I couldn't find anything regarding "telling Kibana it's running in a container". What's my fix, given that the alert is right and I haven't configured Kibana properly?

EDIT: I set monitoring.ui.container.elasticsearch.enabled: true in kibana.yml as per @miltonhultgren's comment, and after restarting the Kibana container I now get the opposite alert: [screenshot of the alert]

I'm guessing the rule is somewhat inconsistent even for truly containerized stacks now.

miltonhultgren commented 1 year ago

@k4z4n0v4 This is a miss on our part; we didn't consider the case where someone is intentionally running in a container/cgroup without limits. We have a fix coming out in the next patch, but in the meantime you could work around this by setting the limit on your containers to 100% of your available CPU.
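
For a Swarm deployment, that workaround could look roughly like this compose fragment (hypothetical values; set cpus to the number of CPUs actually available on the host):

# docker-compose.yml fragment (hypothetical)
services:
  elasticsearch:
    deploy:
      resources:
        limits:
          cpus: "4"   # equal to the host's CPU count, i.e. effectively no cap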

willemdh commented 1 year ago

This triggers on all our nodes since the update to 8.10.2. Our nodes are not containerized, and we have no limits configured as far as I know (although the alerts say we do).

miltonhultgren commented 1 year ago

@willemdh If the alert is reporting that you have limits specified then that is because that's what Elasticsearch is reporting, and in that case you should most likely configure Kibana to monitor a containerized workload (container or cgroup based) so that the CPU calculation is correct.

leandrojmp commented 1 year ago

Just upgraded my monitoring cluster to 8.10.2 and got the same alert for all of my 20 nodes.

I do not use containers, I run on normal VMs, and I'm not sure what I should do to fix this.

[screenshot of the alert]

I added the following line to kibana.yml:

monitoring.ui.container.elasticsearch.enabled: true

But now the alert is the inverse for all my nodes.

[screenshot of the inverted alert]

I'm using the rpm package and systemd uses cgroups to run elasticsearch.

So it seems that there is no workaround for this; the solution is to disable the rule and wait for the fix in https://github.com/elastic/kibana/pull/167244.

miltonhultgren commented 1 year ago

@leandrojmp Is it not possible to define the limit on your cgroup to 100% of your CPU (which is the same as not having the limit but it'll make the rule happy)?

Either way, it seems odd that you're getting both sides of the issue. Either you have the cgroup metrics being reported or not, I'm not sure what's going on there. If you hit /_nodes/_local/stats on the Elasticsearch node giving the alert, do you see the cgroup metrics filled in with a quota?

leandrojmp commented 1 year ago

Hello @miltonhultgren,

Is it not possible to define the limit on your cgroup to 100% of your CPU (which is the same as not having the limit but it'll make the rule happy)?

I didn't make any changes to cgroups or apply any limits. I'm running the default rpm package distribution: I just installed the package, configured Elasticsearch and started the service.

This is how systemd works; it uses cgroups. This is the output of systemctl status elasticsearch on one of the nodes:

● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
           └─override.conf
   Active: active (running) since Tue 2023-06-20 23:42:20 UTC; 3 months 13 days ago
     Docs: https://www.elastic.co
 Main PID: 1226 (java)
    Tasks: 359 (limit: 408607)
   Memory: 58.1G
   CGroup: /system.slice/elasticsearch.service
           ├─1226 /usr/share/elasticsearch/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=/usr/share/elasticsearch/bin/elasticsearch -Dcli.libs=lib/tools/server-cli -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.type=rpm -cp /usr/share/elast>
           ├─3633 /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.ne>
           └─4464 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

So everything is default; this probably affects anyone that runs Elasticsearch using the rpm or deb packages. I prefer not to change anything related to cgroups because I'm not familiar with how this works with systemd and this is a production environment.

Either way, it seems odd that you're getting both sides of the issue. Either you have the cgroup metrics being reported or not, I'm not sure what's going on there.

Yeah, if I do not set monitoring.ui.container.elasticsearch.enabled on Kibana I get the error about a non-containerized workload with resource limits, but if I enable it I get the error about a containerized workload without resource limits. Either way I get alerts for all my nodes. Should I open another issue to track this?

I upgraded just the monitoring cluster to 8.10.2; the production cluster and Metricbeat are still on 8.8.1. I'm not sure if this has an impact or not, but an upgrade to 8.10.2 in the production cluster is planned for this week.

If you hit /_nodes/_local/stats on the Elasticsearch node giving the alert, do you see the cgroup metrics filled in with a quota?

This happens for all nodes, and this is the cgroup part in the response for one of them:

        "cgroup": {
          "cpuacct": {
            "control_group": "/",
            "usage_nanos": 30412695915531812
          },
          "cpu": {
            "control_group": "/",
            "cfs_period_micros": 100000,
            "cfs_quota_micros": -1,
            "stat": {
              "number_of_elapsed_periods": 0,
              "number_of_times_throttled": 0,
              "time_throttled_nanos": 0
            }
          },
          "memory": {
            "control_group": "/system.slice/elasticsearch.service",
            "limit_in_bytes": "9223372036854771712",
            "usage_in_bytes": "62579138560"
          }
        }

miltonhultgren commented 1 year ago

Got it, thanks for the insight @leandrojmp! This change had a bigger effect than we anticipated (the flag being named "container" is misleading) since it affects all cgroup runtimes. Like you mentioned, this is the default for some setups, which we didn't expect (a miss on our part).

Thanks for sharing the results of the stats endpoint, I see the issue now. When Kibana is configured for non-container (non-cgroup) workloads it used to check whether the metric values are null (meaning not reported at all), while in the container (cgroup) path it checks whether they are not -1, so there isn't an exact overlap between the two cases. The fix introduced by https://github.com/elastic/kibana/pull/167244 should address both of those cases.
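
To make the non-overlap concrete with the data above (cfs_quota_micros: -1), the two old checks and a unified one behave roughly like this (a sketch; the names are assumptions, not the rule's actual code):

// Sketch of the overlap problem (names are assumptions).
type Quota = number | null | undefined;

// Old non-container path: any reported value counts as "limits configured".
const oldNonContainerHasLimit = (quota: Quota) => quota != null;
// Old container path: anything other than -1 counts as "limits configured".
const oldContainerHasLimit = (quota: Quota) => quota !== -1;
// Unified check: missing, null and -1 all mean "no limit".
const hasCpuLimit = (quota: Quota) => quota != null && quota !== -1;

console.log(oldNonContainerHasLimit(-1)); // true  -> "non-containerized but has resource limits"
console.log(oldContainerHasLimit(-1));    // false -> "containerized but has no resource limits"
console.log(hasCpuLimit(-1));             // false -> treated as "no limit" in both paths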

Apologies again for all the noise this is causing!

leandrojmp commented 1 year ago

@miltonhultgren

So when 8.11 drops I would only need to upgrade the monitoring cluster to not get the alerts anymore, right? We upgrade our production cluster every quarter; we will upgrade to 8.10.2 this week and the next upgrade will only happen next quarter.

miltonhultgren commented 1 year ago

@tonyghiani Did we backport this to 8.10.X or only 8.11.X? Let's make sure this comes out with the next patch release for 8.10!

@leandrojmp The alerting system only runs in your monitoring cluster's Kibana so upgrading that will be enough!

jacoor commented 1 year ago

@miltonhultgren Thanks for all the work on this. Could you confirm whether this has been backported to 8.10.X and, if so, to which exact version? https://github.com/elastic/kibana/pull/167244 only has the 8.11 label and backport: skip.

tonyghiani commented 1 year ago

@miltonhultgren apologies for the delay, I completely missed your mention here. The PR was not backported into 8.10.x, I'll see if it's possible to bring the same changes to the latest 8.10.x patch.

tonyghiani commented 1 year ago

This https://github.com/elastic/kibana/pull/170740 should backport the fix to v8.10.

tonyghiani commented 1 year ago

I closed the above PR since there won't be new patches released for 8.10.x, so the fix will be available starting from 8.11.0.