elastic / elastic-package

elastic-package - Command line tool for developing Elastic Integrations

Low disk watermark exceeded in CI pipeline, resulting in NoShardAvailableActionException and failing tests #813

Open dominiqueclarke opened 2 years ago

dominiqueclarke commented 2 years ago

elastic-package version: v0.48.0
Stack versions: 8.2.0-SNAPSHOT, 8.3.0-SNAPSHOT
Pipeline affected: https://apm-ci.elastic.co/job/apm-agent-rum/job/e2e-synthetics-mbp/
Tests that run on this pipeline: https://github.com/elastic/synthetics/blob/main/__tests__/e2e/synthetics.journey.ts
Documentation for these tests: https://github.com/elastic/synthetics/tree/main/__tests__/e2e
Script where elastic-package is invoked: https://github.com/elastic/synthetics/blob/main/__tests__/e2e/scripts/setup_integration.sh

The cluster spun up for Elastic Synthetics e2e tests is reporting low disk watermark exceeded, even when no Synthetics data is indexed.

The cluster is brought up with elastic-package stack up --version {version}.
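For reference, a minimal invocation of that shape looks like the following; the exact version and flags used in CI come from the linked setup_integration.sh, so the values below are only an example:

# Example only; the real values come from setup_integration.sh
elastic-package stack up -d --version 8.3.0-SNAPSHOT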

This error was uncovered by extracting the ES logs with the Synthetics tests both enabled and disabled. Even with the tests disabled, ES still reports the low disk watermark as exceeded:

{"@timestamp":"2022-05-07T00:02:29.686Z", "log.level": "INFO", "message":"low disk watermark [85%] exceeded on [EpUAMaKqQaOdKT-scbq1Kg][2e2f4f385b0a][/usr/share/elasticsearch/data] free: 15.2gb[10.4%], replicas will not be assigned to this node", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[2e2f4f385b0a][generic][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor","elasticsearch.node.name":"2e2f4f385b0a","elasticsearch.cluster.name":"elasticsearch"}

Full logs: https://apm-ci.elastic.co/job/apm-agent-rum/job/e2e-synthetics-mbp/view/change-requests/job/PR-499/25/consoleFull (search for "Fetching ES logs")
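For context, the 85% threshold in that log line is Elasticsearch's default cluster.routing.allocation.disk.watermark.low setting. A common workaround for disposable CI clusters (not something elastic-package configures by itself) is to relax or disable the disk-based allocation thresholds via the cluster settings API, for example (the host and credentials below assume the default elastic-package stack and may differ in this setup):

# Sketch only: disable disk-based shard allocation thresholds on a throwaway CI cluster
curl -k -u elastic:changeme -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.disk.threshold_enabled": false}}'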

When tests are enabled, shards are not allocated for the Synthetics data, resulting in NoShardAvailableActionException.

{"error":{"root_cause":[{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"U2HSiqeRQ7SFlqVGPhV73A","shard":"0","index":".ds-synthetics-browser-default-2022.05.06-000001"},{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"8yU95eCjShqTwHQZO3Kyxg","shard":"0","index":".ds-synthetics-http-default-2022.05.06-000001"},{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"bqS1e4RWS3WiJtzFKk3TgQ","shard":"0","index":".ds-synthetics-tcp-default-2022.05.06-000001"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":".ds-synthetics-browser-default-2022.05.06-000001","node":null,"reason":{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"U2HSiqeRQ7SFlqVGPhV73A","shard":"0","index":".ds-synthetics-browser-default-2022.05.06-000001"}},{"shard":0,"index":".ds-synthetics-http-default-2022.05.06-000001","node":null,"reason":{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"8yU95eCjShqTwHQZO3Kyxg","shard":"0","index":".ds-synthetics-http-default-2022.05.06-000001"}},{"shard":0,"index":".ds-synthetics-tcp-default-2022.05.06-000001","node":null,"reason":{"type":"no_shard_available_action_exception","reason":null,"index_uuid":"bqS1e4RWS3WiJtzFKk3TgQ","shard":"0","index":".ds-synthetics-tcp-default-2022.05.06-000001"}}]},"status":503}

Full logs: https://apm-ci.elastic.co/job/apm-agent-rum/job/e2e-synthetics-mbp/view/change-requests/job/PR-499/24/consoleFull (search for no_shard_available_action_exception or NoShardAvailableActionException)
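When this happens, the reason the shards stay unassigned can be confirmed directly against the cluster, for example (again assuming the default elastic-package stack endpoint and credentials):

# Show shard state and the unassignment reason for the synthetics data streams
curl -k -u elastic:changeme "https://localhost:9200/_cat/shards/.ds-synthetics-*?v&h=index,shard,prirep,state,unassigned.reason"

# Ask Elasticsearch to explain its allocation decision in detail
curl -k -u elastic:changeme "https://localhost:9200/_cluster/allocation/explain?pretty"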

This has caused Elastic Synthetics e2e tests to fail consistently for the last few days.

mtojek commented 2 years ago

I don't think this is an elastic-package or Elasticsearch error. It looks like the host's disk is full. I have seen this myself whenever I don't clean up my Docker images. Have you checked the disk capacity?
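If the host disk is the culprit, something along these lines, run on the CI worker itself, would confirm it and free up space (the aggressive prune is only safe on disposable workers):

# How much space Docker is using (images, containers, volumes, build cache)
docker system df

# Overall host disk usage
df -h

# Remove all unused Docker data, including volumes
docker system prune -af --volumes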

kuisathaverat commented 2 years ago

These are Ubuntu 18 agents. Checking the daily tests we run on those agents, they have about 22 GB of free space before running anything:

[2022-05-09T05:04:07.097Z] Filesystem      Size  Used Avail Use% Mounted on
[2022-05-09T05:04:07.097Z] udev            7.4G     0  7.4G   0% /dev
[2022-05-09T05:04:07.097Z] tmpfs           1.5G  900K  1.5G   1% /run
[2022-05-09T05:04:07.097Z] /dev/sda1       146G  125G   22G  86% /
[2022-05-09T05:04:07.097Z] tmpfs           7.4G     0  7.4G   0% /dev/shm
[2022-05-09T05:04:07.097Z] tmpfs           5.0M     0  5.0M   0% /run/lock
[2022-05-09T05:04:07.097Z] tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
[2022-05-09T05:04:07.098Z] /dev/loop0      295M  295M     0 100% /snap/google-cloud-sdk/239
[2022-05-09T05:04:07.098Z] /dev/loop1       56M   56M     0 100% /snap/core18/2344
[2022-05-09T05:04:07.098Z] /dev/loop2       45M   45M     0 100% /snap/snapd/15534
[2022-05-09T05:04:07.098Z] /dev/sda15      105M  4.4M  100M   5% /boot/efi
mtojek commented 2 years ago

@dominiqueclarke You may want to compare those stats with the stats at the moment the elastic-package stack fails.

dominiqueclarke commented 2 years ago

@mtojek @kuisathaverat

So the error reported today is actually https://github.com/elastic/uptime-dev/issues/99. Nothing on my end has changed between the error reported in the linked issue and the error reported in this issue. I was able to reproduce this error on Friday, but now I am only able to reproduce https://github.com/elastic/uptime-dev/issues/99 with the same script.

Disk usage reported after the Kibana container reported an unhealthy status. Test run: https://apm-ci.elastic.co/job/apm-agent-rum/job/e2e-synthetics-mbp/view/change-requests/job/PR-499/28/console (you can search for "Disk usage"):

14:52:20  Filesystem      Size  Used Avail Use% Mounted on
14:52:20  udev            7.4G     0  7.4G   0% /dev
14:52:20  tmpfs           1.5G  1.1M  1.5G   1% /run
14:52:20  /dev/sda1       146G  132G   15G  91% /
14:52:20  tmpfs           7.4G     0  7.4G   0% /dev/shm
14:52:20  tmpfs           5.0M     0  5.0M   0% /run/lock
14:52:20  tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
14:52:20  /dev/loop0      295M  295M     0 100% /snap/google-cloud-sdk/239
14:52:20  /dev/loop1       45M   45M     0 100% /snap/snapd/15534
14:52:20  /dev/loop2       56M   56M     0 100% /snap/core18/2344
14:52:20  /dev/sda15      105M  4.4M  100M   5% /boot/efi
kuisathaverat commented 2 years ago

Could you add these commands to the end of the execution?

docker ps -a
docker stats --no-stream  --no-trunc  
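One way to make sure these run even when the e2e step fails is to register them in a shell trap in the test script; a rough sketch:

# Sketch: collect Docker and disk diagnostics on exit, success or failure
collect_diagnostics() {
  docker ps -a
  docker stats --no-stream --no-trunc
  df -h
}
trap collect_diagnostics EXIT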
mtojek commented 2 years ago

Dominique, I pulled all stack logs from the Integrations repository and ran grep -rni watermark against them. I didn't find this issue in any of those logs (across different stack versions).

I'm afraid it might be a problem related specifically to the Synthetics integration, and you may want to start digging there.