Large experiments cause instability

mctigger commented 2 years ago

This is similar to the problems that we discussed here https://clearml.slack.com/archives/CTK20V944/p1622050547263400, however still exist with the current clearml-server and can be reproduced by me:

Run experiments with millions of steps with logging done every few steps with multiple scalar values, plots and more rarely debug samples
Try to delete one or multiple experiments via the WebUI
Deletion will succeed for small experiments (e.g. 50k steps) but fail after some time for large experiments.
From now on loading of "Results" in the WebUI will fail very often for any experiment
The server become instable and will fail to load completely sometimes

This behavior does not happen with a fresh server. Deletion works fine even with large experiments. After some time the problem arises and stays until I restart the server (so far, with older clearml-server versions it persisted after restart).

I will add error logs to this issue in the future when the problems arise again. If someone has similar experiences please share here, so I can find out whether this is something that is specific to my machine.

jkhenning commented 2 years ago

Hן @mctigger ,

This sounds like indexing issues in the ES service - probably something that takes a long time and thus returns a timeout to the server.

When it happens, can you get some logs from the apiserver and elasticsearch services? Use sudo docker logs clearml-apiserver and sudo docker logs clearml-elastic

mctigger commented 2 years ago

The following experiment failed to reset

Error (shortened)

`General data error (ConflictError(409, '{"took":35,"timed_out":false,"total":660991,"deleted":0,"batches":1,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"cef11b7d08cbb5704c6d62532343288a","cause":{"type":"version_conflict_engine_exception","reason":"[cef11b7d08cbb5704c6d62532343288a]: version conflict, required seqNo [313285126], primary term [2]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"20a332379d85b1f3c1f741e749c6422f","cause":{"type":"version_conflict_engine_exception","reason":"[20a332379d85b1f3c1f741e749c6422f]: version conflict, required seqNo [313285127], primary term [2]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"097eb0c793718bc37323ae5d8d54e7e3","cause":{"type":"version_conflict_engine_exception","reason":"[097eb0c793718bc37323ae5d8d54e7e3]: version conflict, required seqNo [313285128], primary term [2]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a5983f8163d312697e8384f326f0049d","cause":{"type":"version_conflict_engine_exception","reason":"[a5983f8163d312697e8384f326f0049d]: version conflict, required seqNo [313285129], primary term [2]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"247952ddd2c7652d05aae0b2850b1db3","cause": ... `

mctigger commented 2 years ago

sudo docker logs clearml-apiserver

Here are the only errors in this log. But the error happened 2 hours earlier.

[2021-09-22 23:57:58,758] [9] [WARNING] [elasticsearch] POST http://elasticsearch:9200/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query?refresh=true [status:N/A request:60.050s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-09-22 23:57:58,791] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 208ms
[2021-09-22 23:57:58,823] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 3ms
[2021-09-22 23:57:58,824] [9] [WARNING] [elasticsearch] POST http://elasticsearch:9200/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query?refresh=true [status:409 request:0.065s]

sudo docker logs clearml-elastic

OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-09-22T10:17:25,047Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/nvme0n1p2)]], net usable_space [1.3tb], net total_space [1.7tb], types [ext4]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,048Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [15.6gb], compressed ordinary object pointers [true]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [JUV7hB8TQKyhmdMIk_TmaA], cluster name [clearml]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.11.0-27-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-10421376390742623811, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms16g, -Xmx16g, -XX:MaxDirectMemorySize=8589934592, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,964Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [aggs-matrix-stats]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,964Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [analysis-common]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [flattened]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [frozen-indices]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-common]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-geoip]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-user-agent]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-expression]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-mustache]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-painless]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [mapper-extras]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [parent-join]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [percolator]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [rank-eval]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [reindex]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [repository-url]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [search-business-rules]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [spatial]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transform]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transport-netty4]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [vectors]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-analytics]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ccr]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-core]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-deprecation]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-enrich]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-graph]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ilm]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-logstash]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ml]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-monitoring]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-rollup]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-security]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-sql]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-voting-only-node]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-watcher]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "no plugins loaded" }
{"type": "deprecation", "timestamp": "2021-09-22T10:17:27,055Z", "level": "WARN", "component": "o.e.d.c.s.Settings", "cluster.name": "clearml", "node.name": "clearml", "message": "[discovery.zen.minimum_master_nodes] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version." }
{"type": "server", "timestamp": "2021-09-22T10:17:28,347Z", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "clearml", "node.name": "clearml", "message": "[controller/127] [Main.cc@110] controller (64 bit): Version 7.6.2 (Build e06ef9d86d5332) Copyright (c) 2020 Elasticsearch BV" }
{"type": "server", "timestamp": "2021-09-22T10:17:28,571Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "clearml", "node.name": "clearml", "message": "using discovery type [single-node] and seed hosts providers [settings]" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,028Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "initialized" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,028Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "starting ..." }
{"type": "server", "timestamp": "2021-09-22T10:17:29,282Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.22.0.4:9300}, bound_addresses {0.0.0.0:9300}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,485Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "clearml", "node.name": "clearml", "message": "cluster UUID [hK8xFPPDSqqlBHbO-CN4kw]" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,669Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "clearml", "node.name": "clearml", "message": "elected-as-master ([1] nodes joined)[{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 4, version: 65, delta: master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20}]}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,720Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "clearml", "node.name": "clearml", "message": "master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20}]}, term: 4, version: 65, reason: Publication{term=4, version=65}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,746Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.22.0.4:9200}, bound_addresses {0.0.0.0:9200}", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:29,746Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "started", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:29,831Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "clearml", "node.name": "clearml", "message": "license [3694cc4d-2664-43fc-8763-57b48da7f0c9] mode [basic] - valid", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:29,832Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "clearml", "node.name": "clearml", "message": "recovered [10] indices into cluster_state", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:31,639Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "clearml", "node.name": "clearml", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,335Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events] for index patterns [events-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,362Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_log] for index patterns [events-log-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,389Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_plot] for index patterns [events-plot-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,415Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_training_debug_image] for index patterns [events-training_debug_image-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,449Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [worker_stats] for index patterns [worker_stats_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }
{"type": "server", "timestamp": "2021-09-22T10:17:54,473Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [queue_metrics] for index patterns [queue_metrics_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA"  }

mctigger commented 2 years ago

Just tried again resetting and it worked. So just I like for deletion, the errors seem to fix themselves after a while...

jkhenning commented 2 years ago

This looks like it takes time for ES to reindex and remove the deleted documents from the index, during which time any requests from the server may result in these errors... I think it's just a matter of hiding these errors as reindex times are load/performance related...

evg-allegro commented 2 years ago

Hi @mctigger, it seems that massive delete operations create a heavy load on Elasticsearch. I would try increasing the memory allocated to Elasticsearch jvm if you did not do it already. The default setting that we have in the docker-compose for elasticsearch service is pretty conservative ES_JAVA_OPTS: -Xms2g -Xmx2g It allows ES to use up to 2Gb of RAM. I would increase these numbers to up to the half of your available RAM (but not more than 32Gb)

mctigger commented 2 years ago

I already had it set to use 16G of RAM. Everything else is set to default.

mctigger commented 2 years ago

After restarting the server because it became unresponsive I cannot do any operations on my experiments. However, viewing them is still possible.

Here is an example error message. I tried to enqueue a task:

General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10', '_type': '_doc', '_id': 'T-YUmHwB6OL9GSKaKE15', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [1079239][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][T-YUmHwB6OL9GSKaKE15], source[na]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 4 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@310e6893[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 409896]]

mctigger commented 2 years ago

Or when trying to reset a task: https://gist.github.com/mctigger/b4b5791b758c4ec651f9fb180be0f19e

jkhenning commented 2 years ago

Hi @mctigger,

This seems like an ES index error, can you share the ES log (using sudo docker logs clearml-elastic)?

mctigger commented 2 years ago

Here it is (I increased memory to 64GB for elastic):

OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release. {"type": "server", "timestamp": "2021-10-19T11:48:05,276Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/nvme0n1p2)]], net usable_space [1tb], net total_space [1.7tb], types [ext4]" } {"type": "server", "timestamp": "2021-10-19T11:48:05,278Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [63.6gb], compressed ordinary object pointers [false]" } {"type": "server", "timestamp": "2021-10-19T11:48:05,363Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [JUV7hB8TQKyhmdMIk_TmaA], cluster name [clearml]" } {"type": "server", "timestamp": "2021-10-19T11:48:05,363Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.11.0-37-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" } {"type": "server", "timestamp": "2021-10-19T11:48:05,364Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" } {"type": "server", "timestamp": "2021-10-19T11:48:05,364Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-1244715195998405414, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms64g, -Xmx64g, -XX:MaxDirectMemorySize=34359738368, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,348Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [aggs-matrix-stats]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [analysis-common]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [flattened]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [frozen-indices]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-common]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-geoip]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-user-agent]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-expression]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-mustache]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-painless]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [mapper-extras]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [parent-join]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [percolator]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [rank-eval]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [reindex]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [repository-url]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [search-business-rules]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [spatial]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transform]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transport-netty4]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [vectors]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-analytics]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ccr]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-core]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-deprecation]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-enrich]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-graph]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ilm]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-logstash]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ml]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-monitoring]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-rollup]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-security]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-sql]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-voting-only-node]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-watcher]" } {"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "no plugins loaded" } {"type": "deprecation", "timestamp": "2021-10-19T11:48:08,877Z", "level": "WARN", "component": "o.e.d.c.s.Settings", "cluster.name": "clearml", "node.name": "clearml", "message": "[discovery.zen.minimum_master_nodes] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version." } {"type": "server", "timestamp": "2021-10-19T11:48:11,496Z", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "clearml", "node.name": "clearml", "message": "[controller/128] [Main.cc@110] controller (64 bit): Version 7.6.2 (Build e06ef9d86d5332) Copyright (c) 2020 Elasticsearch BV" } {"type": "server", "timestamp": "2021-10-19T11:48:11,847Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "clearml", "node.name": "clearml", "message": "using discovery type [single-node] and seed hosts providers [settings]" } {"type": "server", "timestamp": "2021-10-19T11:48:12,245Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "initialized" } {"type": "server", "timestamp": "2021-10-19T11:48:12,246Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "starting ..." } {"type": "server", "timestamp": "2021-10-19T11:48:12,356Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.20.0.3:9300}, bound_addresses {0.0.0.0:9300}" } {"type": "server", "timestamp": "2021-10-19T11:48:12,536Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "clearml", "node.name": "clearml", "message": "cluster UUID [hK8xFPPDSqqlBHbO-CN4kw]" } {"type": "server", "timestamp": "2021-10-19T11:48:12,586Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "clearml", "node.name": "clearml", "message": "elected-as-master ([1] nodes joined)[{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 11, version: 171, delta: master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20}]}" } {"type": "server", "timestamp": "2021-10-19T11:48:12,639Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "clearml", "node.name": "clearml", "message": "master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20}]}, term: 11, version: 171, reason: Publication{term=11, version=171}" } {"type": "server", "timestamp": "2021-10-19T11:48:12,662Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.20.0.3:9200}, bound_addresses {0.0.0.0:9200}", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:12,662Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "started", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:12,735Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "clearml", "node.name": "clearml", "message": "license [3694cc4d-2664-43fc-8763-57b48da7f0c9] mode [basic] - valid", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:12,736Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "clearml", "node.name": "clearml", "message": "recovered [12] indices into cluster_state", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:14,719Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "clearml", "node.name": "clearml", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,879Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events] for index patterns [events-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,911Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_log] for index patterns [events-log-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,939Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_plot] for index patterns [events-plot-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,965Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_training_debug_image] for index patterns [events-training_debug_image-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,999Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [worker_stats] for index patterns [worker_stats_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:25,023Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [queue_metrics] for index patterns [queue_metrics_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:37,352Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][265] overhead, spent [368ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:40,512Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][268] overhead, spent [371ms] collecting in the last [1.1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:44,514Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][272] overhead, spent [522ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:47,907Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][275] overhead, spent [504ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:51,908Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][279] overhead, spent [554ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:55,279Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][282] overhead, spent [430ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:59,280Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][286] overhead, spent [405ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:02,669Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][289] overhead, spent [446ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:06,670Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][293] overhead, spent [419ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:09,984Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][296] overhead, spent [452ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:13,985Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][300] overhead, spent [460ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:17,191Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][303] overhead, spent [445ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:20,484Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][306] overhead, spent [405ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:23,816Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][309] overhead, spent [414ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:27,132Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][312] overhead, spent [377ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:31,133Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][316] overhead, spent [410ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:34,134Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][319] overhead, spent [276ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:41,445Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][326] overhead, spent [308ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:44,446Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][329] overhead, spent [277ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:47,516Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][332] overhead, spent [306ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:51,517Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][336] overhead, spent [421ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:56,518Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][341] overhead, spent [256ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:54:13,716Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][358] overhead, spent [293ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:55:56,959Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][461] overhead, spent [352ms] collecting in the last [1.1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:02,253Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][466] overhead, spent [428ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:07,280Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][471] overhead, spent [447ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:12,299Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][476] overhead, spent [466ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:17,675Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][481] overhead, spent [449ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:22,735Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][486] overhead, spent [443ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:29,737Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][493] overhead, spent [433ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:43,740Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][507] overhead, spent [382ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:54,742Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][518] overhead, spent [264ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }

jkhenning commented 2 years ago

I don't see any error here, but the error should appear here...

mctigger commented 2 years ago

From the timestamp it seems like this is only a very small part of the log, however with the command you provided this is the only part that shows.

jkhenning commented 2 years ago

Perhaps you've restarted the Elastic since? the command should output all of the logs...

mctigger commented 2 years ago

Not really... I will create a fresh server, then I will not change anything until the error happens again and just send you all the logs.

mctigger commented 2 years ago

Maybe my problems are related to ElasticSearch just being really slow? It seems I tried to delete a experiment and ES throws error since maybe some parts are already deleted, but the deletion is not yet finished? Shouldn't be such a delete near instant? I have nearly no CPU utilization, but many processes running in context to ES!

Error (shortened)

`General data error (ConflictError(409, '{"took":71,"timed_out":false,"total":5370376,"deleted":0,"batches":1,"version_conflicts":1000,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"38ea5ba45e8a201a5f7b85ac270f771a","cause":{"type":"version_conflict_engine_exception","reason":"[38ea5ba45e8a201a5f7b85ac270f771a]: version conflict, required seqNo [245344419], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e447fdc03347c42c30ae2e8697e1443c","cause":{"type":"version_conflict_engine_exception","reason":"[e447fdc03347c42c30ae2e8697e1443c]: version conflict, required seqNo [245344420], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"7b51515d77e4737f66a9d596298ec434","cause":{"type":"version_conflict_engine_exception","reason":"[7b51515d77e4737f66a9d596298ec434]: version conflict, required seqNo [245344421], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"0ad0a9f9fc307bb325c3be5d239a0807","cause":{"type":"version_conflict_engine_exception","reason":"[0ad0a9f9fc307bb325c3be5d239a0807]: version conflict, required seqNo [245344355], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"7de948a11365faa544756ae6adb5f20f","cause":{"type":"version_conflict_engine_exception","reason":"[7de948a11365faa544756ae6adb5f20f]: version conflict, required seqNo [245344357], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"2bc8a054028b7d9a5bd4195a9c012cb9","cause":{"type":"version_conflict_engine_exception","reason":"[2bc8a054028b7d9a5bd4195a9c012cb9]: version conflict, required seqNo [245344375], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e8bfd72ac1cf0922728d6859bdc61ba2","cause":{"type":"version_conflict_engine_exception","reason":"[e8bfd72ac1cf0922728d6859bdc61ba2]: version conflict, required seqNo [245344377], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"3c6b0c8f1de82994680b84884dca3bcb","cause":{"type":"version_conflict_engine_exception","reason":"[3c6b0c8f1de82994680b84884dca3bcb]: version conflict, required seqNo [245344379], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"bcd2cde55c4238d2265f2969fd14f3c3","cause":{"type":"version_conflict_engine_exception","reason":"[bcd2cde55c4238d2265f2969fd14f3c3]: version conflict, required seqNo [245344381], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ae43ffc8ccf7a0429ddf207a1cfa84e6","cause":{"type":"version_conflict_engine_exception","reason":"[ae43ffc8ccf7a0429ddf207a1cfa84e6]: version conflict, required seqNo [245344383], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"21c65e35b755c617bdbbd3f6b0960198","cause":{"type":"version_conflict_engine_exception","reason":"[21c65e35b755c617bdbbd3f6b0960198]: version conflict, required seqNo [245344385], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ed9808e8dd9f80c3ad4996a594aaf407","cause":{"type":"version_conflict_engine_exception","reason":"[ed9808e8dd9f80c3ad4996a594aaf407]: version conflict, required seqNo [245344386], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e801416057645489cd29ab61b16b094e","cause":{"type":"version_conflict_engine_exception","reason":"[e801416057645489cd29ab61b16b094e]: version conflict, required seqNo [245344387], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ebc818303b09621ede3886bc408d58db","cause":{"type":"version_conflict_engine_exception","reason":"[ebc818303b09621ede3886bc408d58db]: version conflict, required seqNo [245344388], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"30b0d78c76d42c01824c00b31071580c","cause":{"type":"version_conflict_engine_exception","reason":"[30b0d78c76d42c01824c00b31071580c]: version conflict, required seqNo [245344389], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"482d51a5268bdb301eb5b4a7191b8fe1","cause":{"type":"version_conflict_engine_exception","reason":"[482d51a5268bdb301eb5b4a7191b8fe1]: version conflict, required seqNo [245344390], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d5d534c543d1b5de0f205c640fa4d50d","cause":{"type":"version_conflict_engine_exception","reason":"[d5d534c543d1b5de0f205c640fa4d50d]: version conflict, required seqNo [245344391], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"12c678539683c2f6990b04ac89f221df","cause":{"type":"version_conflict_engine_exception","reason":"[12c678539683c2f6990b04ac89f221df]: version conflict, required seqNo [245344392], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"646df03969b3f1b01cd797382fbf09ec","cause":{"type":"version_conflict_engine_exception","reason":"[646df03969b3f1b01cd797382fbf09ec]: version conflict, required seqNo [245344393], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"1895081f5b2e06b107dc4e37566ace00","cause":{"type":"version_conflict_engine_exception","reason":"[1895081f5b2e06b107dc4e37566ace00]: version conflict, required seqNo [245344394], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"95c1ebd4acab05dc3dc21d9ba6956a66","cause":{"type":"version_conflict_engine_exception","reason":"[95c1ebd4acab05dc3dc21d9ba6956a66]: version conflict, required seqNo [245344423], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"1a67baa2f353eaf9186b38e22c8dd8ab","cause":{"type":"version_conflict_engine_exception","reason":"[1a67baa2f353eaf9186b38e22c8dd8ab]: version conflict, required seqNo [245344424], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"061fb23487b69fc9ee2dc033ac91b38f","cause":{"type":"version_conflict_engine_exception","reason":"[061fb23487b69fc9ee2dc033ac91b38f]: version conflict, required seqNo [245344425], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"decc1fd75b58996bc0c0962e5cc0e94c","cause":{"type":"version_conflict_engine_exception","reason":"[decc1fd75b58996bc0c0962e5cc0e94c]: version conflict, required seqNo [245344426], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"8c1ff9ca3380e01f8e27bf238931ca15","cause":{"type":"version_conflict_engine_exception","reason":"[8c1ff9ca3380e01f8e27bf238931ca15]: version conflict, required seqNo [245344427], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d0c973bdb956d4416866cbecb1b82c33","cause":{"type":"version_conflict_engine_exception","reason":"[d0c973bdb956d4416866cbecb1b82c33]: version conflict, required seqNo [245344428], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"1764ea382df9aeedad13ef6df57502b2","cause":{"type":"version_conflict_engine_exception","reason":"[1764ea382df9aeedad13ef6df57502b2]: version conflict, required seqNo [245344429], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f90c9dbc48fb6596bf9ff1ee00304d89","cause":{"type":"version_conflict_engine_exception","reason":"[f90c9dbc48fb6596bf9ff1ee00304d89]: version conflict, required seqNo [245344430], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a135e54d051232bff9cfdc6f9c28561f","cause":{"type":"version_conflict_engine_exception","reason":"[a135e54d051232bff9cfdc6f9c28561f]: version conflict, required seqNo [245344431], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"574fc72ef6bb4dcab491fc3835aaf8ec","cause":{"type":"version_conflict_engine_exception","reason":"[574fc72ef6bb4dcab491fc3835aaf8ec]: version conflict, required seqNo [245344396], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"31f2ab88ebf3b4ebfa99f78ac72c5f85","cause":{"type":"version_conflict_engine_exception","reason":"[31f2ab88ebf3b4ebfa99f78ac72c5f85]: version conflict, required seqNo [245344403], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"36abe20f1134507ee77afba9c080fc44","cause":{"type":"version_conflict_engine_exception","reason":"[36abe20f1134507ee77afba9c080fc44]: version conflict, required seqNo [245344907], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a92f16fc0bb68fb41c4139a9dcb48d90","cause":{"type":"version_conflict_engine_exception","reason":"[a92f16fc0bb68fb41c4139a9dcb48d90]: version conflict, required seqNo [245344908], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"22c58715c9296627d595dc29e9f8d0b9","cause":{"type":"version_conflict_engine_exception","reason":"[22c58715c9296627d595dc29e9f8d0b9]: version conflict, required seqNo [245344909], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d0e62d3c74543992282bd62167f976c5","cause":{"type":"version_conflict_engine_exception","reason":"[d0e62d3c74543992282bd62167f976c5]: version conflict, required seqNo [245344910], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"18d3c289d2ab2514c80c5cca5da0f020","cause":{"type":"version_conflict_engine_exception","reason":"[18d3c289d2ab2514c80c5cca5da0f020]: version conflict, required seqNo [245344911], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"acea11f2f9f57698e1387427965a0caa","cause":{"type":"version_conflict_engine_exception","reason":"[acea11f2f9f57698e1387427965a0caa]: version conflict, required seqNo [245344912], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"8307537b9acd4b06374f6070e8a895e0","cause":{"type":"version_conflict_engine_exception","reason":"[8307537b9acd4b06374f6070e8a895e0]: version conflict, required seqNo [245344913], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ec9d51be3333f2a283ddaeb8c3938a53","cause":{"type":"version_conflict_engine_exception","reason":"[ec9d51be3333f2a283ddaeb8c3938a53]: version conflict, required seqNo [245344914], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"3bcadd65f06b0bcdcef8ed61f9dd5abf","cause":{"type":"version_conflict_engine_exception","reason":"[3bcadd65f06b0bcdcef8ed61f9dd5abf]: version conflict, required seqNo [245344915], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e3f1cfb4aaf087054d24637bbbfdd2d3","cause":{"type":"version_conflict_engine_exception","reason":"[e3f1cfb4aaf087054d24637bbbfdd2d3]: version conflict, required seqNo [245344916], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d397c0497cb0b2dd531869554e5533ae","cause":{"type":"version_conflict_engine_exception","reason":"[d397c0497cb0b2dd531869554e5533ae]: version conflict, required seqNo [245344917], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"74e070ed3f2f7b75823ec003a223442c","cause":{"type":"version_conflict_engine_exception","reason":"[74e070ed3f2f7b75823ec003a223442c]: version conflict, required seqNo [245344918], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ab9776451734163b42f7c42bf218dd9f","cause":{"type":"version_conflict_engine_exception","reason":"[ab9776451734163b42f7c42bf218dd9f]: version conflict, required seqNo [245344919], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f14ef8a47f9ea8eb44a173ccca83fd41","cause":{"type":"version_conflict_engine_exception","reason":"[f14ef8a47f9ea8eb44a173ccca83fd41]: version conflict, required seqNo [245344920], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"15f499ea063dc4bddaca8c9f9f6b9724","cause":{"type":"version_conflict_engine_exception","reason":"[15f499ea063dc4bddaca8c9f9f6b9724]: version conflict, required seqNo [245344921], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"4b49eec50ed2896ac1c4919589cc5f5b","cause":{"type":"version_conflict_engine_exception","reason":"[4b49eec50ed2896ac1c4919589cc5f5b]: version conflict, required seqNo [245344922], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"045052a46695652a3b6c2c64163d0912","cause":{"type":"version_conflict_engine_exception","reason":"[045052a46695652a3b6c2c64163d0912]: version conflict, required seqNo [245344923], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ba485b3e43190b4fbecb79603bf0c3f8","cause":{"type":"version_conflict_engine_exception","reason":"[ba485b3e43190b4fbecb79603bf0c3f8]: version conflict, required seqNo [245344924], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"655636581f97cac1682c91bb08dcda77","cause":{"type":"version_conflict_engine_exception","reason":"[655636581f97cac1682c91bb08dcda77]: version conflict, required seqNo [245344925], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"b025b5b059fa6b209f45b6c5ca1a6e89","cause":{"type":"version_conflict_engine_exception","reason":"[b025b5b059fa6b209f45b6c5ca1a6e89]: version conflict, required seqNo [245344926], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"5465661039542aae9922abcdf514dec1","cause":{"type":"version_conflict_engine_exception","reason":"[5465661039542aae9922abcdf514dec1]: version conflict, required seqNo [245344927], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"46907a2dabc7884be73724fa4b5db5be","cause":{"type":"version_conflict_engine_exception","reason":"[46907a2dabc7884be73724fa4b5db5be]: version conflict, required seqNo [245344928], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"0b9b83f200b6fd339997fb933a47a42e","cause":{"type":"version_conflict_engine_exception","reason":"[0b9b83f200b6fd339997fb933a47a42e]: version conflict, required seqNo [245344929], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e98695743107dffc2df1f44988738ec3","cause":{"type":"version_conflict_engine_exception","reason":"[e98695743107dffc2df1f44988738ec3]: version conflict, required seqNo [245344930], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"3f38a7a7267949c7d633d5c229621d8b","cause":{"type":"version_conflict_engine_exception","reason":"[3f38a7a7267949c7d633d5c229621d8b]: version conflict, required seqNo [245344931], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"184f849f6aea859fed6856e936e56a98","cause":{"type":"version_conflict_engine_exception","reason":"[184f849f6aea859fed6856e936e56a98]: version conflict, required seqNo [245344932], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"957681256267c23167c210afe1ac8f64","cause":{"type":"version_conflict_engine_exception","reason":"[957681256267c23167c210afe1ac8f64]: version conflict, required seqNo [245344982], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"bc9c6928d871c84918aab40aa965de88","cause":{"type":"version_conflict_engine_exception","reason":"[bc9c6928d871c84918aab40aa965de88]: version conflict, required seqNo [245344983], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"afe5ab8073ecb43e8e378ffac9c10ad5","cause":{"type":"version_conflict_engine_exception","reason":"[afe5ab8073ecb43e8e378ffac9c10ad5]: version conflict, required seqNo [245344984], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"0df64190bdb127803dbc0bb0c0b89ceb","cause":{"type":"version_conflict_engine_exception","reason":"[0df64190bdb127803dbc0bb0c0b89ceb]: version conflict, required seqNo [245344985], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a457ebcb5e939d96c2c9ff1a02956c44","cause":{"type":"version_conflict_engine_exception","reason":"[a457ebcb5e939d96c2c9ff1a02956c44]: version conflict, required seqNo [245344986], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"73ec15621d4d3c38db2bcde3231b6a8a","cause":{"type":"version_conflict_engine_exception","reason":"[73ec15621d4d3c38db2bcde3231b6a8a]: version conflict, required seqNo [245344987], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"5e148981d57d9322afdd4bed9d724c7b","cause":{"type":"version_conflict_engine_exception","reason":"[5e148981d57d9322afdd4bed9d724c7b]: version conflict, required seqNo [245344988], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"2647b0498c6c90f1b2bfee309c06e4d1","cause":{"type":"version_conflict_engine_exception","reason":"[2647b0498c6c90f1b2bfee309c06e4d1]: version conflict, required seqNo [245344989], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"032d1bb8f3c4a6a28296e37b5eefc666","cause":{"type":"version_conflict_engine_exception","reason":"[032d1bb8f3c4a6a28296e37b5eefc666]: version conflict, required seqNo [245344990], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a65e6d244a793273c5e70d6971f57d22","cause":{"type":"version_conflict_engine_exception","reason":"[a65e6d244a793273c5e70d6971f57d22]: version conflict, required seqNo [245344991], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"9a1b46e5d6a9f47b6395508e2534291b","cause":{"type":"version_conflict_engine_exception","reason":"[9a1b46e5d6a9f47b6395508e2534291b]: version conflict, required seqNo [245344992], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"00b869a6e764b7798a79581fd96fba5e","cause":{"type":"version_conflict_engine_exception","reason":"[00b869a6e764b7798a79581fd96fba5e]: version conflict, required seqNo [245344993], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"8ab8e087a23142158ff5d5d8abb17ba0","cause":{"type":"version_conflict_engine_exception","reason":"[8ab8e087a23142158ff5d5d8abb17ba0]: version conflict, required seqNo [245344994], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"5a284e8a7f31388e6ef4f7da7730c481","cause":{"type":"version_conflict_engine_exception","reason":"[5a284e8a7f31388e6ef4f7da7730c481]: version conflict, required seqNo [245344995], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"3c3bd77f7611d6ea09ae06faabe61cdc","cause":{"type":"version_conflict_engine_exception","reason":"[3c3bd77f7611d6ea09ae06faabe61cdc]: version conflict, required seqNo [245344996], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"88726625eb0c0d1768d55088123560ec","cause":{"type":"version_conflict_engine_exception","reason":"[88726625eb0c0d1768d55088123560ec]: version conflict, required seqNo [245344997], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"43f011e9f335b7775e423948c9ad2735","cause":{"type":"version_conflict_engine_exception","reason":"[43f011e9f335b7775e423948c9ad2735]: version conflict, required seqNo [245344998], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f8d01bf0481b221c16d048b7c1c43512","cause":{"type":"version_conflict_engine_exception","reason":"[f8d01bf0481b221c16d048b7c1c43512]: version conflict, required seqNo [245344882], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"b75c3f2898484cb87ee44fd38ebcda1d","cause":{"type":"version_conflict_engine_exception","reason":"[b75c3f2898484cb87ee44fd38ebcda1d]: version conflict, required seqNo [245344883], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"b29ed46cd1ef412e29b8eaaa7429bfec","cause":{"type":"version_conflict_engine_exception","reason":"[b29ed46cd1ef412e29b8eaaa7429bfec]: version conflict, required seqNo [245344884], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"9e09cc4b7ecd7d0e8e3a6513262ad6c7","cause":{"type":"version_conflict_engine_exception","reason":"[9e09cc4b7ecd7d0e8e3a6513262ad6c7]: version conflict, required seqNo [245344885], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"9ba16cd9a4949e4b5e1ff20788afe523","cause":{"type":"version_conflict_engine_exception","reason":"[9ba16cd9a4949e4b5e1ff20788afe523]: version conflict, required seqNo [245344886], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a88236f3d2c2f9767f2f40c2cb2698ef","cause":{"type":"version_conflict_engine_exception","reason":"[a88236f3d2c2f9767f2f40c2cb2698ef]: version conflict, required seqNo [245344887], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"7e789235882b9c7095f342b15cb07660","cause":{"type":"version_conflict_engine_exception","reason":"[7e789235882b9c7095f342b15cb07660]: version conflict, required seqNo [245344888], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"b31f41feaf5f758560e61c9404c42e61","cause":{"type":"version_conflict_engine_exception","reason":"[b31f41feaf5f758560e61c9404c42e61]: version conflict, required seqNo [245344889], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"b57fbf944cad1f1a5371740f103af985","cause":{"type":"version_conflict_engine_exception","reason":"[b57fbf944cad1f1a5371740f103af985]: version conflict, required seqNo [245344890], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"5979db0efadb7ee067b1637ff2b4c8b4","cause":{"type":"version_conflict_engine_exception","reason":"[5979db0efadb7ee067b1637ff2b4c8b4]: version conflict, required seqNo [245344867], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"43ce1018816e4d38238426e899dbcd49","cause":{"type":"version_conflict_engine_exception","reason":"[43ce1018816e4d38238426e899dbcd49]: version conflict, required seqNo [245344868], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"fba7dc5808ec854fea57559e35e781b4","cause":{"type":"version_conflict_engine_exception","reason":"[fba7dc5808ec854fea57559e35e781b4]: version conflict, required seqNo [245344869], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"34bbd33bc51ce112c85408520111409e","cause":{"type":"version_conflict_engine_exception","reason":"[34bbd33bc51ce112c85408520111409e]: version conflict, required seqNo [245344870], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"472f8de2bb476984eac9777e808ddded","cause":{"type":"version_conflict_engine_exception","reason":"[472f8de2bb476984eac9777e808ddded]: version conflict, required seqNo [245344871], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e54f31965b2c0fe5145f0a4287662897","cause":{"type":"version_conflict_engine_exception","reason":"[e54f31965b2c0fe5145f0a4287662897]: version conflict, required seqNo [245344872], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"590d055f16c49564f80c9e05374192c1","cause":{"type":"version_conflict_engine_exception","reason":"[590d055f16c49564f80c9e05374192c1]: version conflict, required seqNo [245344873], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f1c6a63ca2ae99c9823b0aba40e01ea4","cause":{"type":"version_conflict_engine_exception","reason":"[f1c6a63ca2ae99c9823b0aba40e01ea4]: version conflict, required seqNo [245344874], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d074cdc9cb54e5cf95a6ff1075aae01e","cause":{"type":"version_conflict_engine_exception","reason":"[d074cdc9cb54e5cf95a6ff1075aae01e]: version conflict, required seqNo [245344875], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"5385c877898d911757944d1f39295fd0","cause":{"type":"version_conflict_engine_exception","reason":"[5385c877898d911757944d1f39295fd0]: version conflict, required seqNo [245344876], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"6c47961ae1880ed3989c6ae3189cfaf2","cause":{"type":"version_conflict_engine_exception","reason":"[6c47961ae1880ed3989c6ae3189cfaf2]: version conflict, required seqNo [245344877], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e0835493d7a0f7e3e62a40ba85f78baf","cause":{"type":"version_conflict_engine_exception","reason":"[e0835493d7a0f7e3e62a40ba85f78baf]: version conflict, required seqNo [245344878], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a6e7f907ee9d672875ee0b20be4ceb75","cause":{"type":"version_conflict_engine_exception","reason":"[a6e7f907ee9d672875ee0b20be4ceb75]: version conflict, required seqNo [245344879], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"3d188dcfadaf681d47f5dd5f02bacf78","cause":{"type":"version_conflict_engine_exception","reason":"[3d188dcfadaf681d47f5dd5f02bacf78]: version conflict, required seqNo [245344880], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"82cc73ddcda49782f83b1ce097cbaf81","cause":{"type":"version_conflict_engine_exception","reason":"[82cc73ddcda49782f83b1ce097cbaf81]: version conflict, required seqNo [245344881], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"74dc9c74276bbcf6e082e917fa8336f6","cause":{"type":"version_conflict_engine_exception","reason":"[74dc9c74276bbcf6e082e917fa8336f6]: version conflict, required seqNo [245344967], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"4905c93388db640cef8d885c995a3c08","cause":{"type":"version_conflict_engine_exception","reason":"[4905c93388db640cef8d885c995a3c08]: version conflict, required seqNo [245344968], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"bb73ac4c05326e3aeec88ef9a7a3d875","cause":{"type":"version_conflict_engine_exception","reason":"[bb73ac4c05326e3aeec88ef9a7a3d875]: version conflict, required seqNo [245344969], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"1560eba1b58d16eadc419e9d3799b747","cause":{"type":"version_conflict_engine_exception","reason":"[1560eba1b58d16eadc419e9d3799b747]: version conflict, required seqNo [245344970], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"080ed4db4bbb615b85b18923b2a5c164","cause":{"type":"version_conflict_engine_exception","reason":"[080ed4db4bbb615b85b18923b2a5c164]: version conflict, required seqNo [245344971], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f0bf1a7b8009be2a63d6918fba3ffdb5","cause":{"type":"version_conflict_engine_exception","reason":"[f0bf1a7b8009be2a63d6918fba3ffdb5]: version conflict, required seqNo [245344972], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"c02f124d9f91c7e0ece8c44736834e6a","cause":{"type":"version_conflict_engine_exception","reason":"[c02f124d9f91c7e0ece8c44736834e6a]: version conflict, required seqNo [245344973], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"ab03734b9727ce7ad6bed317e75ec7ed","cause":{"type":"version_conflict_engine_exception","reason":"[ab03734b9727ce7ad6bed317e75ec7ed]: version conflict, required seqNo [245344974], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"761b671d501bef4259b451c7c0ca46d0","cause":{"type":"version_conflict_engine_exception","reason":"[761b671d501bef4259b451c7c0ca46d0]: version conflict, required seqNo [245344975], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"1316c59b832e2b8cf38a931b7095c8d4","cause":{"type":"version_conflict_engine_exception","reason":"[1316c59b832e2b8cf38a931b7095c8d4]: version conflict, required seqNo [245344976], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"f8b510dbb5ade2ad6ca09af86e166468","cause":{"type":"version_conflict_engine_exception","reason":"[f8b510dbb5ade2ad6ca09af86e166468]: version conflict, required seqNo [245344957], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"e88350a4e0326eefd143dc2ed497e8f2","cause":{"type":"version_conflict_engine_exception","reason":"[e88350a4e0326eefd143dc2ed497e8f2]: version conflict, required seqNo [245344958], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"7eac14bfc9cffb6cae1dd3c8b53f79d7","cause":{"type":"version_conflict_engine_exception","reason":"[7eac14bfc9cffb6cae1dd3c8b53f79d7]: version conflict, required seqNo [245344959], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a45ce8eaf43d4d0cd409b24b9dce5fbc","cause":{"type":"version_conflict_engine_exception","reason":"[a45ce8eaf43d4d0cd409b24b9dce5fbc]: version conflict, required seqNo [245344960], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"a8b7f4989dc9a65952812d3eb2642efa","cause":{"type":"version_conflict_engine_exception","reason":"[a8b7f4989dc9a65952812d3eb2642efa]: version conflict, required seqNo [245344961], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"381aa10ca8ad8b6fba5a0a85f2ef90c8","cause":{"type":"version_conflict_engine_exception","reason":"[381aa10ca8ad8b6fba5a0a85f2ef90c8]: version conflict, required seqNo [245344962], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"d7fcd4282b34b6eec070cec0d30ccea7","cause":{"type":"version_conflict_engine_exception","reason":"[d7fcd4282b34b6eec070cec0d30ccea7]: version conflict, required seqNo [245344963], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"4f31a8b4a956e8fe7c4385fd4aaba4f9","cause":{"type":"version_conflict_engine_exception","reason":"[4f31a8b4a956e8fe7c4385fd4aaba4f9]: version conflict, required seqNo [245344964], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"bdd23417af1d6f4211a8266cb7e1def8","cause":{"type":"version_conflict_engine_exception","reason":"[bdd23417af1d6f4211a8266cb7e1def8]: version conflict, required seqNo [245344965], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"896b2bdf545bf8d904c900035b1b507b","cause":{"type":"version_conflict_engine_exception","reason":"[896b2bdf545bf8d904c900035b1b507b]: version conflict, required seqNo [245344966], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"402c122e5741d88db1e6d7e5b06639f7","cause":{"type":"version_conflict_engine_exception","reason":"[402c122e5741d88db1e6d7e5b06639f7]: version conflict, required seqNo [245345561], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"daf2b35adfbc2407e14e0f7e8a637586","cause":{"type":"version_conflict_engine_exception","reason":"[daf2b35adfbc2407e14e0f7e8a637586]: version conflict, required seqNo [245345562], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"63c1a8c3c619a78bac98fc5de6bc99fa","cause":{"type":"version_conflict_engine_exception","reason":"[63c1a8c3c619a78bac98fc5de6bc99fa]: version conflict, required seqNo [245345563], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"919f74a8c8dc44c3f77245b4f369a8bf","cause":{"type":"version_conflict_engine_exception","reason":"[919f74a8c8dc44c3f77245b4f369a8bf]: version conflict, required seqNo [245345564], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"6d6cdf1f77efb27291c07c15563bf86b","cause":{"type":"version_conflict_engine_exception","reason":"[6d6cdf1f77efb27291c07c15563bf86b]: version conflict, required seqNo [245345565], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"dc072b534c53d8086bec5cad4a99ee09","cause":{"type":"version_conflict_engine_exception","reason":"[dc072b534c53d8086bec5cad4a99ee09]: version conflict, required seqNo [245345566], primary term [1]. but no document was found","index_uuid":"cni6CVDNRruBZnwFeXWkEw","shard":"0","index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"},"status":409},{"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","type":"_doc","id":"23678acc4a3d93282e44024d6dda262d","cause":

jkhenning commented 2 years ago

Actually, deleting documents might take a long time since it might require reindexing of large indices. However, these errors have no significance (you were quite correct about their nature) -we plan to change the way we delete document to make sure they don't appear in the future, but it shouldn't have any impact on your issue right now.

ES might be slow due to deletions, but that should be temporary and pass after some time

mctigger commented 2 years ago

Can you tell me whether it is normal that my machine will not have high CPU utilization? I have a 32 core machine and many ES processes seem to be running, but I have only some utilization on 2 processes and nearly no utilization overall.

mctigger commented 2 years ago

Now the server hangs again. The WebUI and looking at experiments is fine, but experiments on clearml-agents show as aborted, but the agents seem to be still running their experiments. Resetting experiments gives TransportErro 429. I am not able to shut down the container:

Stopping clearml-agent-services ... 
Stopping clearml-webserver      ... 
Stopping clearml-apiserver      ... error
Stopping clearml-fileserver     ... error
Stopping clearml-redis          ... error
Stopping clearml-mongo          ... error
Stopping clearml-elastic        ... error

ERROR: for clearml-agent-services  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)

ERROR: for clearml-webserver  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

[p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][i3s-rXwBsLuAVx5lT938], source[_na_]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 12 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5620f95c[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 495782]]
[2021-10-23 13:01:54,568] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
[2021-10-23 13:01:54,578] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 4ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10', '_type': '_doc', '_id': 'jHs-rXwBsLuAVx5lUN0R', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [2232059][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][jHs-rXwBsLuAVx5lUN0R], source[_na_]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 12 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5620f95c[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 495782]]
[2021-10-23 13:01:56,293] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms

sudo docker logs clearml-elastic also seems to hang.

jkhenning commented 2 years ago

Can you tell me whether it is normal that my machine will not have high CPU utilization?

Well, normally you should have low utilization - it should only increase when doing heavy reindexing. BTW, even though you have many cores, I don't think you should see many ES processes (java processes).

Now the server hangs again. The WebUI and looking at experiments is fine, but experiments on clearml-agents show as aborted, but the agents seem to be still running their experiments

I'm not sure I understand how does the server hang is evident - if the UI is fine and you can see experiment details (including metrics and logs) it would indicate both the api service, mongodb, redis and ES are working and are not stuck...

mctigger commented 2 years ago

Actually, deleting documents might take a long time since it might require reindexing of large indices. However, these errors have no significance (you were quite correct about their nature) -we plan to change the way we delete document to make sure they don't appear in the future, but it shouldn't have any impact on your issue right now.

ES might be slow due to deletions, but that should be temporary and pass after some time

What is the "maximum" intended size of experiments? I currently log everything which leads to ~1 million scalars per metric per experiment. Could it be that this is just too much for ES?

jkhenning commented 2 years ago

Well, I think the question should be "is it too much for elastic in the current configuration"? Well, I'm not sure - single instance ES can easily handle multiple indices with sizes reaching up to 40GB per shard (single node configurations usually allocate 1 shard per index) - what is the size of your indices? How many do you have?

mctigger commented 2 years ago

How can I find out? I use the default server config!

jkhenning commented 2 years ago

It's a matter of how much data was accumulated. From within the server, you can just use a cURL command to get the list of indices and their sizes: curl http://localhost:9200/_cat/indices?v=true You can share the result here.

mctigger commented 2 years ago

curl http://localhost:9200/_cat/indices?v=true

gives curl: (7) Failed to connect to localhost port 9200: Connection refused

with the default config.

jkhenning commented 2 years ago

Oh, this means the 9200 port is not exposed outside of the docker network. Just exec into the container and do it from there:

sudo docker exec -it clearml-elastic /bin/bash
curl http://localhost:9200/_cat/indices?v=true

mctigger commented 2 years ago

Now the errors happen again when I try to reset an experiment.

General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11', '_type': '_doc', '_id': '68kt8HwBbR9o1KTiQXXr', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [2347936][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11][_doc][68kt8HwBbR9o1KTiQXXr], source[_na_]}], target allocation id: F31o2Z0rSDigJV7nDYj4pA, primary term: 3 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@556e6e09[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 1085459]]

I also tried:

curl http://localhost:9200/_cat/indices?v=true

Worked when I tested it when no error happened, but now it just hangs...

jkhenning commented 2 years ago

Are you sure this is the response for curl http://localhost:9200/_cat/indices?v=true? It looks like an error related to indexing, not category listing...

mctigger commented 2 years ago

Sorry, I was not clear. I cannot run the command, because it just hangs. I also run a server instance on a different machine now to make sure it is not a problem with the machine!

jkhenning commented 2 years ago

@mctigger are you sure you're running the command from within the docker container? did you do sudo docker exec -it clearml-elastic /bin/bash before and issuing this command?

mctigger commented 2 years ago

Yes. I restarted the machine and now I get

health status index                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10        fxkBOhKtT7yRZFU2UOIG4Q   1   1    2136853            0    111.9mb        111.9mb
yellow open   queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11        zNL_qi3QSXaY_qNZIeYw3g   1   1     520100            0     27.9mb         27.9mb
yellow open   events-log-d1bd92a3b039400cbafc60a7a5b1e52b                   ycwVPk_8TLWz08dkk5qaIQ   1   1     137548       274124    157.2mb        157.2mb
yellow open   events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b aSO_YPE9SCy3cFTZvpd3Aw   1   1   34018559      6226832      4.4gb          4.4gb
yellow open   events-plot-                                                  ZPEyc7hyS9GZdpSbK3D9WQ   1   1        173            0    444.8kb        444.8kb
yellow open   worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10         rdkiC2iEQ6qeowVKugz_GA   1   1    2187632            0    139.3mb        139.3mb
yellow open   worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11         IcHTR-8MQ9uT6tRmKYQp3g   1   1     743404            0     49.2mb         49.2mb
yellow open   events-training_debug_image-                                  VDfyV2F1Q-6c-D-GVsCBVA   1   1        189            0     78.2kb         78.2kb
yellow open   events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b  Tt9riC2PQ2yk3OHZ6IlEsQ   1   1       9902         2740      4.1mb          4.1mb
yellow open   events-log-                                                   1KAqVt3rSmCVuiLzfRgqOQ   1   1       2126            0      545kb          545kb
yellow open   events-plot-d1bd92a3b039400cbafc60a7a5b1e52b                  YcH5IavKSJiauxWqGE5Tvw   1   1        257           78      4.6mb          4.6mb
yellow open   events-training_stats_scalar-                                 nT4PSjqfTTSKXsu9igTmqw   1   1       7174            0    979.4kb        979.4kb

jkhenning commented 2 years ago

And how many GB are you allocating for ES in the ES settings?

mctigger commented 2 years ago

I tried nearly any configuration for ES. Default, 2GB, 16GB, 32GB. I also can confirm that on my second machine the same issues happen, however it does not seem to be related to a delete operation. I use 8 clearml-agents. Each generates ~1 million scalar metric datapoints per hour. My machines are all 32 core, 128GB RAM. I also run clearml-agents on the server-machine, since the server does not use the GPUs. The machine is not reachable via ssh anymore. It just hangs. Never had this problem without a clearml-server. So it must be related to clearml-server somehow. However, I can still use the web-interface, just no operation other than viewing will work.

jkhenning commented 2 years ago

The machine is not reachable via ssh anymore. It just hangs.

This kind of behavior in linux usually happens when you've reached 100% disk usage - sounds more system-related than specifically ClearML-server related. If you're storing lots of data and the ClearML server database grow, that will of course increase disk usage (server logs will, as well) - that's the sort of thing that should always be monitored.

mctigger commented 2 years ago

The server starts to react again (at least I can login now, however deleting experiments/resetting etc does still not work). I get TransportError 429 again. The clearml-agents are still running, but not listed in the WebUI and all the experiments have not been updated for hours. But I can see the processes running on the machine. There is plenty disk space available. 64GB RAM is free, too. I used the default config ES_JAVA_OPTS: -Xms2g -Xmx2g for ES. To me it seems like to be a problem with clearml-server or something is misconfigured.

I have another idea: I try to create an experiment similar to mine metric-wise and will try to crash the demo-server. Then we know for sure whether it is machine-related or not.

Below you can see the used disk space and the space used for /opt/clearml:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  3,2M   13G   1% /run
/dev/nvme0n1p3  1,8T  262G  1,5T  16% /
tmpfs            63G  1,6G   62G   3% /dev/shm
tmpfs           5,0M  4,0K  5,0M   1% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/loop1       56M   56M     0 100% /snap/core18/2128
/dev/loop4      219M  219M     0 100% /snap/gnome-3-34-1804/72
/dev/loop3       66M   66M     0 100% /snap/gtk-common-themes/1515
/dev/nvme0n1p2  953M  7,9M  945M   1% /boot/efi
/dev/loop5      219M  219M     0 100% /snap/gnome-3-34-1804/66
/dev/loop8       51M   51M     0 100% /snap/snap-store/547
/dev/loop9       51M   51M     0 100% /snap/snap-store/542
tmpfs            13G   20K   13G   1% /run/user/125
tmpfs            13G  4,0K   13G   1% /run/user/1001
/dev/loop7      128K  128K     0 100% /snap/bare/5
/dev/loop11      66M   66M     0 100% /snap/gtk-common-themes/1519
/dev/loop6       33M   33M     0 100% /snap/snapd/13640
/dev/loop10      56M   56M     0 100% /snap/core18/2246
tmpfs            13G  4,0K   13G   1% /run/user/1000
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/14cceeb46f8c41d266c7a7e9d9a06cc6eb3eb8dab282589a58ebb2dec440c48b/merged
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/db7bae9248cb8b421102ccd3cb2133786dc5cbccf9fa3a012e8bf82502fa41b7/merged
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/0064b73127a6473a9737f7d3818bc70fbdeb02d91ed554b1bf35c74cf519a523/merged
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/3763f000a8b2be8574ec9feef7f451d221bd22a753262e9e48869bc96f1c90ed/merged
shm              64M     0   64M   0% /var/lib/docker/containers/5dc57cfaafdd0069dedccc8278eaa9d4c7c83f9cfa0df0f7994f255b42f24355/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/f5020c175f3b4207d260e4c62c5ac9c8341ebe657970d7b7f5ededa30b75f200/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/b22a2c1ace9cb74072bd3e6b675fef78b05962e3d3e718af963f0cfe42599647/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/269cd814a68f7befe8a83b67ed5d6c4e41cc0acb4448d15d25750272015d4f9e/mounts/shm
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/4e316d9d5f286c4e3269e532061094fc1b5614fbe5402e8d1b66c437c3827114/merged
shm              64M     0   64M   0% /var/lib/docker/containers/4134d55d2c179cc3851038c7685090e6dbf8acce59f9f759c0be34259b66f901/mounts/shm
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/43d5b9329ddef32d36a080752aaa4dcc6ab1d255f4a370c778a0e19c6602e812/merged
shm              64M     0   64M   0% /var/lib/docker/containers/254c060b44590dc7d41897d4fe9cb8a813db8dbffde1ae3da18d5b198a33859c/mounts/shm
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/4a4737d7fc0942504ce4e1b3b4e79f9ac23cd2cf4120cd45e429666e92ed1914/merged
shm              64M     0   64M   0% /var/lib/docker/containers/67b625acadaec5c6c0b10980a9f56bb7a412be13dd9a33a35b68d96b61590e19/mounts/shm
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/dd11d91cd435910fe54fdd53263b519d6a8438d47bbab8311f8f2f58345c7009/merged
overlay         1,8T  262G  1,5T  16% /var/lib/docker/overlay2/518a8c442bb6aeb6228cf8ce5dee0ef2dfa9c91797c71235a2a10c3c59ddbba1/merged
/dev/loop0       43M   43M     0 100% /snap/snapd/13831

8,0K    ./config
32K ./agent
22G ./data
38M ./logs
22G .

jkhenning commented 2 years ago

Basically TransportError 429 from ES means that ES is not able to keep up with the number of requests and that the internal queues are filling up - I think you should use a value higher than ES_JAVA_OPTS: -Xms2g -Xmx2g. We run many servers with 8GB RAM and 4 CPUs (with much larger indices, so ES works considerably harder there), and I've never seen such a behavior - I find it very hard to believe this is a ClearML Server configuration issue if you're using the default configuration in most cases... Perhaps you are experiencing disk stability issues?

mctigger commented 2 years ago

Mhhm, super hard to debug. I have tried on 2 different machines with the same config. I find it highly unlikely that anything is wrong with the hardware. I also tried way higher value for the ES_JAVA_OPTS as I wrote before, no differenz. And it is just a simple Ubuntu 20.04 installation without any changes aside from clearml-agent and clearml-server.

But nobody writes here except me, so for now I have to assume it is just me and there must be something wrong with either my clearml-server machines or the way I use clearml.

jkhenning commented 2 years ago

It's really a conundrum... and super-frustrating - I'd really like to get to the bottom of this 🙁 Can you maybe try to list everything you can regarding the environment/setup again? I feel like we're missing something here.

mctigger commented 2 years ago

So I think I have found the reason for the instability in my case. User error most probably (shame on me). I run clearml-agents on the same machine as my clearml-server. The agents are run in docker mode with docker arguments "--memory=48g", "--shm-size=48g". I assumed that if the container tries to use more than 48g of memory, it will just stop and not influence the server. However, these settings still allow the containers to use SWAP! My machine are configured with very low amount of swap since it never should be used. Most probably the swap being full lead to some kind of problem. (Even though there should have been free RAM still). I added "--memory-swap=48g" to the docker arguments, which makes the containers not allocated any swap (see docker doc). Now the clearml-server runs stable, I only get the known errors when I delete large experiments, because ES is slow, but this is to be expected.

Just for completeness, because I do not want to change my setup to validate my reasoning: I also switched from tensorboardx to pytorch.util.tensorboard. This is the only other change I did.

Sorry for wasting your time, but I am still happy the server seems to run stable now :)

EDIT: Unfortunately, this was not the solution. Only thing I can say for sure is that without clearml-agent on the same machine, it works fine.

luvwinnie commented 7 months ago

I'm facing the same problem with this. Is it mean by adding swap memory can fix this problem? If so, which services I should add to? the apiserver of the clearml?

clearml.log - WARNING - failed logging task to backend (41 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)

mctigger commented 7 months ago

I found that this problem only occurred when other docker container than the clearml-server ones were running. Try running only clearml-server docker compose with nothing else and monitor if it helps :/

allegroai / clearml-server

Large experiments cause instability #87