Closed: mctigger closed this issue 2 years ago
Hi @mctigger,
This sounds like an indexing issue in the ES service - probably an operation that takes a long time and therefore causes a timeout on the server side.
When it happens, can you get some logs from the apiserver and elasticsearch services? Use sudo docker logs clearml-apiserver
and sudo docker logs clearml-elastic
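To narrow those logs down, something like this may help (a sketch - the filter pattern and the --since window are just suggestions, and the container names assume the default clearml-server docker-compose):

```shell
# Filter helper: keep only warnings/errors/tracebacks from noisy service logs.
log_errors() { grep -iE 'warning|error|traceback'; }

# Usage against the clearml containers (default docker-compose names):
#   sudo docker logs --since 24h clearml-apiserver 2>&1 | log_errors | tail -n 50
#   sudo docker logs --since 24h clearml-elastic   2>&1 | log_errors | tail -n 50
```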
The following experiment failed to reset
sudo docker logs clearml-apiserver
These are the only errors in this log, but they occurred about 2 hours before the failed reset.
[2021-09-22 23:57:58,758] [9] [WARNING] [elasticsearch] POST http://elasticsearch:9200/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query?refresh=true [status:N/A request:60.050s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
response.begin()
File "/usr/lib64/python3.6/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib64/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
method, url, body, retries=Retry(False), headers=request_headers, **kw
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 770, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60)
[2021-09-22 23:57:58,791] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 208ms
[2021-09-22 23:57:58,823] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 3ms
[2021-09-22 23:57:58,824] [9] [WARNING] [elasticsearch] POST http://elasticsearch:9200/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query?refresh=true [status:409 request:0.065s]
sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-09-22T10:17:25,047Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/nvme0n1p2)]], net usable_space [1.3tb], net total_space [1.7tb], types [ext4]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,048Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [15.6gb], compressed ordinary object pointers [true]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [JUV7hB8TQKyhmdMIk_TmaA], cluster name [clearml]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.11.0-27-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,142Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-10421376390742623811, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms16g, -Xmx16g, -XX:MaxDirectMemorySize=8589934592, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,964Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [aggs-matrix-stats]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,964Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [analysis-common]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [flattened]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [frozen-indices]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-common]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-geoip]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-user-agent]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-expression]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-mustache]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-painless]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [mapper-extras]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,965Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [parent-join]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [percolator]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [rank-eval]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [reindex]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [repository-url]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [search-business-rules]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [spatial]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transform]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,966Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transport-netty4]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [vectors]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-analytics]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ccr]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-core]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-deprecation]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-enrich]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-graph]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ilm]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-logstash]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,967Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ml]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-monitoring]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-rollup]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-security]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-sql]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-voting-only-node]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-watcher]" }
{"type": "server", "timestamp": "2021-09-22T10:17:25,968Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "no plugins loaded" }
{"type": "deprecation", "timestamp": "2021-09-22T10:17:27,055Z", "level": "WARN", "component": "o.e.d.c.s.Settings", "cluster.name": "clearml", "node.name": "clearml", "message": "[discovery.zen.minimum_master_nodes] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version." }
{"type": "server", "timestamp": "2021-09-22T10:17:28,347Z", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "clearml", "node.name": "clearml", "message": "[controller/127] [Main.cc@110] controller (64 bit): Version 7.6.2 (Build e06ef9d86d5332) Copyright (c) 2020 Elasticsearch BV" }
{"type": "server", "timestamp": "2021-09-22T10:17:28,571Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "clearml", "node.name": "clearml", "message": "using discovery type [single-node] and seed hosts providers [settings]" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,028Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "initialized" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,028Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "starting ..." }
{"type": "server", "timestamp": "2021-09-22T10:17:29,282Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.22.0.4:9300}, bound_addresses {0.0.0.0:9300}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,485Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "clearml", "node.name": "clearml", "message": "cluster UUID [hK8xFPPDSqqlBHbO-CN4kw]" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,669Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "clearml", "node.name": "clearml", "message": "elected-as-master ([1] nodes joined)[{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 4, version: 65, delta: master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20}]}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,720Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "clearml", "node.name": "clearml", "message": "master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{BLuP1kWDTZib1rI5NW35iw}{172.22.0.4}{172.22.0.4:9300}{dilm}{ml.machine_memory=134994567168, xpack.installed=true, ml.max_open_jobs=20}]}, term: 4, version: 65, reason: Publication{term=4, version=65}" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,746Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.22.0.4:9200}, bound_addresses {0.0.0.0:9200}", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,746Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "started", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,831Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "clearml", "node.name": "clearml", "message": "license [3694cc4d-2664-43fc-8763-57b48da7f0c9] mode [basic] - valid", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:29,832Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "clearml", "node.name": "clearml", "message": "recovered [10] indices into cluster_state", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:31,639Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "clearml", "node.name": "clearml", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,335Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events] for index patterns [events-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,362Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_log] for index patterns [events-log-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,389Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_plot] for index patterns [events-plot-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,415Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_training_debug_image] for index patterns [events-training_debug_image-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,449Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [worker_stats] for index patterns [worker_stats_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": "2021-09-22T10:17:54,473Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [queue_metrics] for index patterns [queue_metrics_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
Just tried resetting again and it worked. So just like with deletion, the errors seem to fix themselves after a while...
This looks like it takes time for ES to reindex and remove the deleted documents from the index, during which time any requests from the server may result in these errors... I think it's just a matter of hiding these errors as reindex times are load/performance related...
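Since the errors are transient, one client-side workaround while they last is to simply retry the failing operation after a short wait. A minimal sketch (retry here is a hypothetical helper, not part of ClearML or its tooling):

```shell
# Retry a command with exponential backoff -- hypothetical helper for transient
# ES timeouts; first argument is max attempts, the rest is the command to run.
retry() {
  attempts=$1; shift
  delay=1
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0          # success: stop retrying
    sleep "$delay"
    delay=$((delay * 2))      # back off: 1s, 2s, 4s, ...
    i=$((i + 1))
  done
  return 1                    # all attempts failed
}

# Usage, e.g. against the apiserver health endpoint on the default port 8008:
#   retry 5 curl -fsS http://localhost:8008/debug.ping
```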
Hi @mctigger, it seems that massive delete operations create a heavy load on Elasticsearch. I would try increasing the memory allocated to the Elasticsearch JVM if you have not done it already. The default setting that we have in the docker-compose for the elasticsearch service is pretty conservative: ES_JAVA_OPTS: -Xms2g -Xmx2g, which allows ES to use up to 2GB of RAM. I would increase these numbers to up to half of your available RAM (but not more than 32GB).
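For reference, the relevant fragment of docker-compose.yml would look something like this (the 16g value is just an example for a machine with plenty of RAM; adjust to your hardware):

```yaml
services:
  elasticsearch:
    environment:
      # was: -Xms2g -Xmx2g
      # keep Xms == Xmx, at most half of physical RAM, and below 32g
      # (above ~32g the JVM loses compressed object pointers)
      ES_JAVA_OPTS: -Xms16g -Xmx16g
```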
I already had it set to use 16G of RAM. Everything else is set to default.
After restarting the server because it became unresponsive, I cannot perform any operations on my experiments. However, viewing them is still possible.
Here is an example error message. I tried to enqueue a task:
General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10', '_type': '_doc', '_id': 'T-YUmHwB6OL9GSKaKE15', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [1079239][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][T-YUmHwB6OL9GSKaKE15], source[na]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 4 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@310e6893[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 409896]]
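The status 429 in this message indicates that ES rejected the write because its write thread pool was saturated (pool size 64, all 64 threads active, queue capacity 200 full), rather than a permanent data error. One way to check whether the pool is still rejecting work is the _cat/thread_pool API; the flag_rejections helper below is hypothetical, added only to make the output easy to read:

```shell
# Flag rejected writes in ES thread-pool stats (column 4 = "rejected" count).
# "$4 + 0" forces a numeric comparison so the header line is skipped.
flag_rejections() { awk '$4 + 0 > 0 { print "write pool rejected " $4 " request(s) -- ES cannot keep up" }'; }

# Usage (assumes ES is published on localhost:9200, the default docker-compose mapping):
#   curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=name,active,queue,rejected' | flag_rejections
```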
Or when trying to reset a task: https://gist.github.com/mctigger/b4b5791b758c4ec651f9fb180be0f19e
Hi @mctigger,
This seems like an ES index error - can you share the ES log (using sudo docker logs clearml-elastic)?
Here it is (I increased memory to 64GB for elastic):
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-10-19T11:48:05,276Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/nvme0n1p2)]], net usable_space [1tb], net total_space [1.7tb], types [ext4]" }
{"type": "server", "timestamp": "2021-10-19T11:48:05,278Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [63.6gb], compressed ordinary object pointers [false]" }
{"type": "server", "timestamp": "2021-10-19T11:48:05,363Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [JUV7hB8TQKyhmdMIk_TmaA], cluster name [clearml]" }
{"type": "server", "timestamp": "2021-10-19T11:48:05,363Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.11.0-37-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" }
{"type": "server", "timestamp": "2021-10-19T11:48:05,364Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2021-10-19T11:48:05,364Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-1244715195998405414, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms64g, -Xmx64g, -XX:MaxDirectMemorySize=34359738368, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,348Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [aggs-matrix-stats]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [analysis-common]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [flattened]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [frozen-indices]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-common]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-geoip]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [ingest-user-agent]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-expression]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,349Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-mustache]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [lang-painless]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [mapper-extras]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [parent-join]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [percolator]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [rank-eval]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [reindex]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [repository-url]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [search-business-rules]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [spatial]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,350Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transform]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [transport-netty4]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [vectors]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-analytics]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ccr]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-core]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-deprecation]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-enrich]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-graph]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ilm]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,351Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-logstash]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-ml]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-monitoring]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-rollup]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-security]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-sql]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-voting-only-node]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "loaded module [x-pack-watcher]" }
{"type": "server", "timestamp": "2021-10-19T11:48:06,352Z", "level": "INFO", "component": "o.e.p.PluginsService", "cluster.name": "clearml", "node.name": "clearml", "message": "no plugins loaded" }
{"type": "deprecation", "timestamp": "2021-10-19T11:48:08,877Z", "level": "WARN", "component": "o.e.d.c.s.Settings", "cluster.name": "clearml", "node.name": "clearml", "message": "[discovery.zen.minimum_master_nodes] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version." }
{"type": "server", "timestamp": "2021-10-19T11:48:11,496Z", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "clearml", "node.name": "clearml", "message": "[controller/128] [Main.cc@110] controller (64 bit): Version 7.6.2 (Build e06ef9d86d5332) Copyright (c) 2020 Elasticsearch BV" }
{"type": "server", "timestamp": "2021-10-19T11:48:11,847Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "clearml", "node.name": "clearml", "message": "using discovery type [single-node] and seed hosts providers [settings]" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,245Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "initialized" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,246Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "starting ..." }
{"type": "server", "timestamp": "2021-10-19T11:48:12,356Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.20.0.3:9300}, bound_addresses {0.0.0.0:9300}" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,536Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "clearml", "node.name": "clearml", "message": "cluster UUID [hK8xFPPDSqqlBHbO-CN4kw]" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,586Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "clearml", "node.name": "clearml", "message": "elected-as-master ([1] nodes joined)[{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 11, version: 171, delta: master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20}]}" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,639Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "clearml", "node.name": "clearml", "message": "master node changed {previous [], current [{clearml}{JUV7hB8TQKyhmdMIk_TmaA}{79M0I9VCTeO0x5AVIfNuAA}{172.20.0.3}{172.20.0.3:9300}{dilm}{ml.machine_memory=134927069184, xpack.installed=true, ml.max_open_jobs=20}]}, term: 11, version: 171, reason: Publication{term=11, version=171}" }
{"type": "server", "timestamp": "2021-10-19T11:48:12,662Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "clearml", "node.name": "clearml", "message": "publish_address {172.20.0.3:9200}, bound_addresses {0.0.0.0:9200}", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
{"type": "server", "timestamp": 
"2021-10-19T11:48:12,662Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "started", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:12,735Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "clearml", "node.name": "clearml", "message": "license [3694cc4d-2664-43fc-8763-57b48da7f0c9] mode [basic] - valid", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:12,736Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "clearml", "node.name": "clearml", "message": "recovered [12] indices into cluster_state", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:14,719Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "clearml", "node.name": "clearml", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]]]).", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,879Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events] for index patterns [events-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,911Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_log] for index patterns [events-log-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,939Z", 
"level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_plot] for index patterns [events-plot-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,965Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [events_training_debug_image] for index patterns [events-training_debug_image-*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:24,999Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [worker_stats] for index patterns [worker_stats_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:48:25,023Z", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "cluster.name": "clearml", "node.name": "clearml", "message": "adding template [queue_metrics] for index patterns [queue_metrics_*]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:37,352Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][265] overhead, spent [368ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:40,512Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][268] overhead, spent [371ms] collecting in the last [1.1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } 
{"type": "server", "timestamp": "2021-10-19T11:52:44,514Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][272] overhead, spent [522ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:47,907Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][275] overhead, spent [504ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:51,908Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][279] overhead, spent [554ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:55,279Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][282] overhead, spent [430ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:52:59,280Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][286] overhead, spent [405ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:02,669Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][289] overhead, spent [446ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", 
"timestamp": "2021-10-19T11:53:06,670Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][293] overhead, spent [419ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:09,984Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][296] overhead, spent [452ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:13,985Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][300] overhead, spent [460ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:17,191Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][303] overhead, spent [445ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:20,484Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][306] overhead, spent [405ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:23,816Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][309] overhead, spent [414ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": 
"2021-10-19T11:53:27,132Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][312] overhead, spent [377ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:31,133Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][316] overhead, spent [410ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:34,134Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][319] overhead, spent [276ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:41,445Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][326] overhead, spent [308ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:44,446Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][329] overhead, spent [277ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:47,516Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][332] overhead, spent [306ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:51,517Z", 
"level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][336] overhead, spent [421ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:53:56,518Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][341] overhead, spent [256ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:54:13,716Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][358] overhead, spent [293ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:55:56,959Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][461] overhead, spent [352ms] collecting in the last [1.1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:02,253Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][466] overhead, spent [428ms] collecting in the last [1.2s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:07,280Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][471] overhead, spent [447ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:12,299Z", "level": "INFO", 
"component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][476] overhead, spent [466ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:17,675Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][481] overhead, spent [449ms] collecting in the last [1.3s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:22,735Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][486] overhead, spent [443ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:29,737Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][493] overhead, spent [433ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:43,740Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][507] overhead, spent [382ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" } {"type": "server", "timestamp": "2021-10-19T11:56:54,742Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "clearml", "node.name": "clearml", "message": "[gc][518] overhead, spent [264ms] collecting in the last [1s]", "cluster.uuid": "hK8xFPPDSqqlBHbO-CN4kw", "node.id": "JUV7hB8TQKyhmdMIk_TmaA" }
I don't see any errors here, but the error should have appeared in this log...
From the timestamps it seems like this is only a very small part of the log; however, the command you provided only shows this part.
Perhaps you've restarted Elastic since? The command should output all of the logs...
Not really... I will create a fresh server, then I will not change anything until the error happens again and just send you all the logs.
Maybe my problems are related to Elasticsearch just being really slow? It seems I tried to delete an experiment and ES throws errors because some parts are already deleted while the deletion is not yet finished? Shouldn't such a delete be near instant? I have almost no CPU utilization, but many ES-related processes running!
Actually, deleting documents might take a long time, since it might require reindexing of large indices. However, these errors have no significance (you were quite correct about their nature) - we plan to change the way we delete documents to make sure they don't appear in the future, but it shouldn't have any impact on your issue right now.
ES might be slow due to deletions, but that should be temporary and pass after some time
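When ES is temporarily overloaded it rejects requests with HTTP 429, and clients are expected to back off and retry. A minimal, hypothetical retry helper (not ClearML's actual code, just a sketch of the pattern) could look like this:

```python
import time


def retry_with_backoff(fn, retries=5, base_delay=0.5, retryable=(429,)):
    """Call fn(); on an exception carrying a retryable HTTP status code,
    sleep with exponential backoff and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in retryable or attempt == retries - 1:
                raise
            # 0.5s, 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

The point is that a 429 is a transient "queue full" signal, not a data error, so waiting and retrying usually succeeds once ES catches up.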
Can you tell me whether it is normal that my machine does not have high CPU utilization? I have a 32-core machine and many ES processes seem to be running, but only 2 processes show some utilization and there is nearly no utilization overall.
Now the server hangs again. The WebUI and looking at experiments is fine, but experiments run by clearml-agents show as aborted, even though the agents seem to be still running their experiments. Resetting experiments gives TransportError 429. I am not able to shut down the container:
Stopping clearml-agent-services ...
Stopping clearml-webserver ...
Stopping clearml-apiserver ... error
Stopping clearml-fileserver ... error
Stopping clearml-redis ... error
Stopping clearml-mongo ... error
Stopping clearml-elastic ... error
ERROR: for clearml-agent-services UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: for clearml-webserver UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
[p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][i3s-rXwBsLuAVx5lT938], source[_na_]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 12 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5620f95c[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 495782]]
[2021-10-23 13:01:54,568] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
[2021-10-23 13:01:54,578] [9] [ERROR] [clearml.service_repo] Returned 500 for queues.get_next_task in 4ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10', '_type': '_doc', '_id': 'jHs-rXwBsLuAVx5lUN0R', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [2232059][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10][_doc][jHs-rXwBsLuAVx5lUN0R], source[_na_]}], target allocation id: guUOHYi2SH2FlFzbwzKCIQ, primary term: 12 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5620f95c[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 495782]]
[2021-10-23 13:01:56,293] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms
sudo docker logs clearml-elastic
also seems to hang.
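For context, the 429 above comes from ES's bounded write thread pool: the log shows pool size = 64 and queue capacity = 200, and once all threads are busy and the queue is full, further bulk writes are rejected on the spot. A toy sketch of that rejection behavior using Python's queue.Queue (a simplified analogy, not ES internals):

```python
import queue

WRITE_QUEUE_CAPACITY = 200  # matches "queue capacity = 200" in the log above


def submit(q, request):
    """Enqueue a write request; reject immediately when the queue is full,
    which is what ES surfaces to the client as HTTP 429."""
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False


q = queue.Queue(maxsize=WRITE_QUEUE_CAPACITY)
accepted = sum(submit(q, i) for i in range(250))
# only the first 200 requests fit; the remaining 50 are rejected
```

This is why the errors come and go: as soon as the backlog drains, writes are accepted again.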
Can you tell me whether it is normal that my machine does not have high CPU utilization?
Well, normally you should have low utilization - it should only increase when doing heavy reindexing. BTW, even though you have many cores, I don't think you should see many ES processes (java processes).
Now the server hangs again. The WebUI and looking at experiments is fine, but experiments on clearml-agents show as aborted, but the agents seem to be still running their experiments
I'm not sure I understand how the server hang is evident - if the UI is fine and you can see experiment details (including metrics and logs), that would indicate the API service, MongoDB, Redis, and ES are all working and not stuck...
Actually, deleting documents might take a long time since it might require reindexing of large indices. However, these errors have no significance (you were quite correct about their nature) -we plan to change the way we delete document to make sure they don't appear in the future, but it shouldn't have any impact on your issue right now.
ES might be slow due to deletions, but that should be temporary and pass after some time
What is the "maximum" intended size of experiments? I currently log everything which leads to ~1 million scalars per metric per experiment. Could it be that this is just too much for ES?
Well, I think the question should be "is it too much for Elastic in the current configuration?" I'm not sure - a single-instance ES can easily handle multiple indices with sizes reaching up to 40GB per shard (single-node configurations usually allocate 1 shard per index) - what is the size of your indices? How many do you have?
How can I find out? I use the default server config!
It's a matter of how much data was accumulated.
From within the server, you can just use a cURL command to get the list of indices and their sizes:
curl http://localhost:9200/_cat/indices?v=true
You can share the result here.
curl http://localhost:9200/_cat/indices?v=true
gives
curl: (7) Failed to connect to localhost port 9200: Connection refused
with the default config.
Oh, this means the 9200 port is not exposed outside of the docker network. Just exec into the container and do it from there:
sudo docker exec -it clearml-elastic /bin/bash
curl http://localhost:9200/_cat/indices?v=true
Now the errors happen again when I try to reset an experiment.
General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11', '_type': '_doc', '_id': '68kt8HwBbR9o1KTiQXXr', 'status': 429, 'error': {'type':..., extra_info=rejected execution of processing of [2347936][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11][_doc][68kt8HwBbR9o1KTiQXXr], source[_na_]}], target allocation id: F31o2Z0rSDigJV7nDYj4pA, primary term: 3 on EsThreadPoolExecutor[name = clearml/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@556e6e09[Running, pool size = 64, active threads = 64, queued tasks = 200, completed tasks = 1085459]]
I also tried:
It worked when I tested it while no error was occurring, but now it just hangs...
Are you sure this is the response for curl http://localhost:9200/_cat/indices?v=true
? It looks like an error related to indexing, not category listing...
Sorry, I was not clear. I cannot run the command, because it just hangs. I'm also running a server instance on a different machine now, to make sure it is not a problem with the machine!
@mctigger are you sure you're running the command from within the docker container? did you do sudo docker exec -it clearml-elastic /bin/bash
before issuing this command?
Yes. I restarted the machine and now I get
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 fxkBOhKtT7yRZFU2UOIG4Q 1 1 2136853 0 111.9mb 111.9mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 zNL_qi3QSXaY_qNZIeYw3g 1 1 520100 0 27.9mb 27.9mb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b ycwVPk_8TLWz08dkk5qaIQ 1 1 137548 274124 157.2mb 157.2mb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b aSO_YPE9SCy3cFTZvpd3Aw 1 1 34018559 6226832 4.4gb 4.4gb
yellow open events-plot- ZPEyc7hyS9GZdpSbK3D9WQ 1 1 173 0 444.8kb 444.8kb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 rdkiC2iEQ6qeowVKugz_GA 1 1 2187632 0 139.3mb 139.3mb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 IcHTR-8MQ9uT6tRmKYQp3g 1 1 743404 0 49.2mb 49.2mb
yellow open events-training_debug_image- VDfyV2F1Q-6c-D-GVsCBVA 1 1 189 0 78.2kb 78.2kb
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b Tt9riC2PQ2yk3OHZ6IlEsQ 1 1 9902 2740 4.1mb 4.1mb
yellow open events-log- 1KAqVt3rSmCVuiLzfRgqOQ 1 1 2126 0 545kb 545kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b YcH5IavKSJiauxWqGE5Tvw 1 1 257 78 4.6mb 4.6mb
yellow open events-training_stats_scalar- nT4PSjqfTTSKXsu9igTmqw 1 1 7174 0 979.4kb 979.4kb
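The store.size column of a _cat/indices listing can be summed to see how much data ES is holding overall. A small parsing sketch, using two rows copied from the output above (it assumes sizes are printed with kb/mb/gb suffixes, as here):

```python
# header plus two rows copied verbatim from the listing above
CAT_INDICES = """\
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 fxkBOhKtT7yRZFU2UOIG4Q 1 1 2136853 0 111.9mb 111.9mb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b aSO_YPE9SCy3cFTZvpd3Aw 1 1 34018559 6226832 4.4gb 4.4gb
"""

UNITS = {"kb": 1 / 1024.0, "mb": 1.0, "gb": 1024.0}


def size_mb(text):
    """Convert an ES _cat size string like '4.4gb' to megabytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(text)


def total_store_mb(cat_output):
    """Sum the last column (pri.store.size) of `_cat/indices` output."""
    total = 0.0
    for line in cat_output.strip().splitlines()[1:]:  # skip the header row
        total += size_mb(line.split()[-1])
    return total
```

For the full listing above this comes out to roughly 5GB, dominated by the events-training_stats_scalar index at 4.4GB.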
And how many GB are you allocating for ES in the ES settings?
I tried nearly every configuration for ES: default, 2GB, 16GB, 32GB. I can also confirm that the same issues happen on my second machine; however, there they do not seem to be related to a delete operation. I use 8 clearml-agents. Each generates ~1 million scalar metric datapoints per hour. My machines are all 32-core with 128GB RAM. I also run clearml-agents on the server machine, since the server does not use the GPUs. The machine is not reachable via ssh anymore. It just hangs. I never had this problem without a clearml-server, so it must be related to clearml-server somehow. However, I can still use the web interface; just no operation other than viewing will work.
The machine is not reachable via ssh anymore. It just hangs.
This kind of behavior in Linux usually happens when you've reached 100% disk usage - it sounds more system-related than specifically ClearML-server related. If you're storing lots of data and the ClearML server databases grow, that will of course increase disk usage (as will server logs) - that's the sort of thing that should always be monitored.
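A minimal disk-usage check with Python's standard library, suitable for a cron job (the 90% threshold is an arbitrary example):

```python
import shutil


def disk_usage_percent(path="/"):
    """Return used disk space at `path` as a percentage, like df's Use% column."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# e.g. warn when the volume holding /opt/clearml passes 90% full
if disk_usage_percent("/") > 90.0:
    print("warning: disk almost full")
```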
The server starts to react again (at least I can log in now; however, deleting/resetting experiments etc. still does not work). I get TransportError 429 again. The clearml-agents are still running, but are not listed in the WebUI, and all the experiments have not been updated for hours. But I can see the processes running on the machine. There is plenty of disk space available, and 64GB of RAM is free, too. I used the default config ES_JAVA_OPTS: -Xms2g -Xmx2g
for ES. To me it seems to be a problem with clearml-server, or something is misconfigured.
I have another idea: I try to create an experiment similar to mine metric-wise and will try to crash the demo-server. Then we know for sure whether it is machine-related or not.
Below you can see the used disk space and the space used for /opt/clearml
:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 3,2M 13G 1% /run
/dev/nvme0n1p3 1,8T 262G 1,5T 16% /
tmpfs 63G 1,6G 62G 3% /dev/shm
tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/loop1 56M 56M 0 100% /snap/core18/2128
/dev/loop4 219M 219M 0 100% /snap/gnome-3-34-1804/72
/dev/loop3 66M 66M 0 100% /snap/gtk-common-themes/1515
/dev/nvme0n1p2 953M 7,9M 945M 1% /boot/efi
/dev/loop5 219M 219M 0 100% /snap/gnome-3-34-1804/66
/dev/loop8 51M 51M 0 100% /snap/snap-store/547
/dev/loop9 51M 51M 0 100% /snap/snap-store/542
tmpfs 13G 20K 13G 1% /run/user/125
tmpfs 13G 4,0K 13G 1% /run/user/1001
/dev/loop7 128K 128K 0 100% /snap/bare/5
/dev/loop11 66M 66M 0 100% /snap/gtk-common-themes/1519
/dev/loop6 33M 33M 0 100% /snap/snapd/13640
/dev/loop10 56M 56M 0 100% /snap/core18/2246
tmpfs 13G 4,0K 13G 1% /run/user/1000
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/14cceeb46f8c41d266c7a7e9d9a06cc6eb3eb8dab282589a58ebb2dec440c48b/merged
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/db7bae9248cb8b421102ccd3cb2133786dc5cbccf9fa3a012e8bf82502fa41b7/merged
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/0064b73127a6473a9737f7d3818bc70fbdeb02d91ed554b1bf35c74cf519a523/merged
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/3763f000a8b2be8574ec9feef7f451d221bd22a753262e9e48869bc96f1c90ed/merged
shm 64M 0 64M 0% /var/lib/docker/containers/5dc57cfaafdd0069dedccc8278eaa9d4c7c83f9cfa0df0f7994f255b42f24355/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/f5020c175f3b4207d260e4c62c5ac9c8341ebe657970d7b7f5ededa30b75f200/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/b22a2c1ace9cb74072bd3e6b675fef78b05962e3d3e718af963f0cfe42599647/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/269cd814a68f7befe8a83b67ed5d6c4e41cc0acb4448d15d25750272015d4f9e/mounts/shm
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/4e316d9d5f286c4e3269e532061094fc1b5614fbe5402e8d1b66c437c3827114/merged
shm 64M 0 64M 0% /var/lib/docker/containers/4134d55d2c179cc3851038c7685090e6dbf8acce59f9f759c0be34259b66f901/mounts/shm
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/43d5b9329ddef32d36a080752aaa4dcc6ab1d255f4a370c778a0e19c6602e812/merged
shm 64M 0 64M 0% /var/lib/docker/containers/254c060b44590dc7d41897d4fe9cb8a813db8dbffde1ae3da18d5b198a33859c/mounts/shm
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/4a4737d7fc0942504ce4e1b3b4e79f9ac23cd2cf4120cd45e429666e92ed1914/merged
shm 64M 0 64M 0% /var/lib/docker/containers/67b625acadaec5c6c0b10980a9f56bb7a412be13dd9a33a35b68d96b61590e19/mounts/shm
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/dd11d91cd435910fe54fdd53263b519d6a8438d47bbab8311f8f2f58345c7009/merged
overlay 1,8T 262G 1,5T 16% /var/lib/docker/overlay2/518a8c442bb6aeb6228cf8ce5dee0ef2dfa9c91797c71235a2a10c3c59ddbba1/merged
/dev/loop0 43M 43M 0 100% /snap/snapd/13831
8,0K ./config
32K ./agent
22G ./data
38M ./logs
22G .
Basically, a TransportError 429 from ES means that ES is not able to keep up with the number of requests and that its internal queues are filling up - I think you should use a value higher than ES_JAVA_OPTS: -Xms2g -Xmx2g.
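For reference, a minimal sketch of raising the heap - the compose file path and the 4g sizing are assumptions, not ClearML defaults; tune to your host, keeping the ES heap at or below roughly half of available RAM:

```shell
# Sketch (path is an assumption): edit the elasticsearch service in the
# server's docker-compose file so its environment section reads, e.g.:
#     ES_JAVA_OPTS: -Xms4g -Xmx4g
# then recreate the service so it picks up the new heap size:
docker-compose -f /opt/clearml/docker-compose.yml up -d --force-recreate elasticsearch
```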
We run many servers with 8GB RAM and 4 CPUs (with much larger indices, so ES works considerably harder there), and I've never seen such behavior - I find it very hard to believe this is a ClearML Server configuration issue, since you're using the default configuration in most cases... Perhaps you are experiencing disk stability issues?
Mhhm, super hard to debug. I have tried on 2 different machines with the same config, and I find it highly unlikely that anything is wrong with the hardware. I also tried much higher values for ES_JAVA_OPTS, as I wrote before - no difference. And it is just a plain Ubuntu 20.04 installation without any changes aside from clearml-agent and clearml-server.
But nobody else writes here, so for now I have to assume it is just me and that there must be something wrong with either my clearml-server machines or the way I use clearml.
It's really a conundrum... and super-frustrating - I'd really like to get to the bottom of this 🙁 Can you maybe try to list everything you can regarding the environment/setup again? I feel like we're missing something here.
So I think I have found the reason for the instability in my case - user error, most probably (shame on me). I run clearml-agents on the same machine as my clearml-server. The agents run in docker mode with the docker arguments "--memory=48g", "--shm-size=48g". I assumed that if a container tried to use more than 48g of memory, it would just stop and not influence the server. However, these settings still allow the containers to use swap! My machines are configured with a very small amount of swap, since it should never be used. Most probably the swap filling up led to some kind of problem (even though there should still have been free RAM). I added "--memory-swap=48g" to the docker arguments, which prevents the containers from allocating any swap (see the docker docs). Now the clearml-server runs stably; I only get the known errors when I delete large experiments, because ES is slow, but that is to be expected.
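For anyone hitting the same thing, a sketch of the relevant docker flags (the values are the ones from this thread and the image name is hypothetical): per the docker docs, --memory caps RAM while --memory-swap caps RAM plus swap combined, so setting them equal disallows swap entirely.

```shell
# --memory sets the hard RAM limit for the container.
# --memory-swap sets the RAM + swap total; when it equals --memory,
# the container cannot use any swap at all.
docker run --memory=48g --memory-swap=48g --shm-size=48g \
    my-training-image   # hypothetical image name
```

With the agents, the same flags would go wherever your setup passes extra docker arguments to the agent containers.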
Just for completeness, because I do not want to change my setup to validate my reasoning: I also switched from tensorboardX to torch.utils.tensorboard. This is the only other change I made.
Sorry for wasting your time, but I am still happy the server seems to run stably now :)
EDIT: Unfortunately, this was not the solution. Only thing I can say for sure is that without clearml-agent on the same machine, it works fine.
I'm facing the same problem. Does this mean that adding swap memory can fix it? If so, which service should I add it to - the ClearML apiserver?
clearml.log - WARNING - failed logging task to backend (41 lines, <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>)
I found that this problem only occurred when docker containers other than the clearml-server ones were running. Try running only the clearml-server docker compose, with nothing else, and monitor whether it helps :/
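To monitor for the swap situation described above, standard Linux tools are enough; a full swap partition with RAM still free matches the symptom:

```shell
# Watch host memory and swap while the other containers are running.
free -h                                       # "Swap:" row shows total/used/free swap
grep -E 'SwapTotal|SwapFree' /proc/meminfo    # same numbers, script-friendly
```

If swap usage climbs while `free` still shows available RAM, the --memory-swap workaround from earlier in this thread is worth trying.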
This is similar to the problems we discussed here https://clearml.slack.com/archives/CTK20V944/p1622050547263400; however, they still exist with the current clearml-server and I can still reproduce them:
This behavior does not happen with a fresh server. Deletion works fine even with large experiments. After some time the problem arises and stays until I restart the server (so far, with older clearml-server versions it persisted after restart).
I will add error logs to this issue when the problems arise again. If anyone has similar experiences, please share them here so I can find out whether this is something specific to my machine.