Open b-deam opened 2 years ago
On macOS, it only reproduces when Rally must invoke a build, otherwise, the race succeeds:
uname -a
Darwin grape.lan 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64 x86_64
The behavior is consistent with the rally.ini configuration for the local repository cache; i.e., do not cache artifacts:
[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch
cache = false
During a successful benchmark, an stream of lsock
objects are created at the start and they look like this:
_acceptNewIncoming:
lsock=<socket.socket fd=29, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 52934), raddr=('192.168.5.116', 52967)>
Starting a run we know will fail (one without a cached ES artifact), lsock
starts like this:
_acceptNewIncoming:
lsock=<socket.socket fd=19, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 53029), raddr=('192.168.5.116', 53077)>
then ~85s later right before the failure point:
83.5s _acceptNewIncoming:
lsock=<socket.socket fd=27, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 53523)>
The repro and error are the same every time. I don't know yet what any of this means, but it has me thinking that either: 1) the ES build is getting in the way, or 2) there is something wrong in the actor system or how it is used.
1) the ES build is getting in the way,
You can stub out the build command with something simple (i.e. sleep 10
) and the hanging is still reproducible, so I don't think it's related to the build causing a 'timeout' or something similar.
2) there is something wrong in the actor system or how it is used.
We should probably try and put together a minimal reproduction, but TBH I am not super familiar with how this part of the codebase works, and it's going to require quite some effort.
The differences with the socket local and remote addresses might be a red herring. Since Rally uses thespian 3.10.1, I tried 3.10.3 just to see if it would have an impact and it did not. I suspect the issue has more to do with attempting to act on an invalid TCP socket. I can confirm the issue occurs somewhere in here, reliably, every time ES is built from source:
[2022-09-13T11:05:20,746][INFO ][o.e.x.s.c.f.PersistentCache] [rally-node-0] persistent cache index loaded
[2022-09-13T11:05:20,747][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [rally-node-0] deprecation component started
[2022-09-13T11:05:20,824][INFO ][o.e.t.TransportService ] [rally-node-0] publish_address {127.0.0.1:29300}, bound_addresses {127.0.0.1:29300}
[2022-09-13T11:05:21,053][INFO ][o.e.c.c.ClusterBootstrapService] [rally-node-0] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to [rally-node-0]
[2022-09-13T11:05:21,059][INFO ][o.e.c.c.Coordinator ] [rally-node-0] setting initial configuration to VotingConfiguration{_QoTerzwS5-mpL4yREcdVQ}
[2022-09-13T11:05:21,379][INFO ][o.e.c.s.MasterService ] [rally-node-0] elected-as-master ([1] nodes joined)[_FINISH_ELECTION_, {rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw} completing elect
ion], term: 1, version: 1, delta: master node changed {previous [], current [{rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}]}
[2022-09-13T11:05:21,463][INFO ][o.e.c.c.CoordinationState] [rally-node-0] cluster UUID set to [u_62GO_3Q6qNgIN3pmNcyA]
[2022-09-13T11:05:21,589][INFO ][o.e.c.s.ClusterApplierService] [rally-node-0] master node changed {previous [], current [{rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}]}, term: 1, version: 1, r
eason: Publication{term=1, version=1}
[2022-09-13T11:05:21,632][INFO ][o.e.r.s.FileSettingsService] [rally-node-0] starting file settings watcher ...
[2022-09-13T11:05:21,637][INFO ][o.e.r.s.FileSettingsService] [rally-node-0] file settings service up and running [tid=45]
[2022-09-13T11:05:21,639][INFO ][o.e.h.AbstractHttpServerTransport] [rally-node-0] publish_address {127.0.0.1:29200}, bound_addresses {127.0.0.1:29200}
[2022-09-13T11:05:21,639][INFO ][o.e.n.Node ] [rally-node-0] started {rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}{xpack.installed=true}
[2022-09-13T11:05:21,778][INFO ][o.e.g.GatewayService ] [rally-node-0] recovered [0] indices into cluster_state
[2022-09-13T11:05:22,162][INFO ][o.e.c.m.MetadataIndexTemplateService] [rally-node-0] adding component template [.deprecation-indexing-mappings]
<continues>
If Rally does not have to initiate an ES build, the benchmark always succeeds.
2022-09-13 11:03:56,765 ActorAddr-(T|:59122)/PID:44858 esrally.utils.repo INFO Rebasing on [master] in [/Users/jbryan/.rally/benchmarks/teams/default] for distribution version [None].
2022-09-13 11:03:56,804 ActorAddr-(T|:59122)/PID:44858 esrally.utils.process INFO Already on 'master'
Your branch is up to date with 'origin/master'.
2022-09-13 11:03:56,835 ActorAddr-(T|:59122)/PID:44858 esrally.utils.process INFO Current branch master is up to date.
2022-09-13 11:03:56,855 ActorAddr-(T|:59122)/PID:44858 esrally.mechanic.team INFO Loading plugin [analysis-icu] with default configuration.
2022-09-13 11:03:56,856 ActorAddr-(T|:59122)/PID:44858 esrally.actor INFO Cluster consisting of [{'host': '127.0.0.1', 'port': 29200}] will be provisioned by Rally.
2022-09-13 11:03:56,858 ActorAddr-(T|:59122)/PID:44858 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespi
an Watch Supported': True, 'Python Version': (3, 10, 4, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1663081435739'}] match requirements [{}].
2022-09-13 11:03:56,898 ActorAddr-(T|:59138)/PID:44884 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespi
an Watch Supported': True, 'Python Version': (3, 10, 4, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1663081435739'}] match requirements [{'coordinator': True}].
2022-09-13 11:03:56,915 ActorAddr-(T|:59141)/PID:44885 esrally.actor INFO Starting node(s) [0] on [127.0.0.1].
(END)
2022-09-13 11:03:56.447174 p44836 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.metrics, 20, /Users/jbryan/dev/src/rally/esrally/metrics.py, 1239, "Creating file race store">-quit_0:04:59.999890)
2022-09-13 11:03:56.447769 p44836 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999940-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/racecontrol.py, 111, "Asking mechanic to start the engine.">-quit_0:04:59.999923)
2022-09-13 11:03:56.448454 p44836 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999931-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.99
9913)
2022-09-13 11:03:56.452935 p44836 dbg Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999416-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.998350)
2022-09-13 11:03:56.453415 p44836 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.455489 p44858 I Starting Actor esrally.mechanic.mechanic.MechanicActor at ActorAddr-(T|:59122) (parent ActorAddr-(T|:59101), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.456596 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59101)-pending-ExpiresIn_0:04:59.999606-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2380>-quit_0:04:59.999562)
2022-09-13 11:03:56.456894 p44836 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59101), <class 'thespian.system.messages.multiproc.ChildMayHaveDied'> msg: <thespian.system.messages.multiproc.ChildMayHaveDied object at 0x10c4c2290>)
2022-09-13 11:03:56.486113 p44836 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2230>)
2022-09-13 11:03:56.487008 p44836 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.964128-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.964107)
2022-09-13 11:03:56.487854 p44836 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.963275-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.963258)
2022-09-13 11:03:56.489045 p44858 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59101), <class 'esrally.mechanic.mechanic.StartEngine'> msg: <esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>)
2022-09-13 11:03:56.489814 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 366, "Received signal from race control to start engine.">-quit_0:04:59.999891)
2022-09-13 11:03:56.743671 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999772-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 68, "Checking out [master] in [/Users/jbryan/.rally/benchmarks/...-quit_0:04:59.999733)
2022-09-13 11:03:56.764912 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999740-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Switched to branch 'master'
Your branch is up to dat...-quit_0:04:59.999715)
2022-09-13 11:03:56.765567 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999937-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 72, "Rebasing on [master] in [/Users/jbryan/.rally/benchmarks/t...-quit_0:04:59.999921)
2022-09-13 11:03:56.804984 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999742-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Already on 'master'
Your branch is up to date with '...-quit_0:04:59.999715)
2022-09-13 11:03:56.836045 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999766-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Current branch master is up to date.
">-quit_0:04:59.999740)
2022-09-13 11:03:56.856111 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999794-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.team, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/team.py, 339, "Loading plugin [analysis-icu] with default configur...-quit_0:04:59.999771)
2022-09-13 11:03:56.857504 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999924-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 387, "Cluster consisting of [{'host': '127.0.0.1', 'port': 29...-quit_0:04:59.999906)
2022-09-13 11:03:56.858302 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999838-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.999821)
2022-09-13 11:03:56.861429 p44858 dbg Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999415-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.999357)
2022-09-13 11:03:56.861996 p44858 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.864175 p44884 I Starting Actor esrally.mechanic.mechanic.Dispatcher at ActorAddr-(T|:59138) (parent ActorAddr-(T|:59122), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.865495 p44884 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.999467-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c47fb20>-quit_0:04:59.999421)
2022-09-13 11:03:56.866421 p44858 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'thespian.system.messages.multiproc.ChildMayHaveDied'> msg: <thespian.system.messages.multiproc.ChildMayHaveDied object at 0x10c47fa30>)
2022-09-13 11:03:56.893966 p44858 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59138), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2320>)
2022-09-13 11:03:56.894771 p44858 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.965815-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.965797)
2022-09-13 11:03:56.895724 p44858 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.964839-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.964822)
2022-09-13 11:03:56.897395 p44884 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'esrally.mechanic.mechanic.StartEngine'> msg: <esrally.mechanic.mechanic.StartEngine object at 0x10c4e7a30>)
2022-09-13 11:03:56.898738 p44884 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999878-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.999859)
2022-09-13 11:03:56.902820 p44884 dbg Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999394-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.999342)
2022-09-13 11:03:56.903376 p44884 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.905181 p44885 I Starting Actor esrally.mechanic.mechanic.NodeMechanicActor at ActorAddr-(T|:59141) (parent ActorAddr-(T|:59138), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.906326 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.999581-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c4e47f0>-quit_0:04:59.999537)
2022-09-13 11:03:56.911268 p44884 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59141), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4e7c40>)
2022-09-13 11:03:56.912409 p44884 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59141)-pending-ExpiresIn_0:04:59.989520-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.989499)
2022-09-13 11:03:56.913529 p44884 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59141)-pending-ExpiresIn_0:04:59.988366-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.988347)
2022-09-13 11:03:56.914883 p44885 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59138), <class 'esrally.mechanic.mechanic.StartNodes'> msg: <esrally.mechanic.mechanic.StartNodes object at 0x10c4e57e0>)
2022-09-13 11:03:56.915862 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999846-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 552, "Starting node(s) [0] on [127.0.0.1].">-quit_0:04:59.999826)
2022-09-13 11:03:56.920998 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.metrics, 20, /Users/jbryan/dev/src/rally/esrally/metrics.py, 408, "Opening metrics store for race timestamp=[20220913T150348Z], tr...-quit_0:04:59.99
9889)
2022-09-13 11:03:57.165369 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999780-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 127, "Checking out revision [c18ab4d] in [/Users/jbryan/.rally/...-quit_0:04:59.99
9754)
2022-09-13 11:03:57.199535 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999768-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Note: switching to 'c18ab4d'.
You are in 'detached ...-quit_0:04:59.999741)
2022-09-13 11:03:57.202417 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999928-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.team, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/team.py, 339, "Loading plugin [analysis-icu] with default configur...-quit_0:04:59.99
9909)
2022-09-13 11:03:57.204056 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999933-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 74, "Enabling source artifact caching.">-quit_0:04:59.999915)
2022-09-13 11:03:57.206042 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999941-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 115, "Adding core plugin source supplier for [ana...-quit_0:04:59.99
9924)
2022-09-13 11:03:57.206654 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999943-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.java_resolver, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/java_resolver.py, 44, "Using JDK bundled with Elasticsear...-quit_0:04:59.99
9926)
Noting a break here since it takes a moment to download ES source.
2022-09-13 11:05:00.271099 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999732-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Cloning into '/Users/jbryan/.rally/benchmarks/src/el...-quit_0:04:59.99
9702)
2022-09-13 11:05:00.271966 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999918-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 630, "Fetching from remote and checking out revis...-quit_0:04:59.99
9898)
2022-09-13 11:05:01.300922 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999746-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Note: switching to '2c4731e900b4da94503fa570a27547c2...-quit_0:04:59.99
9719)
2022-09-13 11:05:01.319588 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999767-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 640, "User-specified revision [@2022-09-11] for [...-quit_0:04:59.99
9741)
2022-09-13 11:05:01.320486 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999902-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 626, "Skip fetching sources for Elasticsearch.">-quit_0:04:59.999883
)
2022-09-13 11:05:01.347340 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999706-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 640, "User-specified revision [current] for [Elas...-quit_0:04:59.99
9677)
2022-09-13 11:05:01.348167 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999928-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 342, "Using cached artifact in [/Users/jbryan/.ra...-quit_0:04:59.99
9908)
2022-09-13 11:05:01.348863 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999887-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 342, "Using cached artifact in [/Users/jbryan/.ra...-quit_0:04:59.99
9866)
2022-09-13 11:05:01.349631 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999925-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 275, "Preparing candidate locally in [/User...-quit_0:04:59.99
9905)
2022-09-13 11:05:01.353926 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999668-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 280, "Unzipping /Users/jbryan/.rally/benchm...-quit_0:04:59.999645)
2022-09-13 11:05:05.310665 p44885 dbg actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999753-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 287, "Deleting pre-bundled Elasticsearch co...-quit_0:04:59.999692)
2022-09-13 11:05:21.996964 p44885 dbg Attempting intent TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.999525-<class 'esrally.mechanic.mechanic.NodesStarted'>-<esrally.mechanic.mechanic.NodesStarted object at 0x10c4f2fe0>-quit_0:04:59.999501)
2022-09-13 11:05:22.002274 p44885 ERR Actor esrally.mechanic.mechanic.NodeMechanicActor @ ActorAddr-(T|:59141) transport run exception: Traceback (most recent call last):
File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/actorManager.py", line 87, in run
r = self.transport.run(self.handleMessages)
File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run
rval = self._run_subtransport(incomingHandler, max_runtime)
File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
rval = self._runWithExpiry(incomingHandler)
File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1219, in _runWithExpiry
self._acceptNewIncoming()
File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1346, in _acceptNewIncoming
lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
OSError: [Errno 22] Invalid argument
2022-09-13 11:05:22.003103 p44885 dbg Attempting intent TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.999732-<class 'thespian.actors.ChildActorExited'>-ChildActorExited:ActorAddr-(T|:59141)-quit_0:04:59.999701)
The Actor system error occurring right at esrally.mechanic.mechanic.NodesStarted
is suspicious. Like @b-deam says, I think it is going to require some minimal reproduction to track it down.
I reported the issue upstream: https://github.com/thespianpy/Thespian/issues/70
Right now if anyone faces the issue I'd recommend to apply the local workaround until we hear more from the Thespian side.
I encountered another way to reproduce the issue after forgetting to use the workaround. The stall can happen directly after downloading elastic/logs track corpora and building file offsets. It reproduces pretty reliably on macOS.
https://gist.github.com/inqueue/ddb5c28bb0512ebdc26391bd32e10d91
Rally version (get with
esrally --version
): Only attempted reproduction with main/master branch:esrally 2.6.1.dev0
, but suspect more branches are affected.Invoked command:
Configuration file (located in
~/.rally/rally.ini
)):JVM version: N/A (reproducible without JVM installed)
OS version:
Description of the problem including expected versus actual behavior:
Expected: Rally installs and starts specified revision of Elasticsearch and plugin(s) and then runs the specified benchmark experiment against the cluster Actual: Rally hangs indefinitely, after having successfully built Elastiscsearch and any plugin(s) from source
Note that
esrally install ...
works as expected using the same code paths asesrally race --revision
initially does. The only difference between the two is thatesrally race --revision
starts the Actor Sub system.Steps to reproduce:
python -m venv .venv && source .venv/bin/activate && make install
esrally race --revision="@2022-09-11" --track=geonames --test-mode --kill-running-processes --target-hosts=127.0.0.1:29200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --runtime-jdk=bundled
orpytest -s it -k test_sources
[INFO] Race id is [2b313325-31b2-4f45-884d-c9ca86495c73] [INFO] Preparing for race ...
$ pstree -s esrally -w -+= 00001 root /sbin/launchd |-+= 00766 bradleydeam /Library/Application Support/iTerm2/iTermServer-3.4.3 /Library/Application Support/iTerm2/iterm2-daemon-1.socket | -+= 00767 root /usr/bin/login -fpl bradleydeam /Applications/iTerm.app/Contents/MacOS/iTerm2 --launch_shell | -+= 00777 bradleydeam -zsh | -+= 92100 bradleydeam /Applications/Xcode.app/Contents/Developer/usr/bin/make it | -+- 92624 bradleydeam /bin/bash -c . /elastic/rally/.venv/bin/activate; tox -e py38-it | -+- 92625 bradleydeam /elastic/rally/.venv/bin/python3 /elastic/rally/.venv/bin/tox -e py38-it | -+- 92757 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/pytest -s it --junitxml=junit-py38-it.xml | -+- 98774 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it | --= 98777 bradleydeam (python3.8) -+- 98778 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it |--- 98779 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98780 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98803 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98822 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it --- 98823 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it
$ sudo py-spy dump --pid 98823 Process 98823: /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it Python v3.8.10 (/.pyenv/versions/3.8.10/bin/python3.8)
Thread 0x100710580 (active): "MainThread" exclusive_processing (thespian/system/transport/asyncTransportBase.py:46) enter (contextlib.py:113) _runWithExpiry (thespian/system/transport/TCPTransport.py:1094) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) drainTransmits (thespian/system/systemCommon.py:202) run (thespian/system/actorManager.py:112) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_StartEngine (esrally/mechanic/mechanic.py:486) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_StartEngine (esrally/mechanic/mechanic.py:392) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_Setup (esrally/racecontrol.py:112) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) h_PendingActor (thespian/system/admin/adminCore.py:318) h_PendingActor (thespian/system/admin/globalNames.py:19) handleIncoming (thespian/system/admin/adminCore.py:114) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/admin/convention.py:643) startAdmin (thespian/system/multiprocCommon.py:207) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startAdmin (thespian/system/multiprocCommon.py:104) init (thespian/system/systemBase.py:326) init (thespian/system/multiprocCommon.py:86) init (thespian/system/multiprocTCPBase.py:28) _startupActorSys (thespian/actors.py:676) init (thespian/actors.py:635) bootstrap_actor_system (esrally/actor.py:263) with_actor_system (esrally/rally.py:775) race (esrally/rally.py:767) dispatch_sub_command (esrally/rally.py:991) main (esrally/rally.py:1082)
Full actor debug logs
``` 2022-09-12 13:22:53.875894 p16688 dbg ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py 2022-09-12 13:22:53.876248 p16688 I ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10) 2022-09-12 13:22:53.884362 p16688 dbg Admin of ReceiveEnvelope(from: ActorAddr-(T|:61007),