elastic / rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0
1.95k stars 313 forks source link

Rally hangs indefinitely on OSX when running 'esrally race --revision' #1575

Open b-deam opened 2 years ago

b-deam commented 2 years ago

Rally version (get with esrally --version): Only attempted reproduction with main/master branch: esrally 2.6.1.dev0, but suspect more branches are affected.

Invoked command:

esrally race --revision="@2022-09-11" --track=geonames --test-mode --kill-running-processes --target-hosts=127.0.0.1:29200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --runtime-jdk=bundled

Configuration file (located in ~/.rally/rally.ini)):

[meta]
config.version = 17

[system]
env.name = local

[node]
root.dir = /Users/bradleydeam/.rally/benchmarks
src.root.dir = /Users/bradleydeam/.rally/benchmarks/src

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch
cache = false

[benchmarks]
local.dataset.cache = /Users/bradleydeam/.rally/benchmarks/data

[reporting]
datastore.type = in-memory

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = false

[distributions]
release.cache = true

JVM version: N/A (reproducible without JVM installed)

OS version:

Darwin MacBook-Pro-2.local 21.4.0 Darwin Kernel Version 21.4.0: Fri Mar 18 00:46:32 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T6000 x86_64

Description of the problem including expected versus actual behavior:

Expected: Rally installs and starts specified revision of Elasticsearch and plugin(s) and then runs the specified benchmark experiment against the cluster Actual: Rally hangs indefinitely, after having successfully built Elastiscsearch and any plugin(s) from source

Note that esrally install ... works as expected using the same code paths as esrally race --revision initially does. The only difference between the two is that esrally race --revision starts the Actor Sub system.

Steps to reproduce:

  1. Install Rally from master, python -m venv .venv && source .venv/bin/activate && make install
  2. Execute esrally race --revision="@2022-09-11" --track=geonames --test-mode --kill-running-processes --target-hosts=127.0.0.1:29200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --runtime-jdk=bundled or pytest -s it -k test_sources
  3. Watch as Rally hangs indefinitely:
    
    $ pytest -s -it -k test_sources
    [...]
    it/sources_test.py::test_sources[es-it]
    ____        ____
    / __ \____ _/ / /_  __
    / /_/ / __ `/ / / / / /
    / _, _/ /_/ / / / /_/ /
    /_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [2b313325-31b2-4f45-884d-c9ca86495c73] [INFO] Preparing for race ...


**Provide logs (if relevant)**:

`py-spy` dump:

$ pstree -s esrally -w -+= 00001 root /sbin/launchd |-+= 00766 bradleydeam /Library/Application Support/iTerm2/iTermServer-3.4.3 /Library/Application Support/iTerm2/iterm2-daemon-1.socket | -+= 00767 root /usr/bin/login -fpl bradleydeam /Applications/iTerm.app/Contents/MacOS/iTerm2 --launch_shell | -+= 00777 bradleydeam -zsh | -+= 92100 bradleydeam /Applications/Xcode.app/Contents/Developer/usr/bin/make it | -+- 92624 bradleydeam /bin/bash -c . /elastic/rally/.venv/bin/activate; tox -e py38-it | -+- 92625 bradleydeam /elastic/rally/.venv/bin/python3 /elastic/rally/.venv/bin/tox -e py38-it | -+- 92757 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/pytest -s it --junitxml=junit-py38-it.xml | -+- 98774 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it | --= 98777 bradleydeam (python3.8) -+- 98778 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it |--- 98779 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98780 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98803 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it -+- 98822 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it --- 98823 bradleydeam /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it

$ sudo py-spy dump --pid 98823 Process 98823: /elastic/rally/.tox/py38-it/bin/python /elastic/rally/.tox/py38-it/bin/esrally race --revision=latest --track=geonames --test-mode --target-hosts=127.0.0.1:19200 --challenge=append-no-conflicts --car=4gheap,basic-license --elasticsearch-plugins=analysis-icu --kill-running-processes --on-error=abort --enable-assertions --configuration-name=es-it Python v3.8.10 (/.pyenv/versions/3.8.10/bin/python3.8)

Thread 0x100710580 (active): "MainThread" exclusive_processing (thespian/system/transport/asyncTransportBase.py:46) enter (contextlib.py:113) _runWithExpiry (thespian/system/transport/TCPTransport.py:1094) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) drainTransmits (thespian/system/systemCommon.py:202) run (thespian/system/actorManager.py:112) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_StartEngine (esrally/mechanic/mechanic.py:486) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_StartEngine (esrally/mechanic/mechanic.py:392) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) createActor (thespian/system/actorManager.py:316) createActor (thespian/actors.py:189) receiveMsg_Setup (esrally/racecontrol.py:112) guard (esrally/actor.py:92) receiveMessage (thespian/actors.py:838) _handleOneMessage (thespian/system/actorManager.py:163) handleMessages (thespian/system/actorManager.py:121) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/actorManager.py:87) startChild (thespian/system/multiprocCommon.py:591) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startChildActor (thespian/system/multiprocCommon.py:346) h_PendingActor (thespian/system/admin/adminCore.py:318) h_PendingActor (thespian/system/admin/globalNames.py:19) handleIncoming (thespian/system/admin/adminCore.py:114) _runWithExpiry (thespian/system/transport/TCPTransport.py:1310) _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80) run (thespian/system/transport/wakeupTransportBase.py:71) run (thespian/system/admin/convention.py:643) startAdmin (thespian/system/multiprocCommon.py:207) run (multiprocessing/process.py:108) _bootstrap (multiprocessing/process.py:315) _launch (multiprocessing/popen_fork.py:75) init (multiprocessing/popen_fork.py:19) _Popen (multiprocessing/context.py:277) start (multiprocessing/process.py:121) _startAdmin (thespian/system/multiprocCommon.py:104) init (thespian/system/systemBase.py:326) init (thespian/system/multiprocCommon.py:86) init (thespian/system/multiprocTCPBase.py:28) _startupActorSys (thespian/actors.py:676) init (thespian/actors.py:635) bootstrap_actor_system (esrally/actor.py:263) with_actor_system (esrally/rally.py:775) race (esrally/rally.py:767) dispatch_sub_command (esrally/rally.py:991) main (esrally/rally.py:1082)

(esrally:8) ``` Setting Thespian debug logs with: ``` export THESPLOG_FILE="${THESPLOG_FILE:-${HOME}/.rally/logs/actor-system-internal.log}" export THESPLOG_FILE_MAXSIZE=${THESPLOG_FILE_MAXSIZE:-204800} export THESPLOG_THRESHOLD="DEBUG" ``` Allowed me to capture this as Rally 'hung': ``` 2022-09-12 13:22:58.969211 p16740 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999865----------quit_0:04:59.999734) 2022-09-12 13:24:35.664993 p16740 ERR Actor esrally.mechanic.mechanic.NodeMechanicActor @ ActorAddr-(T|:61053) transport run exception: Traceback (most recent call last): File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/actorManager.py", line 88, in run r = self.transport.run(self.handleMessages) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run rval = self._run_subtransport(incomingHandler, max_runtime) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport rval = self._runWithExpiry(incomingHandler) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1219, in _runWithExpiry self._acceptNewIncoming() File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1342, in _acceptNewIncoming lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) OSError: [Errno 22] Invalid argument 2022-09-12 13:24:35.665390 p16740 dbg Attempting intent TransportIntent(ActorAddr-(T|:61050)-pending-ExpiresIn_0:04:59.999884--ChildActorExited:ActorAddr-(T|:61053)-quit_0:04:59.999871) ```
Full actor debug logs ``` 2022-09-12 13:22:53.875894 p16688 dbg ++++ Starting Admin from /Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/__init__.py 2022-09-12 13:22:53.876248 p16688 I ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 10) 2022-09-12 13:22:53.884362 p16688 dbg Admin of ReceiveEnvelope(from: ActorAddr-(T|:61007), msg: ) 2022-09-12 13:22:53.884931 p16688 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61006)-pending-ExpiresIn_0:04:59.999856---quit_0:04:59.999847) 2022-09-12 13:22:53.887899 p16669 dbg actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999802---quit_0:04:59.999787) 2022-09-12 13:22:53.888433 p16688 dbg Admin of ReceiveEnvelope(from: ActorAddr-(T|:61006), msg: ) 2022-09-12 13:22:53.888944 p16688 dbg Attempting intent TransportIntent(ActorAddr-(T|:61006)-pending-ExpiresIn_0:04:59.999784---quit_0:04:59.999773) 2022-09-12 13:22:53.889216 p16688 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61006)-pending-ExpiresIn_0:04:59.999473---quit_0:04:59.999466) 2022-09-12 13:22:53.891642 p16669 dbg actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999928--PendingActor#1_of_None-quit_0:04:59.999919) 2022-09-12 13:22:53.894153 p16688 dbg Admin of ReceiveEnvelope(from: ActorAddr-(T|:61011), msg: PendingActor#1_of_None) 2022-09-12 13:22:53.894394 p16688 I Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-(T|:61011) 2022-09-12 13:22:53.894965 p16688 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999917----quit_0:04:59.999878) 2022-09-12 13:22:53.900451 p16688 dbg Attempting intent TransportIntent(ActorAddr-(T|:61011)-pending-ExpiresIn_0:04:59.999824--PendingActorResponse(for None inst# 1) errCode None actual ActorAddr-(T|:61013)-quit_0:04:59.999817) 2022-09-12 13:22:53.900716 p16688 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61011)-pending-ExpiresIn_0:04:59.999505--PendingActorResponse(for None inst# 1) errCode None actual ActorAddr-(T|:61013)-quit_0:04:59.999499) 2022-09-12 13:22:53.901609 p16669 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61013)-pending-ExpiresIn_0:04:59.999833---quit_0:04:59.999825) 2022-09-12 13:22:53.906689 p16690 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61006), msg: ) 2022-09-12 13:22:54.728708 p16690 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999890-------------quit_0:04:59.999882) 2022-09-12 13:22:54.986695 p16690 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999969----quit_0:04:59.999898) 2022-09-12 13:22:54.987705 p16690 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999967---quit_0:04:59.999959) 2022-09-12 13:22:54.988052 p16690 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999970----quit_0:04:59.999699) 2022-09-12 13:22:54.990282 p16690 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0 2022-09-12 13:22:54.991046 p16690 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61013), msg: ) 2022-09-12 13:22:54.991759 p16718 I Starting Actor esrally.mechanic.mechanic.MechanicActor at ActorAddr-(T|:61034) (parent ActorAddr-(T|:61013), admin ActorAddr-(T|:1900), srcHash None) 2022-09-12 13:22:54.992870 p16718 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61013)-pending-ExpiresIn_0:04:59.999306---quit_0:04:59.999296) 2022-09-12 13:22:55.000417 p16690 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61034), msg: ) 2022-09-12 13:22:55.000901 p16690 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:61034)-pending-ExpiresIn_0:04:59.988730---quit_0:04:59.988722) 2022-09-12 13:22:55.001487 p16690 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61034)-pending-ExpiresIn_0:04:59.988184---quit_0:04:59.988175) 2022-09-12 13:22:55.002233 p16718 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61013), msg: ) 2022-09-12 13:22:55.002662 p16718 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999949------------quit_0:04:59.999711) 2022-09-12 13:22:55.928101 p16718 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0 2022-09-12 13:22:55.928848 p16718 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61034), msg: ) 2022-09-12 13:22:55.929355 p16739 I Starting Actor esrally.mechanic.mechanic.Dispatcher at ActorAddr-(T|:61050) (parent ActorAddr-(T|:61034), admin ActorAddr-(T|:1900), srcHash None) 2022-09-12 13:22:55.929827 p16739 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61034)-pending-ExpiresIn_0:04:59.999873---quit_0:04:59.999863) 2022-09-12 13:22:55.935353 p16718 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61050), msg: ) 2022-09-12 13:22:55.935740 p16718 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:61050)-pending-ExpiresIn_0:04:59.991700---quit_0:04:59.991692) 2022-09-12 13:22:55.936455 p16718 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61050)-pending-ExpiresIn_0:04:59.991061---quit_0:04:59.991052) 2022-09-12 13:22:55.937301 p16739 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61034), msg: ) 2022-09-12 13:22:55.937950 p16739 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999932----quit_0:04:59.999666) 2022-09-12 13:22:55.940877 p16739 dbg CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0 2022-09-12 13:22:55.942104 p16740 I Starting Actor esrally.mechanic.mechanic.NodeMechanicActor at ActorAddr-(T|:61053) (parent ActorAddr-(T|:61050), admin ActorAddr-(T|:1900), srcHash None) 2022-09-12 13:22:55.942537 p16740 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61050)-pending-ExpiresIn_0:04:59.999866---quit_0:04:59.999856) 2022-09-12 13:22:55.944364 p16739 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61053), msg: ) 2022-09-12 13:22:55.944744 p16739 dbg Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:61053)-pending-ExpiresIn_0:04:59.995443---quit_0:04:59.995435) 2022-09-12 13:22:55.945263 p16739 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61053)-pending-ExpiresIn_0:04:59.994926---quit_0:04:59.994919) 2022-09-12 13:22:55.946014 p16740 dbg ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:61050), msg: ) 2022-09-12 13:22:55.946453 p16740 dbg actualTransmit of TransportIntent(ActorAddr-(T|:61007)-pending-ExpiresIn_0:04:59.999947-----------------------quit_0:04:59.999734) 2022-09-12 13:24:35.664993 p16740 ERR Actor esrally.mechanic.mechanic.NodeMechanicActor @ ActorAddr-(T|:61053) transport run exception: Traceback (most recent call last): File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/actorManager.py", line 88, in run r = self.transport.run(self.handleMessages) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run rval = self._run_subtransport(incomingHandler, max_runtime) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport rval = self._runWithExpiry(incomingHandler) File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1219, in _runWithExpiry self._acceptNewIncoming() File "/Users/bradleydeam/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py", line 1342, in _acceptNewIncoming lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) OSError: [Errno 22] Invalid argument 2022-09-12 13:24:35.665390 p16740 dbg Attempting intent TransportIntent(ActorAddr-(T|:61050)-pending-ExpiresIn_0:04:59.999884--ChildActorExited:ActorAddr-(T|:61053)-quit_0:04:59.999871) ```
Thanks to @pquentin, we found a workaround for this by commenting out line 1342 in [`.venv/lib/python3.8/site-packages/thespian/system/transport/TCPTransport.py`](https://github.com/thespianpy/Thespian/blob/581cc36e120493d9680bfead992d5e603e9d6068/thespian/system/transport/TCPTransport.py#L1342) ```diff - lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) + # lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) ``` We're not exactly sure what is happening (https://github.com/envoyproxy/envoy/issues/1446 suggests that perhaps we're trying to set an option on a socket that is shut down), but this appears to only affect OSX.
inqueue commented 2 years ago

On macOS, it only reproduces when Rally must invoke a build, otherwise, the race succeeds:

uname -a
Darwin grape.lan 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64 x86_64

The behavior is consistent with the rally.ini configuration for the local repository cache; i.e., do not cache artifacts:

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch
cache = false

During a successful benchmark, an stream of lsock objects are created at the start and they look like this:

_acceptNewIncoming:
      lsock=<socket.socket fd=29, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 52934), raddr=('192.168.5.116', 52967)>

Starting a run we know will fail (one without a cached ES artifact), lsock starts like this:

_acceptNewIncoming:
      lsock=<socket.socket fd=19, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 53029), raddr=('192.168.5.116', 53077)>

then ~85s later right before the failure point:

83.5s _acceptNewIncoming:
      lsock=<socket.socket fd=27, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 53523)>

The repro and error are the same every time. I don't know yet what any of this means, but it has me thinking that either: 1) the ES build is getting in the way, or 2) there is something wrong in the actor system or how it is used.

b-deam commented 2 years ago

1) the ES build is getting in the way,

You can stub out the build command with something simple (i.e. sleep 10) and the hanging is still reproducible, so I don't think it's related to the build causing a 'timeout' or something similar.

2) there is something wrong in the actor system or how it is used.

We should probably try and put together a minimal reproduction, but TBH I am not super familiar with how this part of the codebase works, and it's going to require quite some effort.

inqueue commented 2 years ago

The differences with the socket local and remote addresses might be a red herring. Since Rally uses thespian 3.10.1, I tried 3.10.3 just to see if it would have an impact and it did not. I suspect the issue has more to do with attempting to act on an invalid TCP socket. I can confirm the issue occurs somewhere in here, reliably, every time ES is built from source:

[2022-09-13T11:05:20,746][INFO ][o.e.x.s.c.f.PersistentCache] [rally-node-0] persistent cache index loaded
[2022-09-13T11:05:20,747][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [rally-node-0] deprecation component started
[2022-09-13T11:05:20,824][INFO ][o.e.t.TransportService   ] [rally-node-0] publish_address {127.0.0.1:29300}, bound_addresses {127.0.0.1:29300}
[2022-09-13T11:05:21,053][INFO ][o.e.c.c.ClusterBootstrapService] [rally-node-0] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to [rally-node-0]
[2022-09-13T11:05:21,059][INFO ][o.e.c.c.Coordinator      ] [rally-node-0] setting initial configuration to VotingConfiguration{_QoTerzwS5-mpL4yREcdVQ}
[2022-09-13T11:05:21,379][INFO ][o.e.c.s.MasterService    ] [rally-node-0] elected-as-master ([1] nodes joined)[_FINISH_ELECTION_, {rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw} completing elect
ion], term: 1, version: 1, delta: master node changed {previous [], current [{rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}]}
[2022-09-13T11:05:21,463][INFO ][o.e.c.c.CoordinationState] [rally-node-0] cluster UUID set to [u_62GO_3Q6qNgIN3pmNcyA]
[2022-09-13T11:05:21,589][INFO ][o.e.c.s.ClusterApplierService] [rally-node-0] master node changed {previous [], current [{rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}]}, term: 1, version: 1, r
eason: Publication{term=1, version=1}
[2022-09-13T11:05:21,632][INFO ][o.e.r.s.FileSettingsService] [rally-node-0] starting file settings watcher ...
[2022-09-13T11:05:21,637][INFO ][o.e.r.s.FileSettingsService] [rally-node-0] file settings service up and running [tid=45]
[2022-09-13T11:05:21,639][INFO ][o.e.h.AbstractHttpServerTransport] [rally-node-0] publish_address {127.0.0.1:29200}, bound_addresses {127.0.0.1:29200}
[2022-09-13T11:05:21,639][INFO ][o.e.n.Node               ] [rally-node-0] started {rally-node-0}{_QoTerzwS5-mpL4yREcdVQ}{PC4Ey4lhRc6PLajebkJ6Sg}{rally-node-0}{127.0.0.1}{127.0.0.1:29300}{cdfhilmrstw}{xpack.installed=true}
[2022-09-13T11:05:21,778][INFO ][o.e.g.GatewayService     ] [rally-node-0] recovered [0] indices into cluster_state
[2022-09-13T11:05:22,162][INFO ][o.e.c.m.MetadataIndexTemplateService] [rally-node-0] adding component template [.deprecation-indexing-mappings]
<continues>

If Rally does not have to initiate an ES build, the benchmark always succeeds.

rally.log

2022-09-13 11:03:56,765 ActorAddr-(T|:59122)/PID:44858 esrally.utils.repo INFO Rebasing on [master] in [/Users/jbryan/.rally/benchmarks/teams/default] for distribution version [None].
2022-09-13 11:03:56,804 ActorAddr-(T|:59122)/PID:44858 esrally.utils.process INFO Already on 'master'
Your branch is up to date with 'origin/master'.

2022-09-13 11:03:56,835 ActorAddr-(T|:59122)/PID:44858 esrally.utils.process INFO Current branch master is up to date.

2022-09-13 11:03:56,855 ActorAddr-(T|:59122)/PID:44858 esrally.mechanic.team INFO Loading plugin [analysis-icu] with default configuration.
2022-09-13 11:03:56,856 ActorAddr-(T|:59122)/PID:44858 esrally.actor INFO Cluster consisting of [{'host': '127.0.0.1', 'port': 29200}] will be provisioned by Rally.
2022-09-13 11:03:56,858 ActorAddr-(T|:59122)/PID:44858 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespi
an Watch Supported': True, 'Python Version': (3, 10, 4, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1663081435739'}] match requirements [{}].
2022-09-13 11:03:56,898 ActorAddr-(T|:59138)/PID:44884 esrally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespi
an Watch Supported': True, 'Python Version': (3, 10, 4, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1663081435739'}] match requirements [{'coordinator': True}].
2022-09-13 11:03:56,915 ActorAddr-(T|:59141)/PID:44885 esrally.actor INFO Starting node(s) [0] on [127.0.0.1].
(END)

Actor system log

2022-09-13 11:03:56.447174 p44836 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.metrics, 20, /Users/jbryan/dev/src/rally/esrally/metrics.py, 1239, "Creating file race store">-quit_0:04:59.999890)
2022-09-13 11:03:56.447769 p44836 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999940-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/racecontrol.py, 111, "Asking mechanic to start the engine.">-quit_0:04:59.999923)
2022-09-13 11:03:56.448454 p44836 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999931-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.99
9913)
2022-09-13 11:03:56.452935 p44836 dbg  Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999416-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.998350)
2022-09-13 11:03:56.453415 p44836 dbg  CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.455489 p44858 I    Starting Actor esrally.mechanic.mechanic.MechanicActor at ActorAddr-(T|:59122) (parent ActorAddr-(T|:59101), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.456596 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59101)-pending-ExpiresIn_0:04:59.999606-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2380>-quit_0:04:59.999562)
2022-09-13 11:03:56.456894 p44836 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59101), <class 'thespian.system.messages.multiproc.ChildMayHaveDied'> msg: <thespian.system.messages.multiproc.ChildMayHaveDied object at 0x10c4c2290>)
2022-09-13 11:03:56.486113 p44836 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2230>)
2022-09-13 11:03:56.487008 p44836 dbg  Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.964128-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.964107)
2022-09-13 11:03:56.487854 p44836 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.963275-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c4c0580>-quit_0:04:59.963258)
2022-09-13 11:03:56.489045 p44858 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59101), <class 'esrally.mechanic.mechanic.StartEngine'> msg: <esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>)
2022-09-13 11:03:56.489814 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 366, "Received signal from race control to start engine.">-quit_0:04:59.999891)
2022-09-13 11:03:56.743671 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999772-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 68, "Checking out [master] in [/Users/jbryan/.rally/benchmarks/...-quit_0:04:59.999733)
2022-09-13 11:03:56.764912 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999740-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Switched to branch 'master'
Your branch is up to dat...-quit_0:04:59.999715)
2022-09-13 11:03:56.765567 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999937-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 72, "Rebasing on [master] in [/Users/jbryan/.rally/benchmarks/t...-quit_0:04:59.999921)
2022-09-13 11:03:56.804984 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999742-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Already on 'master'
Your branch is up to date with '...-quit_0:04:59.999715)
2022-09-13 11:03:56.836045 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999766-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Current branch master is up to date.
">-quit_0:04:59.999740)
2022-09-13 11:03:56.856111 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999794-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.team, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/team.py, 339, "Loading plugin [analysis-icu] with default configur...-quit_0:04:59.999771)
2022-09-13 11:03:56.857504 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999924-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 387, "Cluster consisting of [{'host': '127.0.0.1', 'port': 29...-quit_0:04:59.999906)
2022-09-13 11:03:56.858302 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999838-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.999821)
2022-09-13 11:03:56.861429 p44858 dbg  Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999415-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.999357)
2022-09-13 11:03:56.861996 p44858 dbg  CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.864175 p44884 I    Starting Actor esrally.mechanic.mechanic.Dispatcher at ActorAddr-(T|:59138) (parent ActorAddr-(T|:59122), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.865495 p44884 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.999467-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c47fb20>-quit_0:04:59.999421)
2022-09-13 11:03:56.866421 p44858 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'thespian.system.messages.multiproc.ChildMayHaveDied'> msg: <thespian.system.messages.multiproc.ChildMayHaveDied object at 0x10c47fa30>)
2022-09-13 11:03:56.893966 p44858 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59138), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4c2320>)
2022-09-13 11:03:56.894771 p44858 dbg  Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.965815-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.965797)
2022-09-13 11:03:56.895724 p44858 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.964839-<class 'esrally.mechanic.mechanic.StartEngine'>-<esrally.mechanic.mechanic.StartEngine object at 0x10c47f130>-quit_0:04:59.964822)
2022-09-13 11:03:56.897395 p44884 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59122), <class 'esrally.mechanic.mechanic.StartEngine'> msg: <esrally.mechanic.mechanic.StartEngine object at 0x10c4e7a30>)
2022-09-13 11:03:56.898738 p44884 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999878-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/actor.py, 124, "Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention ...-quit_0:04:59.999859)
2022-09-13 11:03:56.902820 p44884 dbg  Attempting intent TransportIntent(ActorAddr-LocalAddr.0-pending-ExpiresIn_0:04:59.999394-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.999342)
2022-09-13 11:03:56.903376 p44884 dbg  CannotPickleAddress, appending intent for ActorAddr-LocalAddr.0
2022-09-13 11:03:56.905181 p44885 I    Starting Actor esrally.mechanic.mechanic.NodeMechanicActor at ActorAddr-(T|:59141) (parent ActorAddr-(T|:59138), admin ActorAddr-(T|:1900), srcHash None)
2022-09-13 11:03:56.906326 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.999581-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x10c4e47f0>-quit_0:04:59.999537)
2022-09-13 11:03:56.911268 p44884 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59141), <class 'thespian.system.messages.multiproc.EndpointConnected'> msg: <thespian.system.messages.multiproc.EndpointConnected object at 0x10c4e7c40>)
2022-09-13 11:03:56.912409 p44884 dbg  Attempting intent TransportIntent(ActorAddr-0-ActorAddr-(T|:59141)-pending-ExpiresIn_0:04:59.989520-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.989499)
2022-09-13 11:03:56.913529 p44884 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59141)-pending-ExpiresIn_0:04:59.988366-<class 'esrally.mechanic.mechanic.StartNodes'>-<esrally.mechanic.mechanic.StartNodes object at 0x10c4e79d0>-quit_0:04:59.988347)
2022-09-13 11:03:56.914883 p44885 dbg  ACTOR got ReceiveEnvelope(from: ActorAddr-(T|:59138), <class 'esrally.mechanic.mechanic.StartNodes'> msg: <esrally.mechanic.mechanic.StartNodes object at 0x10c4e57e0>)
2022-09-13 11:03:56.915862 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999846-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/mechanic.py, 552, "Starting node(s) [0] on [127.0.0.1].">-quit_0:04:59.999826)
2022-09-13 11:03:56.920998 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999909-<class 'logging.LogRecord'>-<LogRecord: esrally.metrics, 20, /Users/jbryan/dev/src/rally/esrally/metrics.py, 408, "Opening metrics store for race timestamp=[20220913T150348Z], tr...-quit_0:04:59.99
9889)
2022-09-13 11:03:57.165369 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999780-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.repo, 20, /Users/jbryan/dev/src/rally/esrally/utils/repo.py, 127, "Checking out revision [c18ab4d] in [/Users/jbryan/.rally/...-quit_0:04:59.99
9754)
2022-09-13 11:03:57.199535 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999768-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Note: switching to 'c18ab4d'.

You are in 'detached ...-quit_0:04:59.999741)
2022-09-13 11:03:57.202417 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999928-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.team, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/team.py, 339, "Loading plugin [analysis-icu] with default configur...-quit_0:04:59.99
9909)
2022-09-13 11:03:57.204056 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999933-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 74, "Enabling source artifact caching.">-quit_0:04:59.999915)
2022-09-13 11:03:57.206042 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999941-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 115, "Adding core plugin source supplier for [ana...-quit_0:04:59.99
9924)
2022-09-13 11:03:57.206654 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999943-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.java_resolver, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/java_resolver.py, 44, "Using JDK bundled with Elasticsear...-quit_0:04:59.99
9926)

Noting a break here since it takes a moment to download ES source.

2022-09-13 11:05:00.271099 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999732-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Cloning into '/Users/jbryan/.rally/benchmarks/src/el...-quit_0:04:59.99
9702)
2022-09-13 11:05:00.271966 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999918-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 630, "Fetching from remote and checking out revis...-quit_0:04:59.99
9898)
2022-09-13 11:05:01.300922 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999746-<class 'logging.LogRecord'>-<LogRecord: esrally.utils.process, 20, /Users/jbryan/dev/src/rally/esrally/utils/process.py, 95, "Note: switching to '2c4731e900b4da94503fa570a27547c2...-quit_0:04:59.99
9719)
2022-09-13 11:05:01.319588 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999767-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 640, "User-specified revision [@2022-09-11] for [...-quit_0:04:59.99
9741)
2022-09-13 11:05:01.320486 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999902-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 626, "Skip fetching sources for Elasticsearch.">-quit_0:04:59.999883
)
2022-09-13 11:05:01.347340 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999706-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 640, "User-specified revision [current] for [Elas...-quit_0:04:59.99
9677)
2022-09-13 11:05:01.348167 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999928-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 342, "Using cached artifact in [/Users/jbryan/.ra...-quit_0:04:59.99
9908)
2022-09-13 11:05:01.348863 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999887-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.supplier, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/supplier.py, 342, "Using cached artifact in [/Users/jbryan/.ra...-quit_0:04:59.99
9866)
2022-09-13 11:05:01.349631 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999925-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 275, "Preparing candidate locally in [/User...-quit_0:04:59.99
9905)
2022-09-13 11:05:01.353926 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999668-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 280, "Unzipping /Users/jbryan/.rally/benchm...-quit_0:04:59.999645)
2022-09-13 11:05:05.310665 p44885 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:59095)-pending-ExpiresIn_0:04:59.999753-<class 'logging.LogRecord'>-<LogRecord: esrally.mechanic.provisioner, 20, /Users/jbryan/dev/src/rally/esrally/mechanic/provisioner.py, 287, "Deleting pre-bundled Elasticsearch co...-quit_0:04:59.999692)
2022-09-13 11:05:21.996964 p44885 dbg  Attempting intent TransportIntent(ActorAddr-(T|:59122)-pending-ExpiresIn_0:04:59.999525-<class 'esrally.mechanic.mechanic.NodesStarted'>-<esrally.mechanic.mechanic.NodesStarted object at 0x10c4f2fe0>-quit_0:04:59.999501)
2022-09-13 11:05:22.002274 p44885 ERR  Actor esrally.mechanic.mechanic.NodeMechanicActor @ ActorAddr-(T|:59141) transport run exception: Traceback (most recent call last):
  File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/actorManager.py", line 87, in run
    r = self.transport.run(self.handleMessages)
  File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 71, in run
    rval = self._run_subtransport(incomingHandler, max_runtime)
  File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/wakeupTransportBase.py", line 80, in _run_subtransport
    rval = self._runWithExpiry(incomingHandler)
  File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1219, in _runWithExpiry
    self._acceptNewIncoming()
  File "/Users/jbryan/dev/src/rally/.venv/lib/python3.10/site-packages/thespian/system/transport/TCPTransport.py", line 1346, in _acceptNewIncoming
    lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
OSError: [Errno 22] Invalid argument

2022-09-13 11:05:22.003103 p44885 dbg  Attempting intent TransportIntent(ActorAddr-(T|:59138)-pending-ExpiresIn_0:04:59.999732-<class 'thespian.actors.ChildActorExited'>-ChildActorExited:ActorAddr-(T|:59141)-quit_0:04:59.999701)

The Actor system error occurring right at esrally.mechanic.mechanic.NodesStarted is suspicious. Like @b-deam says, I think it is going to require some minimal reproduction to track it down.

pquentin commented 2 years ago

I reported the issue upstream: https://github.com/thespianpy/Thespian/issues/70

pquentin commented 2 years ago

Right now if anyone faces the issue I'd recommend to apply the local workaround until we hear more from the Thespian side.

inqueue commented 1 year ago

I encountered another way to reproduce the issue after forgetting to use the workaround. The stall can happen directly after downloading elastic/logs track corpora and building file offsets. It reproduces pretty reliably on macOS.

https://gist.github.com/inqueue/ddb5c28bb0512ebdc26391bd32e10d91