avocado-framework / avocado

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.
https://avocado-framework.github.io/

Avocado runner went <defunct> without sufficient debugging information while running the same unit test #5031

Open pevogam opened 2 years ago

pevogam commented 2 years ago

Is your feature request related to a problem? Please describe. While running Avocado's self tests (and trying to incorporate them into our own CI), the run froze at a test (stuck for >10 hours now, and technically indefinitely):

 (check-675/1048) selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: STARTED
 (check-675/1048) selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: PASS (0.66 s)
 (check-676/1048) selftests/functional/test_basic.py:RunnerOperationTest.test_show_version: STARTED

Investigating the test process shows that it seems to have gone defunct:

  28897 pts/4    S+     4:13  |                   |               \_ /usr/bin/python3 /root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/bin/avocado-runner-python-unittest task-run -i check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -j 6ad308aa80ec4cbdfa79d964a582d482752a2c4d -k python-unittest -u selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -c {"core.input_encoding": "utf-8", "core.show": {"__encoded_set__": ["app"]}, "datadir.paths.base_dir": "/root/avocado", "datadir.paths.cache_dirs": ["/root/avocado/data/cache"], "datadir.paths.data_dir": "/root/avocado/data", "datadir.paths.logs_dir": "/mnt/local/results", "datadir.paths.test_dir": "/usr/share/doc/avocado/tests", "human_ui.omit.statuses": [], "job.output.loglevel": "DEBUG", "job.output.testlogs.logfiles": ["debug.log"], "job.output.testlogs.statuses": ["FAIL"], "job.run.result.html.enabled": true, "job.run.result.html.open_browser": false, "job.run.result.html.output": null, "job.run.result.json.enabled": true, "job.run.result.json.output": null, "job.run.result.tap.enabled": true, "job.run.result.tap.include_logs": false, "job.run.result.tap.output": null, "job.run.result.xunit.enabled": true, "job.run.result.xunit.job_name": null, "job.run.result.xunit.max_test_log_chars": 100000, "job.run.result.xunit.output": null, "job.run.store_logging_stream": ["avocado.core:DEBUG"], "job.run.timeout": 0, "json.variants.load": null, "nrunner.max_parallel_tasks": 1, "nrunner.shuffle": false, "nrunner.spawner": "process", "nrunner.status_server_auto": true, "nrunner.status_server_buffer_size": 33554432, "nrunner.status_server_listen": "127.0.0.1:8888", "nrunner.status_server_uri": "127.0.0.1:8888", "plugins.cli.cmd.order": [], "plugins.cli.order": [], "plugins.disable": [], "plugins.init.order": [], "plugins.job.prepost.order": [], "plugins.jobscripts.post": "/etc/avocado/scripts/job/post.d/", "plugins.jobscripts.pre": "/etc/avocado/scripts/job/pre.d/", "plugins.jobscripts.warn_non_existing_dir": false, "plugins.jobscripts.warn_non_zero_status": true, "plugins.resolver.order": [], "plugins.result.order": [], "plugins.result_events.order": [], "plugins.runnable.runner.order": [], "plugins.runner.order": [], "plugins.skip_broken_plugin_notification": [], "plugins.spawner.order": [], "plugins.varianter.order": [], "plugins.vtjoblock.dir": "/tmp", "run.dict_variants": [], "run.execution_order": "variants-per-test", "run.references": ["selftests/unit/", "selftests/jobs/", "selftests/functional/", "optional_plugins/golang/tests/src", "optional_plugins/golang/tests/test_golang.py", "optional_plugins/golang/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_basic.py", "optional_plugins/varianter_cit/tests/test_combinationRow.py", "optional_plugins/varianter_cit/tests/test_combinationMatrix.py", "optional_plugins/varianter_cit/tests/test_solver.py", "optional_plugins/varianter_cit/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_cit.py", "optional_plugins/html/tests/test_html_result.py", "optional_plugins/html/tests/__init__.py", "optional_plugins/varianter_yaml_to_mux/tests/test_unit.py", "optional_plugins/varianter_yaml_to_mux/tests/test_functional.py", "optional_plugins/varianter_yaml_to_mux/tests/__init__.py"], "runner.exectest.exitcodes.skip": [], "runner.output.color": "always", "runner.output.colored": true, "runner.output.utf8": true, "runner.timeout.after_interrupted": 60, "runner.timeout.process_alive": 60, "runner.timeout.process_died": 10, 
"simpletests.status.failure_fields": ["status", "stdout", "stderr"], "simpletests.status.skip_location": "all", "simpletests.status.skip_regex": "^SKIP$", "simpletests.status.warn_location": "all", "simpletests.status.warn_regex": "^WARN$", "spawner.podman.bin": "/usr/bin/podman", "spawner.podman.image": "fedora:33", "sysinfo.collect.commands_timeout": -1, "sysinfo.collect.enabled": true, "sysinfo.collect.installed_packages": false, "sysinfo.collect.locale": "C", "sysinfo.collect.optimize": false, "sysinfo.collect.per_test": false, "sysinfo.collect.profiler": false, "sysinfo.collectibles.commands": "/etc/avocado/sysinfo/commands", "sysinfo.collectibles.fail_commands": "/root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/lib/python3.9/site-packages/avocado/etc/avocado/sysinfo/fail_commands", "sysinfo.collectibles.fail_files": "/root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/lib/python3.9/site-packages/avocado/etc/avocado/sysinfo/fail_files", "sysinfo.collectibles.files": "/etc/avocado/sysinfo/files", "sysinfo.collectibles.profilers": "/etc/avocado/sysinfo/profilers", "task.timeout.running": null, "vt.common.arch": null, "vt.common.data_dir": "", "vt.common.machine_type": "i440fx", "vt.common.mem": null, "vt.common.netdst": "virbr0", "vt.common.nettype": "bridge", "vt.common.tmp_dir": "/tmp", "vt.common.type_specific_only": false, "vt.config": null, "vt.debug.no_cleanup": false, "vt.extra_params": [], "vt.filter.default_filters": null, "vt.guest_os": "JeOS.27", "vt.libvirt.connect_uri": null, "vt.list_archs": false, "vt.list_guests": false, "vt.no_filter": "", "vt.only_filter": "", "vt.qemu.accel": "kvm", "vt.qemu.defconfig": "yes", "vt.qemu.disk_bus": "virtio_blk", "vt.qemu.image_type": "qcow2", "vt.qemu.malloc_perturb": "yes", "vt.qemu.monitor": "", "vt.qemu.nic_model": "virtio_net", "vt.qemu.qemu_bin": null, "vt.qemu.qemu_dst_bin": "/usr/bin/qemu-kvm", "vt.qemu.sandbox": "on", "vt.qemu.smp": "2", "vt.qemu.vhost": "off", "vt.save_config": null, "vt.setup.backup_image_before_test": true, "vt.setup.keep_guest_running": false, "vt.setup.restore_image_after_test": true, "vt.short_names_when_config": false, "vt.type": "qemu", "yaml_to_mux.files": [], "yaml_to_mux.filter_only": [], "yaml_to_mux.filter_out": [], "yaml_to_mux.inject": [], "yaml_to_mux.parameter_paths": ["/run/*"], "run.test_runner": "nrunner", "run.ignore_missing_references": true} tags=json:{} variant=json:{"paths": ["/run/*"], "variant_id": null, "variant": [["/", []]]} -s /mnt/local/results/job-2021-10-12T20.53-6ad308a/.status_server.sock
  28899 pts/4    Z+     0:00  |                   |                   \_ [avocado-runner-] <defunct>
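For reference, a minimal sketch (Linux-only; the PIDs are taken from the ps listing above) of how the defunct child and its parent can be inspected through /proc. A process stays <defunct> until its parent calls wait() on it, so the state of the parent is the interesting part:

import pathlib

def proc_status(pid):
    """Return a few fields from /proc/<pid>/status."""
    text = pathlib.Path(f"/proc/{pid}/status").read_text()
    fields = dict(line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key: fields[key].strip() for key in ("Name", "State", "PPid") if key in fields}

# 28899 is the zombie runner, 28897 its parent (the task-run process); the
# zombie will not go away until the parent reaps it, so what matters is what
# the parent is (not) doing.
for pid in (28899, 28897):
    print(pid, proc_status(pid))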

Looking into the debug.log shows nothing since it seems to be empty, while the job.log shows a normal start:

2021-10-12 21:07:13,833 testlogs         L0094 INFO | selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: STARTED
2021-10-12 21:07:14,497 testlogs         L0101 INFO | selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: PASS
2021-10-12 21:07:14,497 testlogs         L0103 INFO | More information in /mnt/local/results/job-2021-10-12T20.53-6ad308a/test-results/check-675-selftests_functional_test_argument_parsing.py_ArgumentParsingErrorEarlyTest.test_whacky_option
2021-10-12 21:07:15,264 testlogs         L0094 INFO | selftests/functional/test_basic.py:RunnerOperationTest.test_show_version: STARTED

and looking into the avocado.core.DEBUG also shows nothing of interest:

21:07:13 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option": spawned successfully
21:07:13 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" started message: "{'status': 'started', 'time': 3639618.380414009, 'output_dir': '/tmp/.avocado-task-0_dzm10g', 'id': 'check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option'}"
21:07:14 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" finished message: "{'status': 'finished', 'result': 'pass', 'time': 3639619.039230097}"
21:07:14 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" finished
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": requested -> triaging
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": requirements OK (will proceed to check dependencies)
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": about to be spawned with "<avocado.plugins.spawners.process.ProcessSpawner object at 0x7f15f43709a0>"
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": spawned successfully
21:07:15 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version" started message: "{'status': 'started', 'time': 3639619.783357598, 'output_dir': '/tmp/.avocado-task-hpgtrro_', 'id': 'check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version'}"

Describe the solution you'd like It would be great if we had a better source of debugging information, as all the sources above give no clue about why the test went defunct.
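As an illustration only (a sketch of an idea, not an existing avocado option; the log path and timeout are made up): if the runner process enabled Python's faulthandler plus a periodic traceback dump, a crash or silent hang would at least leave a stack trace behind:

import faulthandler

# Keep the file object alive for the whole process lifetime, since faulthandler
# writes directly to its file descriptor.
trace_log = open("/tmp/avocado-runner-faulthandler.log", "w")

# Dump all thread stacks on fatal signals (SIGSEGV, SIGABRT, SIGFPE, ...).
faulthandler.enable(file=trace_log, all_threads=True)

# Also dump the stacks every 10 minutes as a watchdog against silent hangs.
faulthandler.dump_traceback_later(600, repeat=True, file=trace_log)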

Describe alternatives you've considered All typical sources of debugging information above.

Additional information I can post additional findings below in case I find any useful direction.

pevogam commented 2 years ago

Replying to @clebergnu from the indirectly related #5033:

  • Trying to understand if this is test related, or test runner related. It should be possible to run everything with all the tests with the legacy runner to get a sense of what's the most probable culprit here.

The way I see it, the avocado-runner-python-unittest subprocess crashes while the test goes on, but its result is never collected (because the master runner process seems to be missing too):

# ps afx | grep avocado
  36130 pts/4    S+     0:00  |                   \_ /usr/bin/python3 /root/.local/bin/avocado-runner-python-unittest task-run -i check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -j 78605ba55147a1a2ba8d29563979fd4a3f0fd6eb -k python-unittest -u selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -c {"core.input_encoding": "utf-8", "core.show": {"__encoded_set__": ["app"]}, "datadir.paths.base_dir": "/root/avocado", "datadir.paths.cache_dirs": ["/root/avocado/data/cache"], "datadir.paths.data_dir": "/root/avocado/data", "datadir.paths.logs_dir": "/mnt/local/results", "datadir.paths.test_dir": "/usr/share/doc/avocado/tests", "human_ui.omit.statuses": [], "job.output.loglevel": "DEBUG", "job.output.testlogs.logfiles": ["debug.log"], "job.output.testlogs.statuses": [], "job.run.result.html.enabled": true, "job.run.result.html.open_browser": false, "job.run.result.html.output": null, "job.run.result.json.enabled": true, "job.run.result.json.output": null, "job.run.result.tap.enabled": true, "job.run.result.tap.include_logs": false, "job.run.result.tap.output": null, "job.run.result.xunit.enabled": true, "job.run.result.xunit.job_name": null, "job.run.result.xunit.max_test_log_chars": 100000, "job.run.result.xunit.output": null, "job.run.store_logging_stream": ["avocado.core:DEBUG"], "job.run.timeout": 0, "json.variants.load": null, "nrunner.max_parallel_tasks": 1, "nrunner.shuffle": false, "nrunner.spawner": "process", "nrunner.status_server_auto": true, "nrunner.status_server_buffer_size": 33554432, "nrunner.status_server_listen": "127.0.0.1:8888", "nrunner.status_server_uri": "127.0.0.1:8888", "plugins.cli.cmd.order": [], "plugins.cli.order": [], "plugins.disable": [], "plugins.init.order": [], "plugins.job.prepost.order": [], "plugins.jobscripts.post": "/etc/avocado/scripts/job/post.d/", "plugins.jobscripts.pre": "/etc/avocado/scripts/job/pre.d/", "plugins.jobscripts.warn_non_existing_dir": false, "plugins.jobscripts.warn_non_zero_status": true, "plugins.resolver.order": [], "plugins.result.order": [], "plugins.result_events.order": [], "plugins.runnable.runner.order": [], "plugins.runner.order": [], "plugins.skip_broken_plugin_notification": [], "plugins.spawner.order": [], "plugins.varianter.order": [], "plugins.vtjoblock.dir": "/tmp", "run.dict_variants": [], "run.execution_order": "variants-per-test", "run.references": ["selftests/unit/", "selftests/jobs/", "selftests/functional/", "selftests/inspekt-style.sh", "selftests/lint.sh", "selftests/inspekt-indent.sh", "selftests/spell.sh", "selftests/signedoff-check.sh", "selftests/safeloader.sh", "selftests/isort.sh", "selftests/modules-boundaries.sh", "optional_plugins/golang/tests/src", "optional_plugins/golang/tests/test_golang.py", "optional_plugins/golang/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_basic.py", "optional_plugins/varianter_cit/tests/test_combinationRow.py", "optional_plugins/varianter_cit/tests/test_combinationMatrix.py", "optional_plugins/varianter_cit/tests/test_solver.py", "optional_plugins/varianter_cit/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_cit.py", "optional_plugins/robot/tests/test_robot.py", "optional_plugins/robot/tests/avocado.robot", "optional_plugins/robot/tests/__init__.py", "optional_plugins/html/tests/test_html_result.py", "optional_plugins/html/tests/__init__.py", "optional_plugins/varianter_yaml_to_mux/tests/test_unit.py", "optional_plugins/varianter_yaml_to_mux/tests/test_functional.py", 
"optional_plugins/varianter_yaml_to_mux/tests/__init__.py"], "runner.exectest.exitcodes.skip": [], "runner.output.color": "always", "runner.output.colored": true, "runner.output.utf8": true, "runner.timeout.after_interrupted": 60, "runner.timeout.process_alive": 60, "runner.timeout.process_died": 10, "simpletests.status.failure_fields": ["status", "stdout", "stderr"], "simpletests.status.skip_location": "all", "simpletests.status.skip_regex": "^SKIP$", "simpletests.status.warn_location": "all", "simpletests.status.warn_regex": "^WARN$", "spawner.podman.bin": "/usr/bin/podman", "spawner.podman.image": "fedora:33", "sysinfo.collect.commands_timeout": -1, "sysinfo.collect.enabled": true, "sysinfo.collect.installed_packages": false, "sysinfo.collect.locale": "C", "sysinfo.collect.optimize": false, "sysinfo.collect.per_test": false, "sysinfo.collect.profiler": false, "sysinfo.collectibles.commands": "/etc/avocado/sysinfo/commands", "sysinfo.collectibles.fail_commands": "/root/rpmbuild/QA/BUILD/avocado-91.0/avocado/etc/avocado/sysinfo/fail_commands", "sysinfo.collectibles.fail_files": "/root/rpmbuild/QA/BUILD/avocado-91.0/avocado/etc/avocado/sysinfo/fail_files", "sysinfo.collectibles.files": "/etc/avocado/sysinfo/files", "sysinfo.collectibles.profilers": "/etc/avocado/sysinfo/profilers", "task.timeout.running": null, "vt.common.arch": null, "vt.common.data_dir": "", "vt.common.machine_type": "i440fx", "vt.common.mem": null, "vt.common.netdst": "virbr0", "vt.common.nettype": "bridge", "vt.common.tmp_dir": "/tmp", "vt.common.type_specific_only": false, "vt.config": null, "vt.debug.no_cleanup": false, "vt.extra_params": [], "vt.filter.default_filters": null, "vt.guest_os": "JeOS.27", "vt.libvirt.connect_uri": null, "vt.list_archs": false, "vt.list_guests": false, "vt.no_filter": "", "vt.only_filter": "", "vt.qemu.accel": "kvm", "vt.qemu.defconfig": "yes", "vt.qemu.disk_bus": "virtio_blk", "vt.qemu.image_type": "qcow2", "vt.qemu.malloc_perturb": "yes", "vt.qemu.monitor": "", "vt.qemu.nic_model": "virtio_net", "vt.qemu.qemu_bin": null, "vt.qemu.qemu_dst_bin": "/usr/bin/qemu-kvm", "vt.qemu.sandbox": "on", "vt.qemu.smp": "2", "vt.qemu.vhost": "off", "vt.save_config": null, "vt.setup.backup_image_before_test": true, "vt.setup.keep_guest_running": false, "vt.setup.restore_image_after_test": true, "vt.short_names_when_config": false, "vt.type": "qemu", "yaml_to_mux.files": [], "yaml_to_mux.filter_only": [], "yaml_to_mux.filter_out": [], "yaml_to_mux.inject": [], "yaml_to_mux.parameter_paths": ["/run/*"], "run.test_runner": "nrunner", "run.ignore_missing_references": true} tags=json:{} variant=json:{"paths": ["/run/*"], "variant_id": null, "variant": [["/", []]]} -s /mnt/local/results/job-2021-10-15T15.56-78605ba/.status_server.sock
  36132 pts/4    Z+     0:00  |                       \_ [avocado-runner-] <defunct>
  • Trying to replicate your build environment. I remember you did post somewhere your "recipe" (LXC container, exact commands), but I couldn't find it easily. If you could please post it here again, I'll give that a try.

I can't recall either. If you simply manage to create an LXC container like

$ lxc-create -t download -n c1 -- --dist fedora --release 33 --arch amd64 \
>     --keyserver hkp://keys.openpgp.org:80 \

and provide networking, it should be reproducible by just cloning the official avocado repo from within it and running the unit tests. I am not sure if this is helpful, but if you cannot reproduce it that way or it proves too hard, let me know.

There is also a third option, which is why this issue is filed as a feature request: try to provide more information about what happened to the processes. For instance, instead of taking the steps you would usually take to figure this out in a reproduced environment, you could mention some of those steps here and I could try them out; then we could think about which of them could be integrated for better debugging verbosity in such situations.
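To make that concrete, one piece of information that could be logged (a sketch with illustrative names, not actual avocado APIs) is the wait status of the child once the parent reaps it, i.e. whether the runner exited on its own or was killed by a signal, which is exactly what is missing from the logs above:

import os
import signal

def describe_child_exit(pid):
    """Reap a child process and describe how it ended (must run in the parent)."""
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return f"child {pid} was killed by {signal.Signals(os.WTERMSIG(status)).name}"
    if os.WIFEXITED(status):
        return f"child {pid} exited with code {os.WEXITSTATUS(status)}"
    return f"child {pid} reported raw wait status {status}"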

pevogam commented 2 years ago

I retried this on avocado master today with a really minimalistic container (as few dependencies as possible) and I still get the error. The minimal setup steps are:

git clone https://github.com/avocado-framework/avocado.git
cd avocado/
dnf install pip
pip install -e .
export PATH=/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/bin:/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/libexec/avocado:/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
export PYTHONPATH=/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/lib/python3.9/site-packages:.
export LANG=en_US.UTF-8
export AVOCADO_CHECK_LEVEL=0
/usr/bin/python3 selftests/check.py --skip static-checks --disable-plugin-checks robot

I will check for more details on the setup of the LXC container, but when setting up bare-bone LXC the way we do (with our own automation layer on top), one has to take care of the container networking manually. This is in contrast to LXD.