avocado-framework / avocado

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.
https://avocado-framework.github.io/

Avocado runner went <defunct> without sufficient debugging information while running the same unit test #5031

Open pevogam opened 2 years ago

pevogam commented 2 years ago

Is your feature request related to a problem? Please describe. While running Avocado's self tests (and trying to incorporate them into our own CI), the run froze at a test (stuck for >10 hours now, and technically indefinitely):

 (check-675/1048) selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: STARTED
 (check-675/1048) selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: PASS (0.66 s)
 (check-676/1048) selftests/functional/test_basic.py:RunnerOperationTest.test_show_version: STARTED

Investigating the test process shows that it seems to have gone defunct:

  28897 pts/4    S+     4:13  |                   |               \_ /usr/bin/python3 /root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/bin/avocado-runner-python-unittest task-run -i check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -j 6ad308aa80ec4cbdfa79d964a582d482752a2c4d -k python-unittest -u selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -c {"core.input_encoding": "utf-8", "core.show": {"__encoded_set__": ["app"]}, "datadir.paths.base_dir": "/root/avocado", "datadir.paths.cache_dirs": ["/root/avocado/data/cache"], "datadir.paths.data_dir": "/root/avocado/data", "datadir.paths.logs_dir": "/mnt/local/results", "datadir.paths.test_dir": "/usr/share/doc/avocado/tests", "human_ui.omit.statuses": [], "job.output.loglevel": "DEBUG", "job.output.testlogs.logfiles": ["debug.log"], "job.output.testlogs.statuses": ["FAIL"], "job.run.result.html.enabled": true, "job.run.result.html.open_browser": false, "job.run.result.html.output": null, "job.run.result.json.enabled": true, "job.run.result.json.output": null, "job.run.result.tap.enabled": true, "job.run.result.tap.include_logs": false, "job.run.result.tap.output": null, "job.run.result.xunit.enabled": true, "job.run.result.xunit.job_name": null, "job.run.result.xunit.max_test_log_chars": 100000, "job.run.result.xunit.output": null, "job.run.store_logging_stream": ["avocado.core:DEBUG"], "job.run.timeout": 0, "json.variants.load": null, "nrunner.max_parallel_tasks": 1, "nrunner.shuffle": false, "nrunner.spawner": "process", "nrunner.status_server_auto": true, "nrunner.status_server_buffer_size": 33554432, "nrunner.status_server_listen": "127.0.0.1:8888", "nrunner.status_server_uri": "127.0.0.1:8888", "plugins.cli.cmd.order": [], "plugins.cli.order": [], "plugins.disable": [], "plugins.init.order": [], "plugins.job.prepost.order": [], "plugins.jobscripts.post": "/etc/avocado/scripts/job/post.d/", "plugins.jobscripts.pre": "/etc/avocado/scripts/job/pre.d/", "plugins.jobscripts.warn_non_existing_dir": false, "plugins.jobscripts.warn_non_zero_status": true, "plugins.resolver.order": [], "plugins.result.order": [], "plugins.result_events.order": [], "plugins.runnable.runner.order": [], "plugins.runner.order": [], "plugins.skip_broken_plugin_notification": [], "plugins.spawner.order": [], "plugins.varianter.order": [], "plugins.vtjoblock.dir": "/tmp", "run.dict_variants": [], "run.execution_order": "variants-per-test", "run.references": ["selftests/unit/", "selftests/jobs/", "selftests/functional/", "optional_plugins/golang/tests/src", "optional_plugins/golang/tests/test_golang.py", "optional_plugins/golang/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_basic.py", "optional_plugins/varianter_cit/tests/test_combinationRow.py", "optional_plugins/varianter_cit/tests/test_combinationMatrix.py", "optional_plugins/varianter_cit/tests/test_solver.py", "optional_plugins/varianter_cit/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_cit.py", "optional_plugins/html/tests/test_html_result.py", "optional_plugins/html/tests/__init__.py", "optional_plugins/varianter_yaml_to_mux/tests/test_unit.py", "optional_plugins/varianter_yaml_to_mux/tests/test_functional.py", "optional_plugins/varianter_yaml_to_mux/tests/__init__.py"], "runner.exectest.exitcodes.skip": [], "runner.output.color": "always", "runner.output.colored": true, "runner.output.utf8": true, "runner.timeout.after_interrupted": 60, "runner.timeout.process_alive": 60, "runner.timeout.process_died": 10, 
"simpletests.status.failure_fields": ["status", "stdout", "stderr"], "simpletests.status.skip_location": "all", "simpletests.status.skip_regex": "^SKIP$", "simpletests.status.warn_location": "all", "simpletests.status.warn_regex": "^WARN$", "spawner.podman.bin": "/usr/bin/podman", "spawner.podman.image": "fedora:33", "sysinfo.collect.commands_timeout": -1, "sysinfo.collect.enabled": true, "sysinfo.collect.installed_packages": false, "sysinfo.collect.locale": "C", "sysinfo.collect.optimize": false, "sysinfo.collect.per_test": false, "sysinfo.collect.profiler": false, "sysinfo.collectibles.commands": "/etc/avocado/sysinfo/commands", "sysinfo.collectibles.fail_commands": "/root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/lib/python3.9/site-packages/avocado/etc/avocado/sysinfo/fail_commands", "sysinfo.collectibles.fail_files": "/root/rpmbuild/QA/BUILDROOT/python-avocado-91.0-1.fc33.x86_64/usr/lib/python3.9/site-packages/avocado/etc/avocado/sysinfo/fail_files", "sysinfo.collectibles.files": "/etc/avocado/sysinfo/files", "sysinfo.collectibles.profilers": "/etc/avocado/sysinfo/profilers", "task.timeout.running": null, "vt.common.arch": null, "vt.common.data_dir": "", "vt.common.machine_type": "i440fx", "vt.common.mem": null, "vt.common.netdst": "virbr0", "vt.common.nettype": "bridge", "vt.common.tmp_dir": "/tmp", "vt.common.type_specific_only": false, "vt.config": null, "vt.debug.no_cleanup": false, "vt.extra_params": [], "vt.filter.default_filters": null, "vt.guest_os": "JeOS.27", "vt.libvirt.connect_uri": null, "vt.list_archs": false, "vt.list_guests": false, "vt.no_filter": "", "vt.only_filter": "", "vt.qemu.accel": "kvm", "vt.qemu.defconfig": "yes", "vt.qemu.disk_bus": "virtio_blk", "vt.qemu.image_type": "qcow2", "vt.qemu.malloc_perturb": "yes", "vt.qemu.monitor": "", "vt.qemu.nic_model": "virtio_net", "vt.qemu.qemu_bin": null, "vt.qemu.qemu_dst_bin": "/usr/bin/qemu-kvm", "vt.qemu.sandbox": "on", "vt.qemu.smp": "2", "vt.qemu.vhost": "off", "vt.save_config": null, "vt.setup.backup_image_before_test": true, "vt.setup.keep_guest_running": false, "vt.setup.restore_image_after_test": true, "vt.short_names_when_config": false, "vt.type": "qemu", "yaml_to_mux.files": [], "yaml_to_mux.filter_only": [], "yaml_to_mux.filter_out": [], "yaml_to_mux.inject": [], "yaml_to_mux.parameter_paths": ["/run/*"], "run.test_runner": "nrunner", "run.ignore_missing_references": true} tags=json:{} variant=json:{"paths": ["/run/*"], "variant_id": null, "variant": [["/", []]]} -s /mnt/local/results/job-2021-10-12T20.53-6ad308a/.status_server.sock
  28899 pts/4    Z+     0:00  |                   |                   \_ [avocado-runner-] <defunct>
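For reference, a minimal sketch (Linux-only; the PIDs are taken from the ps listing above) of how the defunct child and its parent can be inspected through /proc. A process stays <defunct> until its parent calls wait() on it, so the state of the parent is the interesting part:

import pathlib

def proc_status(pid):
    """Return a few fields from /proc/<pid>/status."""
    text = pathlib.Path(f"/proc/{pid}/status").read_text()
    fields = dict(line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key: fields[key].strip() for key in ("Name", "State", "PPid") if key in fields}

# 28899 is the zombie runner, 28897 its parent (the task-run process); the
# zombie will not go away until the parent reaps it, so what matters is what
# the parent is (not) doing.
for pid in (28899, 28897):
    print(pid, proc_status(pid))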

Looking into the debug.log shows nothing since it seems to be empty, while the job.log shows a normal start:

2021-10-12 21:07:13,833 testlogs         L0094 INFO | selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: STARTED
2021-10-12 21:07:14,497 testlogs         L0101 INFO | selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option: PASS
2021-10-12 21:07:14,497 testlogs         L0103 INFO | More information in /mnt/local/results/job-2021-10-12T20.53-6ad308a/test-results/check-675-selftests_functional_test_argument_parsing.py_ArgumentParsingErrorEarlyTest.test_whacky_option
2021-10-12 21:07:15,264 testlogs         L0094 INFO | selftests/functional/test_basic.py:RunnerOperationTest.test_show_version: STARTED

and looking into the avocado.core.DEBUG also shows nothing of interest:

21:07:13 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option": spawned successfully
21:07:13 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" started message: "{'status': 'started', 'time': 3639618.380414009, 'output_dir': '/tmp/.avocado-task-0_dzm10g', 'id': 'check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option'}"
21:07:14 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" finished message: "{'status': 'finished', 'result': 'pass', 'time': 3639619.039230097}"
21:07:14 DEBUG| Task "check-675-selftests/functional/test_argument_parsing.py:ArgumentParsingErrorEarlyTest.test_whacky_option" finished
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": requested -> triaging
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": requirements OK (will proceed to check dependencies)
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": about to be spawned with "<avocado.plugins.spawners.process.ProcessSpawner object at 0x7f15f43709a0>"
21:07:14 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version": spawned successfully
21:07:15 DEBUG| Task "check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version" started message: "{'status': 'started', 'time': 3639619.783357598, 'output_dir': '/tmp/.avocado-task-hpgtrro_', 'id': 'check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version'}"

Describe the solution you'd like It would be great if we had a better source of debugging information, as all the sources above give no clue about why the test went defunct.
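As an illustration only (a sketch of an idea, not an existing avocado option; the log path and timeout are made up): if the runner process enabled Python's faulthandler plus a periodic traceback dump, a crash or silent hang would at least leave a stack trace behind:

import faulthandler

# Keep the file object alive for the whole process lifetime, since faulthandler
# writes directly to its file descriptor.
trace_log = open("/tmp/avocado-runner-faulthandler.log", "w")

# Dump all thread stacks on fatal signals (SIGSEGV, SIGABRT, SIGFPE, ...).
faulthandler.enable(file=trace_log, all_threads=True)

# Also dump the stacks every 10 minutes as a watchdog against silent hangs.
faulthandler.dump_traceback_later(600, repeat=True, file=trace_log)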

Describe alternatives you've considered All typical sources of debugging information above.

Additional information I can post additional findings below in case I find any useful direction.

pevogam commented 2 years ago

Replying to @clebergnu from the indirectly related #5033:

  • Trying to understand if this is test related, or test runner related. It should be possible to run everything with all the tests with the legacy runner to get a sense of what's the most probable culprit here.

The way I see it, the avocado-runner-python-unittest subprocess crashes while the test goes on, but its result is never collected (because the master runner process seems to be missing too):

# ps afx | grep avocado
  36130 pts/4    S+     0:00  |                   \_ /usr/bin/python3 /root/.local/bin/avocado-runner-python-unittest task-run -i check-676-selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -j 78605ba55147a1a2ba8d29563979fd4a3f0fd6eb -k python-unittest -u selftests/functional/test_basic.py:RunnerOperationTest.test_show_version -c {"core.input_encoding": "utf-8", "core.show": {"__encoded_set__": ["app"]}, "datadir.paths.base_dir": "/root/avocado", "datadir.paths.cache_dirs": ["/root/avocado/data/cache"], "datadir.paths.data_dir": "/root/avocado/data", "datadir.paths.logs_dir": "/mnt/local/results", "datadir.paths.test_dir": "/usr/share/doc/avocado/tests", "human_ui.omit.statuses": [], "job.output.loglevel": "DEBUG", "job.output.testlogs.logfiles": ["debug.log"], "job.output.testlogs.statuses": [], "job.run.result.html.enabled": true, "job.run.result.html.open_browser": false, "job.run.result.html.output": null, "job.run.result.json.enabled": true, "job.run.result.json.output": null, "job.run.result.tap.enabled": true, "job.run.result.tap.include_logs": false, "job.run.result.tap.output": null, "job.run.result.xunit.enabled": true, "job.run.result.xunit.job_name": null, "job.run.result.xunit.max_test_log_chars": 100000, "job.run.result.xunit.output": null, "job.run.store_logging_stream": ["avocado.core:DEBUG"], "job.run.timeout": 0, "json.variants.load": null, "nrunner.max_parallel_tasks": 1, "nrunner.shuffle": false, "nrunner.spawner": "process", "nrunner.status_server_auto": true, "nrunner.status_server_buffer_size": 33554432, "nrunner.status_server_listen": "127.0.0.1:8888", "nrunner.status_server_uri": "127.0.0.1:8888", "plugins.cli.cmd.order": [], "plugins.cli.order": [], "plugins.disable": [], "plugins.init.order": [], "plugins.job.prepost.order": [], "plugins.jobscripts.post": "/etc/avocado/scripts/job/post.d/", "plugins.jobscripts.pre": "/etc/avocado/scripts/job/pre.d/", "plugins.jobscripts.warn_non_existing_dir": false, "plugins.jobscripts.warn_non_zero_status": true, "plugins.resolver.order": [], "plugins.result.order": [], "plugins.result_events.order": [], "plugins.runnable.runner.order": [], "plugins.runner.order": [], "plugins.skip_broken_plugin_notification": [], "plugins.spawner.order": [], "plugins.varianter.order": [], "plugins.vtjoblock.dir": "/tmp", "run.dict_variants": [], "run.execution_order": "variants-per-test", "run.references": ["selftests/unit/", "selftests/jobs/", "selftests/functional/", "selftests/inspekt-style.sh", "selftests/lint.sh", "selftests/inspekt-indent.sh", "selftests/spell.sh", "selftests/signedoff-check.sh", "selftests/safeloader.sh", "selftests/isort.sh", "selftests/modules-boundaries.sh", "optional_plugins/golang/tests/src", "optional_plugins/golang/tests/test_golang.py", "optional_plugins/golang/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_basic.py", "optional_plugins/varianter_cit/tests/test_combinationRow.py", "optional_plugins/varianter_cit/tests/test_combinationMatrix.py", "optional_plugins/varianter_cit/tests/test_solver.py", "optional_plugins/varianter_cit/tests/__init__.py", "optional_plugins/varianter_cit/tests/test_cit.py", "optional_plugins/robot/tests/test_robot.py", "optional_plugins/robot/tests/avocado.robot", "optional_plugins/robot/tests/__init__.py", "optional_plugins/html/tests/test_html_result.py", "optional_plugins/html/tests/__init__.py", "optional_plugins/varianter_yaml_to_mux/tests/test_unit.py", "optional_plugins/varianter_yaml_to_mux/tests/test_functional.py", 
"optional_plugins/varianter_yaml_to_mux/tests/__init__.py"], "runner.exectest.exitcodes.skip": [], "runner.output.color": "always", "runner.output.colored": true, "runner.output.utf8": true, "runner.timeout.after_interrupted": 60, "runner.timeout.process_alive": 60, "runner.timeout.process_died": 10, "simpletests.status.failure_fields": ["status", "stdout", "stderr"], "simpletests.status.skip_location": "all", "simpletests.status.skip_regex": "^SKIP$", "simpletests.status.warn_location": "all", "simpletests.status.warn_regex": "^WARN$", "spawner.podman.bin": "/usr/bin/podman", "spawner.podman.image": "fedora:33", "sysinfo.collect.commands_timeout": -1, "sysinfo.collect.enabled": true, "sysinfo.collect.installed_packages": false, "sysinfo.collect.locale": "C", "sysinfo.collect.optimize": false, "sysinfo.collect.per_test": false, "sysinfo.collect.profiler": false, "sysinfo.collectibles.commands": "/etc/avocado/sysinfo/commands", "sysinfo.collectibles.fail_commands": "/root/rpmbuild/QA/BUILD/avocado-91.0/avocado/etc/avocado/sysinfo/fail_commands", "sysinfo.collectibles.fail_files": "/root/rpmbuild/QA/BUILD/avocado-91.0/avocado/etc/avocado/sysinfo/fail_files", "sysinfo.collectibles.files": "/etc/avocado/sysinfo/files", "sysinfo.collectibles.profilers": "/etc/avocado/sysinfo/profilers", "task.timeout.running": null, "vt.common.arch": null, "vt.common.data_dir": "", "vt.common.machine_type": "i440fx", "vt.common.mem": null, "vt.common.netdst": "virbr0", "vt.common.nettype": "bridge", "vt.common.tmp_dir": "/tmp", "vt.common.type_specific_only": false, "vt.config": null, "vt.debug.no_cleanup": false, "vt.extra_params": [], "vt.filter.default_filters": null, "vt.guest_os": "JeOS.27", "vt.libvirt.connect_uri": null, "vt.list_archs": false, "vt.list_guests": false, "vt.no_filter": "", "vt.only_filter": "", "vt.qemu.accel": "kvm", "vt.qemu.defconfig": "yes", "vt.qemu.disk_bus": "virtio_blk", "vt.qemu.image_type": "qcow2", "vt.qemu.malloc_perturb": "yes", "vt.qemu.monitor": "", "vt.qemu.nic_model": "virtio_net", "vt.qemu.qemu_bin": null, "vt.qemu.qemu_dst_bin": "/usr/bin/qemu-kvm", "vt.qemu.sandbox": "on", "vt.qemu.smp": "2", "vt.qemu.vhost": "off", "vt.save_config": null, "vt.setup.backup_image_before_test": true, "vt.setup.keep_guest_running": false, "vt.setup.restore_image_after_test": true, "vt.short_names_when_config": false, "vt.type": "qemu", "yaml_to_mux.files": [], "yaml_to_mux.filter_only": [], "yaml_to_mux.filter_out": [], "yaml_to_mux.inject": [], "yaml_to_mux.parameter_paths": ["/run/*"], "run.test_runner": "nrunner", "run.ignore_missing_references": true} tags=json:{} variant=json:{"paths": ["/run/*"], "variant_id": null, "variant": [["/", []]]} -s /mnt/local/results/job-2021-10-15T15.56-78605ba/.status_server.sock
  36132 pts/4    Z+     0:00  |                       \_ [avocado-runner-] <defunct>
  • Trying to replicate your build environment. I remember you did post somewhere your "recipe" (LXC container, exact commands), but I couldn't find it easily. If you could please post it here again, I'll give that a try.

I can't recall either. If you simply manage to create an LXC container like

$ lxc-create -t download -n c1 -- --dist fedora --release 33 --arch amd64 \
>     --keyserver hkp://keys.openpgp.org:80 \

and provide networking, it should be reproducible by just cloning the official avocado repo from within it and running the unit tests. I am not sure if this is helpful, but if you cannot reproduce it that way or it proves too hard, let me know.

There is also a third option, which is why this issue is filed as a feature request: try to provide more information about what happened to the processes. For instance, instead of taking the steps you would usually take to figure this out in a reproduced environment, you could mention some of those steps here and I could try them out; then we could think about which of them could be integrated for better debugging verbosity in such situations.
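To make that concrete, one piece of information that could be logged (a sketch with illustrative names, not actual avocado APIs) is the wait status of the child once the parent reaps it, i.e. whether the runner exited on its own or was killed by a signal, which is exactly what is missing from the logs above:

import os
import signal

def describe_child_exit(pid):
    """Reap a child process and describe how it ended (must run in the parent)."""
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return f"child {pid} was killed by {signal.Signals(os.WTERMSIG(status)).name}"
    if os.WIFEXITED(status):
        return f"child {pid} exited with code {os.WEXITSTATUS(status)}"
    return f"child {pid} reported raw wait status {status}"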

pevogam commented 2 years ago

I retried this on avocado master today with a really minimalistic container (as few dependencies as possible) and I still get the error. The minimal setup steps are:

git clone https://github.com/avocado-framework/avocado.git
cd avocado/
dnf install pip
pip install -e .
export PATH=/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/bin:/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/libexec/avocado:/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
export PYTHONPATH=/root/rpmbuild/QA/BUILDROOT/python-avocado-93.0-2.fc33.x86_64/usr/lib/python3.9/site-packages:.
export LANG=en_US.UTF-8
export AVOCADO_CHECK_LEVEL=0
/usr/bin/python3 selftests/check.py --skip static-checks --disable-plugin-checks robot

I will check for more details on the setup of the LXC container, but when setting up bare-bone LXC the way we do (with our own automation layer on top), one has to take care of the container networking manually. This is in contrast to LXD.