multivm-stress:Update script to test all edgecases

misanjumn commented 2 months ago

multivm-stress:Update script to test all edgecases

This patch captures multiple edge cases to test multivm scenarios. The following updates are added:

- add stress_time parameter to run the stress test for n seconds before starting stress_events
- add debug_dir parameter to save the debug files
- add dump_options parameter to specify the virsh dump type
- update the guest on_crash value to preserve in case of a crash
- add function check_call_traces to check for any call trace in dmesg
- during stress, check the guest state and call traces every ten minutes
- if any vms have crashed, dump them to debug_dir for further analysis
- run stress_events on the remaining stable vms if any are present, else skip
- check for error messages and fail the test if any are found
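
The periodic check described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `monitor_stress` and the hooks `is_crashed`, `has_call_trace`, `dump_vm` and `sleep` are hypothetical names, and the 600-second default simply mirrors the "every ten minutes" wording.

```python
import time


def monitor_stress(vms, stress_time, check_interval=600,
                   is_crashed=lambda vm: False,
                   has_call_trace=lambda vm: False,
                   dump_vm=lambda vm: None,
                   sleep=time.sleep):
    """Check guests periodically while stress runs for `stress_time` seconds.

    All hooks are hypothetical placeholders for the real libvirt/avocado calls.
    Returns a tuple (stable_vms, failed_vms).
    """
    failed_vms = []
    remaining = stress_time
    while remaining > 0:
        # Wait for the next check, never past the end of the stress window.
        step = min(check_interval, remaining)
        sleep(step)
        remaining -= step
        for vm in vms:
            if vm in failed_vms:
                continue  # already marked as failed, skip further checks
            if is_crashed(vm) or has_call_trace(vm):
                failed_vms.append(vm)
    # Dump every failed guest once, for later analysis.
    for vm in failed_vms:
        dump_vm(vm)
    stable_vms = [vm for vm in vms if vm not in failed_vms]
    return stable_vms, failed_vms
```

In this sketch, stress_events would then be run only on the returned `stable_vms`, and skipped when that list is empty.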

Signed-off-by: Misbah Anjum N <misanjum@linux.vnet.ibm.com>

misanjumn commented 2 months ago

This patch is dependent on the PR: https://github.com/avocado-framework/avocado-vt/pull/3972

misanjumn commented 2 months ago

I have verified all scenarios with respect to the patch. A few examples are captured below:

Scenario 1:

- run stress in 4 guests
- no stress_events
- status: 3 guests in crashed state, 1 guest running. Dump of 3 guests taken.

 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events:  FAIL: Failure in vm3, vm2, vm1 while running stress.  (87736.96 s)
RESULTS    : PASS 0 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

Scenario 2:

- run stress in 4 guests
- run reboot stress_events
- status: all 4 guests passed

 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events:  PASS
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

Scenario 3:

- run stress in 4 guests
- run reboot stress_events
- status: 2 guests crashed, 2 guests hit a login timeout. Dump taken of the 2 crashed guests. stress_events skipped since all guests were in an unstable state.

 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events:  FAIL: Failure in vm4, vm2 while running stress.  Login error in vm1, vm3 while running stress. All vms in unstable state while running stress. Couldn't run STRESS EVENTS (78735.52 s)
RESULTS    : PASS 0 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

Scenario 4:

- skip running stress
- run reboot stress_events
- status: all 4 guests passed

 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events:  PASS
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

Scenario 5:

- run stress in 4 guests
- run reboot stress_events
- status: 1 guest crashed. Dump taken of 1 guest. stress_events ran in the remaining guests.

 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.multivm_cpustress.custom_host_events.custom_vm_events:  FAIL: Failure in vm4 while running stress. (157742.15 s)
RESULTS    : PASS 0 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
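
The FAIL messages in the scenarios above appear to be assembled from the lists of crashed guests and guests with login errors. A minimal sketch of how such a summary could be built is shown below; `summarize_failures` is a hypothetical helper, the message wording is copied from the logs above, and the two lists are assumed not to overlap.

```python
def summarize_failures(failed_vms, login_error_vms, total_vms):
    """Build a failure summary from guest names (hypothetical helper).

    Returns an empty string when there is nothing to report.
    """
    parts = []
    if failed_vms:
        parts.append("Failure in %s while running stress."
                     % ", ".join(failed_vms))
    if login_error_vms:
        parts.append("Login error in %s while running stress."
                     % ", ".join(login_error_vms))
    # If no guest is left in a stable state, stress_events cannot be run.
    unstable = len(failed_vms) + len(login_error_vms)
    if total_vms > 0 and unstable == total_vms:
        parts.append("All vms in unstable state while running stress. "
                     "Couldn't run STRESS EVENTS")
    return " ".join(parts)
```

An empty return value would correspond to a passing run, as in Scenario 2 and Scenario 4.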

misanjumn commented 2 months ago

Explanation of check_call_traces function

1. Handling login timeout issues
   - The function first tries to log into the guest. If the login fails, it retries, making up to 3 attempts in total.
   - If all attempts fail, the guest is appended to the login_error_vms list.
   - If login succeeds in the next cycle, the guest is removed from the login_error_vms list and the check continues.
   - If login still fails in the next cycle, only a single attempt is made and the function moves on to the next step.
2. Checking dmesg
   - If login succeeds, the function runs dmesg in the guest and checks for call traces. Any message at the emerg, alert or crit level is also treated as a failure.
   - If call traces are found, the guest is appended to the failed_vms list so that its dump can be taken in the next stage.

        def check_call_traces(vm):
            # stress_timer, login_error_vms and failed_vms live in the enclosing scope
            nonlocal stress_timer
            found_trace = False
            try:
                retry_login = True
                retry_times = 0
                while retry_login:
                    try:
                        retry_login = False
                        session = vm.wait_for_login(timeout=100)
                        # Login works again: clear the earlier login error
                        if vm in login_error_vms:
                            login_error_vms.remove(vm)

                    except Exception:
                        # Deduct the failed login attempt from the stress timer
                        stress_timer -= 150
                        # Guest already failed in an earlier cycle: do not retry
                        if vm in login_error_vms:
                            return False

                        retry_login = True
                        retry_times += 1
                        # Give up after three failed attempts
                        if retry_times == 3:
                            logging.debug("Error in logging into %s" % vm.name)
                            if vm not in login_error_vms:
                                login_error_vms.append(vm)
                            return False

                        # Wait before retrying and deduct the wait as well
                        time.sleep(30)
                        stress_timer -= 30

                # Look for call traces and for emerg/alert/crit level messages
                dmesg = session.cmd("dmesg")
                dmesg_level = session.cmd("dmesg -l emerg,alert,crit")
                if "Call Trace" in dmesg or len(dmesg_level) >= 1:
                    logging.debug("Call trace found in %s" % vm.name)
                    if vm not in failed_vms:
                        failed_vms.append(vm)
                    found_trace = True
                session.close()

            except Exception as err:
                test.error("Error getting dmesg of %s due to %s" % (vm.name, str(err)))
            return found_trace
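
To try the login handling in isolation, the retry behaviour can be reproduced with a stand-in VM object. The sketch below is an assumption-laden illustration rather than part of the patch: `FlakyVM` and `login_with_retry` are hypothetical names, and the sleep is injected so the example runs instantly.

```python
class FlakyVM:
    """Hypothetical stand-in for a VM whose first `failures` logins fail."""

    def __init__(self, name, failures):
        self.name = name
        self.failures = failures

    def wait_for_login(self, timeout=100):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("login timed out")
        return "session-for-%s" % self.name


def login_with_retry(vm, login_error_vms, max_attempts=3,
                     sleep=lambda seconds: None):
    """Return a session, or None if the guest could not be reached."""
    if vm in login_error_vms:
        # The guest failed in an earlier cycle: try once, do not retry.
        try:
            session = vm.wait_for_login(timeout=100)
        except Exception:
            return None
        login_error_vms.remove(vm)
        return session

    for attempt in range(max_attempts):
        try:
            return vm.wait_for_login(timeout=100)
        except Exception:
            if attempt < max_attempts - 1:
                sleep(30)  # pause before the next attempt
    # Every attempt failed: remember the guest for later cycles.
    login_error_vms.append(vm)
    return None
```

A guest that fails twice and then recovers yields a session on the third attempt, while a guest that keeps failing is recorded in `login_error_vms` and gets only a single attempt per cycle afterwards.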
