controlpath file is not removed if a reboot happens in the middle of the test

smruti77 commented 3 years ago

We already added a step in the test to exit the session after the test completes. When the test runs fine, we do not get the "ERROR: Failed connecting" issue. But sometimes, when the test is interrupted or a reboot happens in the middle of the test, the controlpath file is not removed and we hit the "ERROR: Failed connecting" issue again.

beraldoleal commented 3 years ago

Can you please provide more information? What controlpath are you talking about? Can you describe a "how to reproduce" here?

smruti77 commented 3 years ago

I am talking about the ssh controlpath file that is present in the /root/.ssh directory, e.g. avocado-master-root@121.1.1.88:22

abdhaleegit commented 3 years ago

Hey @beraldoleal, this is similar to the one discussed in https://github.com/avocado-framework/avocado/issues/3810#issuecomment-645370367

We have a CI automation framework in which the same tests are run on multiple different kexec kernels. Things are fine in that case, because each test calls host.remote_session.quit() to clean up the master control path job and files (the good-path case).

At times we have seen a kernel crash while tests were running, due to real Linux issues and kernel panics. The system reboots, and the test that was running never called host.remote_session.quit() to clean up the master control files. The next time, the test tries to use the old control path file again and fails to connect to the peer.

Let me elaborate here with the actual recreation steps.

I have my LPAR booted to the base kernel 4.18.0-147.el8.ppc64le, and no master control was started:

# ls .ssh/
id_rsa  id_rsa.pub  known_hosts
#  ssh -o 'StrictHostKeyChecking=no' -o 'UpdateHostKeys=no' -o 'ControlPath=~/.ssh/avocado-master-%r@%h:%p' -l root -p 22 -O check 121.1.1.88
Control socket connect(/root/.ssh/avocado-master-root@121.1.1.88:22): No such file or directory

Now start the ssh connection and create a control path file:

 # ssh -o 'ControlMaster=yes' -o 'ControlPersist=yes' -o 'StrictHostKeyChecking=no' -o 'UpdateHostKeys=no' -o 'ControlPath=~/.ssh/avocado-master-%r@%h:%p' -l root -p 22 121.1.1.88
root@121.1.1.88's password: 
Activate the web console with: systemctl enable --now cockpit.socket

This system is not registered to Red Hat Insights. See https://cloud.redhat.com/
To register this system, run: insights-client --register

Last login: Mon Nov  9 05:28:23 2020 from 121.1.1.77
 # exit
logout
Shared connection to 121.1.1.88 closed.
# ls .ssh/
avocado-master-root@121.1.1.88:22  id_rsa  id_rsa.pub  known_hosts
#  ssh -o 'StrictHostKeyChecking=no' -o 'UpdateHostKeys=no' -o 'ControlPath=~/.ssh/avocado-master-%r@%h:%p' -l root -p 22 -O check 121.1.1.88
Master running (pid=12651)

So the same master control path file can be used for ssh connections without logging in again.

But when I kexec boot a different kernel without stopping the master controlpath job (trying to mimic a kernel crash here):

# kexec -l linux/vmlinux --initrd=linux/initrd --append=rw
Modified cmdline:rw root=/dev/mapper/rhel_ltcfleet2--lp20-root 
#  kexec -e

Log in to the new kernel and check the ssh control path connection:

Last login: Mon Nov  9 05:35:31 2020 from 9.85.146.82
 # uname -r
5.10.0-rc2-autotest-00140-g521b619acdc8
# ls .ssh/
avocado-master-root@121.1.1.88:22  id_rsa  id_rsa.pub  known_hosts

^^ You can see that the master control path file from the previous kernel still exists (it did not get deleted).

#  ssh -o 'StrictHostKeyChecking=no' -o 'UpdateHostKeys=no' -o 'ControlPath=~/.ssh/avocado-master-%r@%h:%p' -l root -p 22 -O check 121.1.1.88
Control socket connect(/root/.ssh/avocado-master-root@121.1.1.88:22): Connection refused

The ssh check failed, and because of this all our tests are failing.

So the problem is that when a test ends uncleanly, because the system booted into a new kernel or crashed, the controlpath files are not cleaned up, resulting in ssh connection failures.
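The "Connection refused" above is what a Unix socket file returns when no process is listening on it any more. A minimal Python sketch of that check follows. It is an illustration only, not part of Avocado; the function name `is_stale_control_socket` is hypothetical, and the path is assumed to be already expanded (for example /root/.ssh/avocado-master-root@121.1.1.88:22).

```python
import os
import socket
import stat


def is_stale_control_socket(path):
    """Return True if `path` is a Unix socket file that nothing listens on."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # no leftover file at all
    if not stat.S_ISSOCK(mode):
        return False  # not a socket, so not a control path file
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        return True  # the file survived, but the master process is gone
    except OSError:
        return False  # some other failure; do not guess
    else:
        return False  # something accepted the connection
    finally:
        probe.close()
```

Note that the probe briefly connects to a live master if there is one, so `ssh -O check` remains the less intrusive way to ask the same question.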

beraldoleal commented 3 years ago

Hi @abdhaleegit, thanks for the detailed report.

What is your suggestion?

Because the way I see this, there is not much that "internal running software" could do here. All systems are exposed to some kind of kernel/system crash, especially during a "system stress test". Yes, we could add a try..except that checks for this control path and removes it if it exists. But IMO this is not the proper way to solve the problem because, like I said before, the control path is there to ensure a quick new connection.

So determining whether the failure is because the machine is down (or a firewall is blocking it) or because it is a completely new machine is not a trivial task.
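For reference, the "check and remove" idea could look roughly like the Python sketch below, written for the test side rather than as an Avocado change. The names `master_is_running` and `drop_stale_control_path` are hypothetical, the control path is assumed to be already expanded (no `~` or `%r@%h:%p` tokens), and the check relies on `ssh -O check` exiting with status 0 only when a master answers. As noted above, a failed check alone does not say why the master is unreachable.

```python
import os
import subprocess


def master_is_running(control_path, host, user="root", port=22):
    """Ask ssh whether a master is alive on `control_path` (exit status 0)."""
    cmd = ["ssh", "-o", f"ControlPath={control_path}",
           "-l", user, "-p", str(port), "-O", "check", host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL, check=False)
    return result.returncode == 0


def drop_stale_control_path(control_path, host, check=master_is_running):
    """Remove `control_path` if it exists but no master answers on it.

    Returns True when a leftover file was found and removed.
    """
    if not os.path.exists(control_path):
        return False  # nothing to clean up
    if check(control_path, host):
        return False  # live master, keep the fast connection
    try:
        os.unlink(control_path)
    except FileNotFoundError:
        pass  # removed by someone else in the meantime
    return True
```

A test could call `drop_stale_control_path(os.path.expanduser("~/.ssh/avocado-master-root@121.1.1.88:22"), "121.1.1.88")` in its setup, before opening the remote session.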

Another way would be to disable the control path for your case, but this could make the test really slow, because of the new connections.

IMO, the environment should always be clean/fresh after boot, before running tests, to avoid this kind of problem. I guess this would make your tests even more consistent. For instance, one of the libvirt project requirements for running the tests is that the system is completely fresh/new for each boot.

Isn't there any way to ensure that the system is "clean/fresh new" before tests (after boot)? Is this possible?
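One possible way to get that clean state, sketched in Python under stated assumptions: no ssh master can survive a reboot, so any control socket found right after boot is a leftover. The helper name `purge_control_sockets` is hypothetical, the `avocado-master-` prefix is taken from the file name shown earlier in this thread, and the function is assumed to run once after boot, before any test opens a session.

```python
import glob
import os
import stat


def purge_control_sockets(ssh_dir, prefix="avocado-master-"):
    """Delete leftover control sockets in `ssh_dir`; return the removed paths."""
    removed = []
    pattern = os.path.join(glob.escape(ssh_dir), prefix + "*")
    for path in glob.glob(pattern):
        # Only touch socket files; keys and known_hosts are left alone.
        if stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)
            removed.append(path)
    return sorted(removed)
```

Running this later, while tests are in progress, would also remove the sockets of live masters, which is why it belongs in a post-boot step only.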

willianrampazzo commented 3 years ago

From @abdhaleegit's comment:

(trying to mimic a kernel crash here)

I don't see how Avocado would keep the status before and after a possible crash. This is really a test-specific issue, and I also agree it should be handled on the test side.

avocado-framework / avocado

controlpath file is not removed if a reboot happens in the middle of the test #4290