Open Leanessc1 opened 2 years ago
I'm a little confused because you're talking about SSH commands, and I think only one module (or two..?) does SSH; everything else uses the API. Additionally, if you're restarting PAN-OS, that's what `panos_check` is for: waiting for PAN-OS to come back online and ensuring it's ready to accept commands again.
I think we can replace all the references to SSH with TLS instead; it's just generic talk about connectivity from where Ansible is executing to the target firewall :-)
Based on this Issue and Issue #263, it seems there are scenarios where a task using the `panos_software` module, with param `restart: true`, does not complete cleanly. The playbook never gets to use a `panos_check` task or any other subsequent tasks, because the `panos_software` task times out and halts the playbook execution (per the logs in #263); hence the need to use `ignore_errors` for the rest of the playbook to execute.
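For anyone following along, the workaround described above might look roughly like the sketch below. It assumes the `paloaltonetworks.panos` collection is installed and that `provider` and `target_version` are variables defined elsewhere in the playbook (both names are placeholders, not taken from the reports). The retry values are illustrative only.

```yaml
# Sketch only: `provider` and `target_version` are assumed variables.
- name: Install PAN-OS and restart (may time out if mgmt traffic crosses the dataplane)
  paloaltonetworks.panos.panos_software:
    provider: "{{ provider }}"
    version: "{{ target_version }}"
    restart: true
  # Workaround from the reports: let the play continue even if the
  # response to the restart command is never received.
  ignore_errors: true

- name: Wait for the firewall to come back and accept commands
  paloaltonetworks.panos.panos_check:
    provider: "{{ provider }}"
  register: result
  until: result is not failed
  retries: 50
  delay: 30
```

Note that `ignore_errors` also hides genuine download or install failures, so a later task that verifies the running version is probably worth adding.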
The failure condition seems anecdotally to be: when the Ansible-to-target-firewall XML API communication goes through the firewall being upgraded. The restart command is issued via XML API, the dataplane interfaces drop immediately, and they drop so quickly that the restart command is not acknowledged; the response to the XML API restart command is never received by the host executing Ansible. I think.
The above is pieced together from the evidence described in the two Issues raised. I have replicated the scenario (executing an Ansible playbook against a firewall, where the comms from Ansible to the firewall mgmt interface go through the firewall itself), but I could not replicate the failure condition. I think the "stale session" scenario (attempted re-use of a session without starting with a 3-way TCP handshake) is not involved, as Ansible (and the XML API under the hood) do not use long-lived sessions; the modules use individual XML API calls, which would be separate TCP sessions.
@Leanessc1 @bizmark07 Are you able to comment any further, or provide any more guidance on how to replicate this so we can investigate further?
Hello,
I am running into the exact same issue. We are connecting to the firewalls via Panorama here, and the firewalls' management traffic goes through the dataplane. It would be nice to be able to turn the connection check off in the case of a restart because, as of now, the playbook fails even though the upgrade is successful.
Hi @thomaschristory (and also @Leanessc1, @bizmark07), are you able to share any verbose-level logs for your scenario, or more detail about the topology? We've never been able to replicate the failure scenario, so it is hard to troubleshoot and provide a fix. Thank you.
Is your feature request related to a problem?
Any module that restarts the FW fails if the management interface access traverses the firewall dataplane that is being restarted. This causes jobs to fail with SSL connection timeout.
Describe the solution you'd like
Add options to ignore the post checks after the restart command has been sent to the FW. Allow users to complete their own post-reboot checks rather than rely on the built-in checks.
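As an illustration of what a user-defined post-reboot check could look like, here is a minimal sketch using the stock `ansible.builtin.wait_for` module to poll the management interface's HTTPS port from the Ansible host. The variable `ip_address` and the delay and timeout values are assumptions for the example, not part of any existing module behaviour.

```yaml
# Sketch only: `ip_address` is an assumed variable holding the firewall's
# management address; the timings are placeholders to tune per environment.
- name: Wait for the management interface to accept TCP connections again
  ansible.builtin.wait_for:
    host: "{{ ip_address }}"
    port: 443
    delay: 120      # give the firewall time to actually go down first
    timeout: 1800   # give up after 30 minutes
    state: started
  delegate_to: localhost
```

A TCP check like this only shows that the port is reachable; it does not confirm that PAN-OS is ready to accept API commands, so it would normally be followed by a `panos_check` task or similar.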
Describe alternatives you've considered
Currently we use the `ignore_errors` directive on any task that triggers a restart.
Additional context
The current restart jobs leave an open session between Ansible and the FW mgmt interface being restarted. If access to the mgmt interface traverses the data plane of the FW being restarted, the mgmt interface will never be able to reset the SSH session. The data plane drops traffic from the SSH session, as there is no three-way handshake or open session after the restart has occurred.
When the MGMT interface is directly exposed, the restart works as intended, and once the firewall has come back online the post checks complete successfully.