PaloAltoNetworks / pan-os-ansible

Ansible collection for easy automation of Palo Alto Networks next generation firewalls and Panorama, in both physical and virtual form factors.
https://pan.dev/ansible/docs/panos
Apache License 2.0
SSL session breaks on FW reboot #285

Open Leanessc1 opened 2 years ago

Leanessc1 commented 2 years ago

Is your feature request related to a problem?

Any module that restarts the FW fails if the management interface access traverses the firewall dataplane that is being restarted. This causes jobs to fail with SSL connection timeout.

Describe the solution you'd like

Add options to ignore the post checks after the restart command has been sent to the FW. Allow users to complete their own post-reboot checks rather than rely on the built-in checks.

Describe alternatives you've considered

Currently we use the ignore_errors directive for any restart.
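
A minimal sketch of this workaround, extended with a user-run post-reboot check of the kind requested above. It assumes the `paloaltonetworks.panos` collection is installed. The `firewalls` host group, the `provider` and `fw_mgmt_ip` variables, the target version, and the delay and timeout values are all hypothetical placeholders, not values taken from this issue.

```yaml
- name: Upgrade and restart, then run our own post-reboot check
  hosts: firewalls            # hypothetical inventory group
  connection: local
  gather_facts: false
  tasks:
    - name: Install PAN-OS and restart (errors ignored because the task may time out)
      paloaltonetworks.panos.panos_software:
        provider: "{{ provider }}"   # hypothetical variable holding connection details
        version: "10.2.4"            # hypothetical target version
        restart: true
      ignore_errors: true

    - name: Own post-reboot check, wait for the management HTTPS port to answer
      ansible.builtin.wait_for:
        host: "{{ fw_mgmt_ip }}"     # hypothetical variable
        port: 443
        delay: 120                   # hypothetical: give the reboot time to start
        timeout: 1800                # hypothetical upper bound for the reboot
      delegate_to: localhost
```

Note that `ignore_errors: true` also hides genuine failures of the install step, which is part of why a dedicated option would be preferable.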

Additional context

The current restart jobs leave an open session between Ansible and the mgmt interface of the FW being restarted. If access to the mgmt interface traverses the data plane of that same FW, the mgmt interface will never be able to reset the SSH session. After the restart, the data plane drops traffic belonging to the old SSH session, because it has seen no three-way handshake and holds no open session for it.

When the MGMT interface is directly exposed, the restart works as intended, and the post checks succeed once the firewall has come back online.

shinmog commented 2 years ago

I'm a little confused because you're talking about SSH commands, and I think only one module (or two..?) does SSH, everything else is using the API. Additionally, if you're restarting PAN-OS, that's what panos_check is for; waiting for PAN-OS to come back online and ensuring it's ready to accept commands again.
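
For illustration, a `panos_check` task placed after a restart might look like the sketch below. This is not taken from the issue: the `provider` variable and the retry counts are hypothetical, and the task only helps if the preceding restart task did not already halt the play.

```yaml
- name: Wait for PAN-OS to come back online and accept commands
  paloaltonetworks.panos.panos_check:
    provider: "{{ provider }}"   # hypothetical variable holding connection details
  register: result
  until: result is not failed
  retries: 60                    # hypothetical: 60 attempts
  delay: 30                      # hypothetical: 30 seconds between attempts
```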

jamesholland-uk commented 1 year ago

I think we can replace all the references to SSH with TLS instead; it's just generic talk about connectivity from where Ansible is executing to the target firewall :-)

Based on this Issue and Issue #263, it seems there are scenarios where a task using the panos_software module, with param restart: true, does not complete cleanly. The playbook never gets to use a panos_check task or any other subsequent tasks, because the panos_software task times out and halts the playbook execution (per the logs in #263); hence the need to use ignore_errors for the rest of the playbook to execute.

The failure condition seems anecdotally to be: when the Ansible-to-target-firewall XML API communication goes through the firewall being upgraded. The restart command is issued via XML API, the dataplane interfaces drop immediately, and they drop so quickly that the restart command is not acknowledged; the response to the XML API restart command is never received by the host executing Ansible. I think.

The above is pieced together from the evidence described in the two Issues raised. I have replicated the scenario (executing an Ansible playbook against a firewall, where the comms from Ansible to the firewall mgmt interface go through the firewall itself) but I could not replicate the failure condition. I think the "stale session" scenario (attempted re-use of a session without starting with a 3-way TCP handshake) is not involved, as Ansible (and the XML API under the hood) do not use long-lived sessions; the modules make individual XML API calls, which would be separate TCP sessions.

@Leanessc1 @bizmark07 Are you able to comment any further, or provide any more guidance on how to replicate this so we can investigate further?

thomaschristory commented 1 year ago

Hello,

I am running into the exact same issue. We are connecting to the firewalls via Panorama here. The firewalls' management traffic goes through the dataplane. It would be nice to be able to turn the connection check off for a restart, because as of now the playbook fails even though the upgrade is successful.

jamesholland-uk commented 1 year ago

Hi @thomaschristory (and also @Leanessc1, @bizmark07), are you able to share any verbose level logs for your scenario, or more detail about the topology? We've never been able to replicate the failure scenario so it is hard to troubleshoot and provide a fix. Thank you