ManageIQ / manageiq-providers-workflows

ManageIQ plugin for the Workflows provider.
Apache License 2.0

Restarting the appliance leaves old podman tmpfs directories causing failures on next boot #116

Open agrare opened 1 month ago

agrare commented 1 month ago

After restarting the appliance, subsequent podman runs fail with the following error: Error: current system boot ID differs from cached boot ID; an unhandled reboot has occurred. Please delete directories "/tmp/storage-run-988/containers" and "/tmp/storage-run-988/libpod/tmp" and re-run Podman

The issue is that podman expects its temp directories to be on a tmpfs filesystem, and thus cleared on boot; however, the ManageIQ appliance uses an XFS-formatted logical volume for /tmp, so its contents persist across reboots.
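
As a quick sanity check (a minimal sketch; the exact mount layout may differ per appliance), you can confirm what is backing /tmp:

```sh
# Show the device and filesystem type backing /tmp; on the appliance this
# reports an LV with "xfs" rather than "tmpfs", so nothing wipes it at boot.
findmnt -n -o SOURCE,FSTYPE /tmp

# A tmpfs-backed /tmp would instead report something like:
#   tmpfs  tmpfs
```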

Fryguy commented 1 month ago

Is there a reason we don't use tmpfs? Maybe @bdunne or @jrafanie know.

agrare commented 1 month ago

Yeah, that was a surprise to me as well. According to this, podman expects its temp dirs to be on tmpfs: https://github.com/containers/podman/discussions/23193#discussioncomment-9958326
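
For reference (a hedged sketch; the exact path depends on the user and on whether XDG_RUNTIME_DIR is set), podman will report where it keeps these transient directories:

```sh
# Print podman's "run root", the directory it expects to live on tmpfs and be
# wiped on every boot. For the rootless manageiq user this falls back to a
# path under /tmp, e.g. /tmp/storage-run-<uid>/containers.
runuser --login manageiq --command "podman info --format '{{ .Store.RunRoot }}'"
```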

jrafanie commented 1 month ago

I assume it was never changed. For debugging, some stuff would be retained in /tmp, so it might be helpful not to lose it on reboot.

Fryguy commented 1 month ago

Interesting - I can't think of what we would put diagnostically in /tmp, except maybe ansible runner things, and those are cleaned up anyway.

jrafanie commented 1 month ago

> Interesting - I can't think of what we would put diagnostically in /tmp, except maybe ansible runner things, and those are cleaned up anyway.

Right, I think that's the question. If it's the system /tmp by default, things will use it. Ansible was indeed one of the things I was thinking of, but automate as well. I'm not sure if anything else on the system uses it, since most things should use the journal. Does kickstart use it for storing the kickstart or whatever it's applying? I vaguely recall that might be the home directory, and I'm not sure if that changed after we stopped running as root. cloud-init? I'm not sure. It should be using the journal, but I don't know whether any temporary files useful for diagnostics were kept there.

I'm not saying we can't make it tmpfs, only that this was probably the reason we wanted it persisted.

agrare commented 1 month ago

It appears that an LV-backed /tmp was added here: https://github.com/ManageIQ/manageiq-appliance-build/pull/51

I'm not sure why STIG recommends a persistent /tmp, but if that is no longer a requirement then it looks like we'd be safe to delete it.

Honestly, a 1GB partition-backed /tmp isn't guaranteed to be "more space" than a RAM-backed tmpfs anyway, given the amount of memory we recommend, so I don't think this is a "usage" argument. Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible would depend on that, but we can always check it.

If we reboot in the middle of an ansible-runner execution, it should be fine to clear and clean up on boot, since the execution is certainly cut short anyway.

jrafanie commented 1 month ago

> I'm not sure why STIG recommends a persistent /tmp, but if that is no longer a requirement then it looks like we'd be safe to delete it.

It looks like STIG wanted separate filesystems and not specifically persistent.

agrare commented 1 month ago

> It looks like STIG wanted separate filesystems and not specifically persistent.

👍 now that I 100% agree with, but we would have already had that (I hope)

jrafanie commented 1 month ago

> Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible would depend on that, but we can always check it.

> If we reboot in the middle of an ansible-runner execution, it should be fine to clear and clean up on boot, since the execution is certainly cut short anyway.

Right. I don't recall anything depending on persistent data in /tmp, only that the diagnostics were there for review if there was a problem. Again, it's been so long, and I have a strong feeling the only thing we'd ever lose is diagnostics. I think we've moved a lot of the diagnostics to actual logging through the journal, so this is less of an issue than it was even at the time of that PR.

jrafanie commented 1 month ago

> It looks like STIG wanted separate filesystems and not specifically persistent.

> 👍 now that I 100% agree with, but we would have already had that (I hope)

Before that PR, logs, tmp, home, etc. were all on the / volume. He created separate filesystems for each. I think we're agreeing, but I want to make it clear that it was really about splitting the different types of data onto different volumes with different filesystems, and less about the implementation of each volume.

agrare commented 1 month ago

> Before that PR, logs, tmp, home, etc. were all on the / volume.

Oh interesting, thanks. Yeah, I didn't realize /tmp was just on /...

Fryguy commented 1 month ago

https://www.stigviewer.com/stig/red_hat_enterprise_linux_8/2021-12-03/finding/V-230295

Fryguy commented 1 month ago

> Also, nothing (well designed) should ever depend on /tmp persisting across reboots, and I highly doubt ansible would depend on that, but we can always check it.

We temporarily check out git repos to /tmp and then run ansible-runner in them, so that's not ansible doing it. For diagnostics, you can set an ENV var to tell our Ansible::Runner code not to clean up after itself so you can see what's in there, but even in that case I'd want tmpfs so it goes away and gets cleaned up on reboot.

Fryguy commented 1 month ago

I'm 100% for /tmp being a separate volume that is tmpfs - and based on the discussion here I'm pretty sure that satisfies STIG as well.
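
For illustration only (a sketch, not necessarily how the appliance build would do it; the existing LV mount for /tmp would also need to be dropped from /etc/fstab), a tmpfs-backed /tmp can come from systemd's stock tmp.mount unit or an explicit fstab entry:

```sh
# Option 1: enable systemd's built-in tmpfs mount for /tmp
# (may require `systemctl unmask tmp.mount` first on some releases)
systemctl enable tmp.mount

# Option 2: an explicit fstab entry (the size shown is an arbitrary example)
echo 'tmpfs  /tmp  tmpfs  defaults,nosuid,nodev,size=1G  0 0' >> /etc/fstab
```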

Fryguy commented 1 month ago

Strangely, I can't find anything STIG-related in our code base - I was pretty sure we had something in appliance console, but maybe that was CloudForms specific? I also don't see anything STIG-related in the Appliance Hardening Guide.

That being said, I also recall that we tried to default as much as we could to the STIG and SCAP rules so that we could minimize the changes customers needed to make themselves.

agrare commented 1 month ago

Interestingly, there is a podman-clean-transient.service oneshot service, but it doesn't clean up anything in /tmp, on the assumption that those directories are automatically cleared.

```
[root@manageiq ~]# runuser --login manageiq --command 'podman system prune --external'
Error: current system boot ID differs from cached boot ID; an unhandled reboot has occurred. Please delete directories "/tmp/storage-run-986/containers" and "/tmp/storage-run-986/libpod/tmp" and re-run Podman
```
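
As an interim workaround (a sketch; the uid-specific paths come from the error above and will differ per user, and the drop-in file name is arbitrary), the stale run directories can be removed by hand, or a systemd-tmpfiles drop-in can clear them at boot even while /tmp stays on the LV:

```sh
# One-off recovery: remove the directories the error names, then retry.
rm -rf /tmp/storage-run-986/containers /tmp/storage-run-986/libpod/tmp
runuser --login manageiq --command 'podman system prune --external'

# Boot-time cleanup via systemd-tmpfiles ("!" lines apply only at boot):
cat > /etc/tmpfiles.d/podman-stale-run.conf <<'EOF'
R! /tmp/storage-run-*
EOF
```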