NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I want a health check implemented on the HAProxy Podman container that restarts it if HAProxy becomes unresponsive or goes down. #172

Open epag opened 1 month ago

epag commented 1 month ago

Author Name: Hank (Hank) Original Redmine Issue: 130882, https://vlab.noaa.gov/redmine/issues/130882 Original Date: 2024-05-31


This is primarily for ITSG, so I posted ITSG-2601. Braden had offered this in ITSG-2322:

It looks like this has been up and running for a few days without issue. I'm going to close this ticket but I want to offer a follow up. I've seen HA Proxy on Docker have issues with crashes and causing this same issue. In a previous workplace, we added health checks to reload the container when it crashed. If you'd like to discuss that option please reach out to me and we'll work that in a new ticket.

If there is any work for me/us to do, I'll report on it in this ticket, but I think all of the work needs to be done by ITSG. I'll provide updates here as I receive them.

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-11T16:30:45Z


Braden is setting up an "autoheal" container, as described in this document:

https://docs.google.com/document/d/1PimyTpykkOvRXbtz74vaTKhE7tocKVDbge2FyMZ4ItY/edit#heading=h.bn6obndze151

I have a meeting scheduled for Thursday afternoon with Braden. I think he intends to set it up in staging at that time. Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-14T16:54:38Z


Summary of conversation with Braden...

He's going to set up a Podman container to run @docker-autoheal@, which also works with Podman:

https://github.com/willfarrell/docker-autoheal

He said that the image will be put in the registry to ensure it gets scanned, and then we'll pull the image down from there. He's going to prepare the Podman configuration, as well as make any changes needed to the HAProxy configuration, and let me know where those files are on the staging machine. When the proxies were set up before, I was given access to the HAProxy configuration, but the Podman container configuration file was not opened up to me (I don't even know where it is). That was because ITSG intended to maintain it. Now, it may end up being us who have to maintain it. I'm okay with that, even if it's not my preference; I just wanted to note the change.
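For our own notes, here's a rough sketch of the moving pieces using plain @podman@ commands. Braden's actual setup may use podman-compose and will certainly differ; the image names, container config path, monitor port, and health command below are placeholders, since the real configuration hasn't been shared with us:

```bash
# Hypothetical sketch only -- not the ITSG configuration. Image names, the
# in-container config path, ports, and the health command are placeholders.

# Expose the Podman API socket so autoheal can issue restarts (Docker-compatible API).
sudo systemctl enable --now podman.socket

# HAProxy with a container health check. The check assumes a lightweight monitor
# endpoint is exposed on :8404 and that curl is available in the image -- both
# are assumptions for illustration only.
podman run -d --name haproxy \
  --label autoheal=true \
  -p 443:443 \
  -v /opt/haproxy/cfg/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
  --health-cmd 'curl -fsS http://127.0.0.1:8404/healthz || exit 1' \
  --health-interval 30s --health-timeout 5s --health-retries 3 \
  --restart always \
  registry.example/ubi8/haproxy:latest

# autoheal watches for containers labeled autoheal=true that report "unhealthy"
# and restarts them via the mounted socket.
podman run -d --name autoheal \
  -e AUTOHEAL_CONTAINER_LABEL=autoheal \
  -v /run/podman/podman.sock:/var/run/docker.sock \
  --restart always \
  registry.example/willfarrell/autoheal:latest
```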

The issue that caused the HAProxy problems before may be difficult to reproduce. I told him that, so long as we can do some level of testing, exact reproduction may not be necessary. We can just deploy to production and then check every now and then to see if a container was restarted.
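If we want to do that spot check ourselves, a few commands along these lines would show it (illustrative only; the container name @haproxy@ is a placeholder):

```bash
podman ps --format '{{.Names}} {{.Status}}'            # Status includes (healthy)/(unhealthy) when a health check is defined
podman inspect --format '{{.RestartCount}}' haproxy    # increments each time the container is restarted
podman events --filter event=restart --since 24h       # restart events over the last day
```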

The idea is to get this set up today (he hopes), and then check in on it late next week. If it's working well, we can deploy to production. I'll talk to him at that time about getting the configurations into a repository somewhere, perhaps private GitHub.

Any questions or concerns, let me know,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-18T15:02:17Z


Braden needs to migrate the proxy servers to RHEL 8 in order to establish the health check, because it requires podman-compose, and that is not available for RHEL 7. He is going to test the migration in staging soon, perhaps today.
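If it's useful to verify the prerequisites on a host, something like the following would do it (illustrative checks, not an ITSG procedure):

```bash
cat /etc/redhat-release     # expect a RHEL 8.x release string
podman --version
podman-compose --version    # packaged for RHEL 8 (e.g., via EPEL or pip); not available on RHEL 7
```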

As for production, I proposed that he establish a new VM on the side and then we switch over to it. This is to minimize downtime, which Braden said would be a "couple hours". What he described, however, includes many more steps than just migrating in place:

we can do a side by side upgrade, migrate you to a new VM rather than in place. You'd get a new hostname on the new naming standard, have to make a few network changes, new SSL certs, a few other things on the back end.

My gut tells me that a side migration is still appropriate, particularly if something goes awry. What do you all think?

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-18T15:05:31Z


Yeah, I would do something on the side and then cut across. Good to know that podman-compose requires rhel8, so that is the order for cowres too.

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-18T16:47:35Z


Thanks. I've asked ITSG to do the work in production on a separate machine.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-21T14:06:59Z


Braden started working on the staging proxy migration to RHEL 8. The staging proxy, at the moment, is down, so any WRES developer needing to use it should use the internal URL, nwcal-wres-ti.[domain]. For now, the proxy does appear to be available if you add the port number to the URL, but that's not a long-term solution (we shouldn't need the port number), and Braden reported that he's seeing other issues as well. I directed him to the correct location of the HAProxy configuration, to the best of my knowledge (according to a wiki, @/opt/haproxy/cfg/haproxy.cfg@), but told him that I can't log in at the moment to confirm that location.

More later,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-07-02T13:36:07Z


I don't think this is done yet, moving

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-07-02T13:42:16Z


Staging is done and looks good, except that I can't run the podman restart commands that I used to run. However, this ticket cannot be closed until production is changed. So, moving to 6.25 is correct.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-07-03T17:50:37Z


Braden was unable to add the health check at this time. Per Braden, the health check software required a UBI 9 image. However, the UBI 9 image is not FIPS enabled. FIPS apparently is required for the DMZ:

Confirmed with Todd, we have to use FIPS container images in the DMZ and there is no approved FIPS container on UBI 9 so that's going to be a blocker for this effort. I'm going to revert the running HA Proxy on the TI instance to the UBI 8 FIPS container. I'll be backlogging this ticket until we have a UBI 9 FIPS image to work with. Sorry we can't get these health checks implemented in the DMZ

As stated above, once a FIPS-enabled version is available, he plans to add the health check.
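For our own edification, whether FIPS mode is active on a host (and therefore in containers sharing its kernel) can be checked with standard RHEL tooling; whether a given UBI image ships FIPS-validated crypto modules is a separate question that ITSG is tracking. An illustrative check:

```bash
cat /proc/sys/crypto/fips_enabled    # prints "1" when the kernel is in FIPS mode
fips-mode-setup --check              # RHEL 8+ helper; reports whether FIPS mode is enabled
```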

This ticket is essentially on-hold until the needed image is available, so I'll put it on-hold. But, again, the legwork for this ticket is to be done by ITSG, not us; I'm only using this ticket to take notes for our own edification. Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-07-30T19:36:16Z


Part of the health check implementation is a move to RHEL 8 for the proxy. That was done for staging, but not yet for production. For production, Braden is tracking his work to establish a proxy VM next to the existing one in "ITSG-2853":https://jira.nws.noaa.gov/servicedesk/customer/portal/1/ITSG-2853. He stood up @https://owpal-dmz-wresp01.[domain]@, Ron opened up access, and my tests of access to the production services via that URL pass. They are now waiting for me to give a thumbs up to cut over to the new machine. That requires essentially mapping the @wres.[domain]@ URL to point to the new proxy machine.

Evan, Arvin: I ask that each of you do at least some testing of the new proxy machine, spending hopefully just a few minutes to see if you can interact with the WRES GUI and COWRES through it as usual. Specifically, using either @wres.[domain]@ or @owpal-dmz-wresp01.[domain]@ should give you the same results when interacting with the COWRES or WRES GUI.
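For anyone who wants a quick command-line check in addition to the GUI/COWRES testing, something like this would confirm both hostnames respond the same way over HTTPS (illustrative only; the bracketed [domain] is a placeholder, as above):

```bash
for host in 'wres.[domain]' 'owpal-dmz-wresp01.[domain]'; do
  curl -ks -o /dev/null -w "%{http_code}  ${host}\n" "https://${host}/"
done
```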

If you spot any issues, please let me know. If I don't hear anything by tomorrow morning, I'll post a comment on the ITSG ticket giving them a thumbs up.

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-07-31T15:42:42Z


No one reported issues, so I gave the thumbs up for ITSG to cut over to the RHEL 8 proxy in production. Braden is hoping to cut over tomorrow. Ron estimates an outage via the proxy URL of about 5 minutes. If it's much more than that, I'll report the outage via a News item. Otherwise, I'm guessing our users aren't likely to be impacted. Still, I'll keep my eyes on the service to see who is using it when the cutover happens. I'll also do some immediate testing once it's done.

Again, that is just for the RHEL 8 component of this ticket. The health check component will still need to wait for a UBI 9 image that is FIPS enabled.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-08-01T19:40:49Z


The proxy is currently down. I hope that means the cutover is in progress.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-08-01T19:58:15Z


If the proxy URL does not come back up before my day is done, I think I need to post a News item letting folks know the WRES will be down. How about this?

The WRES web services are unavailable through the URL, wres.[host], while ITSG upgrades a machine supporting that URL. It should become available later this evening. Thanks for your patience,

The WRES team

I decided against encouraging them to use the internal NWC URLs, since the WRES GUI would still fail (service assignments point to the proxy URL) and, in general, I don't want to point users to those URLs.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-08-01T20:07:02Z


Scratch that. The outage only impacts folks on the NWC VPN. RFCs (and I, when connected to the Silver Spring, MD, VPN) can access the COWRES as usual. The WRES GUI can also access the COWRES.

Evan, Arvin: Please be aware that you won't be able to access the COWRES or WRES GUI in production when logged into the NWC VPN until they fix the issue, which hopefully won't be too long. Alternatively, you can use the internal URLs, nwcal-wres-prod and nwcal-wresgui-prod01.

My work day is done. Have a great evening!

Hank