cal-itp / benefits

Transit benefits enrollment, minus the paperwork.
https://docs.calitp.org/benefits
GNU Affero General Public License v3.0

Refactor how data is persisted in Azure #2153

Closed: thekaveman closed this 3 months ago

thekaveman commented 3 months ago

Recently, we have been seeing errors in dev when attempting to sign in to the Admin because the database is in a locked state:

```
OperationalError: database is locked
```

While we have only seen this in dev so far, it seems inevitable that it will come up in test and/or prod, where it will be a real problem if we can't access the Admin or if the app is locked for user requests.

The root of the issue is the change from #1773 to persist the (SQLite) database file in a mounted Azure file share. This mostly works for reads, but now that we have the potential for more writes via the Admin, it is starting to break down (as expected; see e.g. this SO post).
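As a minimal sketch of the contention (the database path here is made up, not the app's real one): a second writer errors out as soon as another connection holds SQLite's write lock, and on a network mount like Azure Files the file locking SQLite depends on is reportedly unreliable in the first place, per the SO post above.

```bash
DB=/home/data/django.db  # hypothetical path on the mounted share
# Session 1: take the write lock and hold it open for a few seconds.
{ echo "BEGIN IMMEDIATE;"; sleep 5; } | sqlite3 "$DB" &
sleep 1
# Session 2: a second writer fails immediately -> "database is locked"
sqlite3 "$DB" "BEGIN IMMEDIATE;"
wait
```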

The idea is to reuse as much of the existing infrastructure and process as possible, with two changes:

This should solve the SQLite-over-a-network issue while allowing us to maintain data configured via the Admin across restarts/deployments.

Acceptance Criteria

Plan

Additional context

This FAQ article about Azure Web App Containers has been very informative: https://techcommunity.microsoft.com/t5/apps-on-azure-blog/things-you-should-know-web-apps-and-linux/ba-p/392472

We can also think about periodically dumping fixture data to the network file share as a backup, following the commands from the notes in https://github.com/cal-itp/benefits/issues/1856#issue-2086716269
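A minimal sketch of such a dump, assuming a Django dumpdata call; the app label, output path, and timestamp format are placeholders, and the real commands are in the #1856 notes:

```bash
#!/usr/bin/env bash
# dump_fixtures.sh -- periodic fixture backup to the mounted share (sketch).
set -euo pipefail
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
# "core" and the backup directory are placeholders, not the app's real values.
python manage.py dumpdata core --indent 2 > "/home/data/backup/fixtures-${STAMP}.json"
```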

One idea for implementing this is to use Azure WebJobs, which is now in Preview for Linux Containers and can run simple bash .sh scripts.

angela-tran commented 3 months ago

One idea for implementing the frequent fixture dumping is to use Azure WebJobs, which is now in Preview for Linux Containers and can run simple bash .sh scripts.

A small wrinkle is that this feature requires the app setting WEBSITES_ENABLE_APP_SERVICE_STORAGE=true.
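If I understand the WebJobs convention correctly (worth verifying for Linux Containers, since the feature is in Preview), a triggered job is just files under App_Data/jobs/triggered/<name>/ in the site's persistent /home storage, which would explain why that setting is required. A sketch, with all names and the schedule as placeholders:

```bash
JOB=/home/site/wwwroot/App_Data/jobs/triggered/dump-fixtures
mkdir -p "$JOB"

# The script the WebJob runs:
cat > "$JOB/run.sh" <<'EOF'
#!/bin/bash
python manage.py dumpdata core --indent 2 > /home/data/backup/fixtures.json
EOF
chmod +x "$JOB/run.sh"

# Optional NCRONTAB schedule for the trigger (six fields; this fires hourly):
echo '{ "schedule": "0 0 * * * *" }' > "$JOB/settings.job"
```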

Another idea could be to install cron in the Docker container and set up a cron job to run whatever script / commands we need to. Maybe this would get around needing to move all our files.

thekaveman commented 3 months ago

> Another idea could be to install cron in the Docker container and set up a cron job to run whatever script / commands we need to. Maybe this would get around needing to move all our files.

This is definitely another option. There are a couple of issues to work through, most notably that cron wants to run as root, while we use a non-root user inside the container. We're also already running two processes inside the container (nginx and gunicorn), which goes against how Docker is "supposed" to be used (one process per container). And cron is finicky to get output from and will often fail silently; we'd need to redirect its output to the Docker logs to be able to see anything (not impossible, but tricky, I think).
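For what it's worth, the usual workaround is a wrapper script that points its output at PID 1's streams so cron jobs show up in the Docker logs; a sketch (script name and paths are made up):

```bash
#!/usr/bin/env bash
# Wrapper a cron entry would invoke. Redirect stdout/stderr to PID 1's
# streams so output reaches the Docker logs instead of cron's mail spool.
exec 1> /proc/1/fd/1 2> /proc/1/fd/2
python manage.py dumpdata core --indent 2 > /home/data/backup/fixtures.json
```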

With a WebJob, we would get some type of logging in the Azure UI, and we wouldn't have to worry about the underlying runtime. I'm open to other ideas, though, if this seems like too much of a change to uproot our source tree.

angela-tran commented 3 months ago

> This is definitely another option. There are a couple of issues to work through, most notably that cron wants to run as root, while we use a non-root user inside the container. We're also already running two processes inside the container (nginx and gunicorn), which goes against how Docker is "supposed" to be used (one process per container). And cron is finicky to get output from and will often fail silently; we'd need to redirect its output to the Docker logs to be able to see anything (not impossible, but tricky, I think).
>
> With a WebJob, we would get some type of logging in the Azure UI, and we wouldn't have to worry about the underlying runtime. I'm open to other ideas, though, if this seems like too much of a change to uproot our source tree.

I agree about the lack of visibility into failures if we just run cron in the container, and I hadn't thought about the user and running-processes issues. Azure WebJobs seems like a good direction to continue in.

thekaveman commented 3 months ago

Perhaps another option (though this also seems difficult to get working correctly): avoid cron and WebJobs altogether, and use GitHub Actions to initiate the dumpdata command via a remote SSH connection to the running container.

We would need to modify the container to enable SSH.
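Roughly, an Actions step might look like this sketch, assuming the container exposes App Service's SSH on port 2222 and using az webapp create-remote-connection to open a tunnel (resource names are placeholders):

```bash
# Open a local tunnel to the container's SSH endpoint, then run the dump.
az webapp create-remote-connection \
  --resource-group <rg> --name <app-name> --port 50022 &
sleep 10  # crude wait for the tunnel to come up

# App Service's built-in SSH runs as root; see the non-root caveat below.
ssh -p 50022 root@127.0.0.1 \
  "python manage.py dumpdata core --indent 2 > /home/data/backup/fixtures.json"
```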

It also seems hard to get the SSH server running in the container as a non-root user: https://www.golinuxcloud.com/run-sshd-as-non-root-user-without-sudo/

thekaveman commented 3 months ago

I rewrote this ticket to focus on just the updates for setting WEBSITES_ENABLE_APP_SERVICE_STORAGE=true and seeing how far that gets us.
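For reference, flipping that setting with the az CLI would look something like this (resource names are placeholders; in practice it would go through our infrastructure config):

```bash
az webapp config appsettings set \
  --resource-group <rg> --name <app-name> \
  --settings WEBSITES_ENABLE_APP_SERVICE_STORAGE=true
```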

The automated fixture dump/restore could be a follow-up if these changes end up working.