Containerizing Torque - GitHub Issues

Overview

I was thinking about this over the weekend and wanted to write down my thoughts on it. I couldn't find any existing issues on containerization, although I know it has already been discussed on some level, so @frankduncan feel free to direct me there :)

The Problem

Developing torque requires a VM, a server, or a machine running debian. Currently, I'd say OTS's development environment is a running server, with developers either

Making changes on the server via SSH, restarting any server bits, and syncing the changes down locally, where you can commit using a secure ssh key, or
Making changes locally, committing, pushing, pulling, restarting any server bits that need to be restarted, and repeating.

This requires developers to know a fair amount: how to run a server, how to push and pull files between machines fairly easily, and how to run through our ansible setup defined in a separate repository (which admittedly is fairly straightforward so long as you know which ansible scripts to run).

My Proposal

We containerize both mediawiki with torquedataconnect and torque.

Containerizing mediawiki w/ TDC:

Mediawiki already has a maintained container on dockerhub. We can easily extend that container by using it as our base image, as in the stubbed-out Dockerfile below:

FROM mediawiki:latest
# Install any addons torque may need here

We would have a separate container for our torque server and perhaps further containers for databases as well, which could easily be switched out for managed databases (such as AWS RDS). Running these services in tandem for development can easily be configured using something like docker-compose or, if we would rather target kubernetes, by converting the compose file with kompose.

This allows us to have all of the development server hosting & configuration stored in version control and set up automatically in essentially one command (i.e. docker-compose up), and allows developers to easily change and test code without needing to host remotely.

I think containerization could potentially make maintaining our production servers easier as well by simplifying our deployment process to dumping an image into a VM. @kfogel @frankduncan do either of you have thoughts on this?

What would be needed

Obviously containers are new tech and would require some learning to be able to easily use. I think the net benefit would be worth the investment in the long run, especially if we ever want to have volunteer contributors. I am by no means an expert in setting up containerized applications, and I'm also not a dev-ops person. However, I have containerized some applications in the past as well as helped configure some small Kubernetes clusters. It'd be great to hear if other OTS team members are interested in utilizing containers as well.

Thanks for the writeup! This is definitely something we should consider. My following thoughts, though negative, aren't intended to come across as a shutdown, but hopefully as a "this is why I didn't, and these are the problems I think we'll have"

My initial thoughts, in general for this project, are against containerization. In general, the problem they solve is rapid deployment and management of software to multiple systems. The cost they incur is a layer of abstraction and another point of failure. We don't really have the problem they solve as we deploy the entire stack to one machine. The layer of abstraction will affect not only the running of the software (where logs are, debugging issues within the container), but also system upgrades.

Specifically for the LFC project, we do have server configuration in source control. The solution we have chosen so far is ansible at torque-sites. An example is the somewhat unwieldy mediawiki install which would then need to be containerized as well. For LFC, at this point my ignorance on docker fails me because I have no idea the work it would take to turn that ansible file into a containerized, deployable version. The other option is that we have a containerized version for working in development on only torque, but then we need to use the full stack ansible version when actually doing work on a fully built out mediawiki instance. One interesting point is how containerized mediawiki would work with simplesamlphp correctly. I don't know if there are any assumptions that extension makes that containerization would break.

Moving forward for this project, we want to create a unified mediawiki software stack that uses symlinks of LocalSettings in order to have different competitions configured differently. This allows us to deploy one version of an update to an extension to all competitions simultaneously, and also to get our updates of mediawiki out of ansible and into apt (see issue 56). I feel like a containerized mediawiki instance would be a pain to maintain compared with a filesystem version based on apt, but I might not be correct on that. We want our sysadmin to be able to update servers as cleanly as possible, hopefully without our involvement, which I don't know we could pull off with mediawiki upgrades if we've fully containerized the mediawiki+extensions+LocalSettings we will deploy for all the competitions.

Similarly, anyone who has a running mediawiki instance, which is normally deployed via system packages, and finds that torque solves the problem facing them will have to migrate their infrastructure to dockerized mediawiki, or have to untangle TDC from our containerized deployment to apply to their infrastructure. Right now the installation instructions are to put the extension in a directory, which is similar to every other mediawiki extension.

I think the greater issue here is that after doing a distro's package management version of "apt install mediawiki", and then following the TDC installation instructions, TDC didn't work. I think that's the real problem that should be fixed, as we would like torque to be a solution for the greater mediawiki community. In reality, instead of docker compose up, a person should just be able to do tar zxf tdc.tar.gz in their extension directory after doing apt install mediawiki and add "LoadExtension('TDC')" to their LocalSettings and go.

All that said, I want to point out that these are all reasons and considerations I had when I thought about containerizing the project, and many of them may be based on simple ignorance. So the conclusion I came to was based not only on the perceived problems of moving to docker and the perceived advantages we couldn't really make use of, but also on the cost of becoming knowledgeable enough in docker to effectively attack issues, which raised the project time investment way too high.

> In general, the problem they solve is rapid deployment and management of software to multiple systems.

While I can't really speak to this problem specifically, I do agree that containers bring additional abstraction. For me, the problem I'm most concerned with solving is streamlining the development workflow. That is, minimizing the time it takes to get someone running the full torque application, as well as minimizing the time it takes for a single change to be made and reflected in the application being tested. I think containers really excel at this, as the abstraction allows these things to be automated and completed with 0 knowledge from the application developer.

> An example is the somewhat unwieldy mediawiki install which would then need to be containerized as well.

If I understand correctly, this is already done in the mediawiki image on docker hub. What this means is our container will only be concerned with configuring a debian machine which already has mediawiki set up, that is, installing things like simplesaml, the torque extension, etc.
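As a rough sketch of what that could look like (not a tested setup): the image tag and the local directory name `TorqueDataConnect` are assumptions for illustration, and the simplesaml step is left as a comment because the packages it needs are not established in this thread.

```dockerfile
# Sketch only. The tag and the local directory name are assumptions.
FROM mediawiki:1.39

# The official image serves MediaWiki from /var/www/html, so extensions
# live under /var/www/html/extensions. "TorqueDataConnect" is assumed to be
# a local checkout of the extension sitting next to this Dockerfile.
COPY ./TorqueDataConnect /var/www/html/extensions/TorqueDataConnect

# Anything simplesaml needs would be installed here. No packages are
# listed because the exact set is not known yet.
```

A LocalSettings file that loads the extension would still have to be supplied separately, for example by mounting it into the container at run time.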

> We want our sysadmin to be able to update servers as cleanly as possible, hopefully without our involvement, which I don't know we could pull off with mediawiki upgrades if we've fully containerized the mediawiki+extensions+LocalSettings we will deploy for all the competitions.

I think this upgrade would be as simple as updating FROM mediawiki:1.0.0 to FROM mediawiki:1.0.1 in our Dockerfile and rebuilding the image. In fact, I think this could make upgrading less of a headache in some ways.
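One way to make that bump a one-line change is to pin the base image through a build argument. This is only a sketch: `MEDIAWIKI_VERSION` is a name made up here, its default is a placeholder rather than the version we actually run, and the 1.0.0/1.0.1 numbers above are illustrative too.

```dockerfile
# Sketch only: the default version is a placeholder, not our real version.
# An ARG declared before FROM can be used in the FROM line itself.
ARG MEDIAWIKI_VERSION=1.39
FROM mediawiki:${MEDIAWIKI_VERSION}

# Install any addons torque may need here
```

Upgrading would then mean changing the default (or passing a different value with `docker build --build-arg MEDIAWIKI_VERSION=...`) and rebuilding the image.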

I think there is something to be said about maybe not wanting to tie our project into requiring docker or an equivalent to be run. However, I think the benefits are pretty strong. I spent around 24 hours last month trying to get torque set up and running, debugging server issues, and dealing with AWS instances crashing. Granted, I will chalk a lot of that up to my own inexperience, misunderstanding of various parts of torque architecture, and probably just not giving DESIGN.md a deep enough read.

Still, I think having a simple docker-compose up or some sort of equivalent command, which can get someone up and running with what would normally require provisioning a VPS, a full ansible install of all of our services, and writing/running scripts to upload test data into torque, could be a serious win for developer experience.

> I think this upgrade would be as simple as updating FROM mediawiki:1.0.0 to FROM mediawiki:1.0.1 in our Dockerfile and rebuilding the image. In fact, I think this could make upgrading less of a headache in some ways.

The goal is to apt update/upgrade and do the entire system for all security updates, agnostic of what's on the system. We won't meet that goal fully, but our hope is to get as close as possible. That doesn't necessarily mean that this can't be one of the exceptions, but it won't make upgrading less of a headache than using the distributed mediawiki packages.

> Still having a simple docker-compose up or some sort of equivalent command which can get someone up and running with what would normally require provisioning a VPS

I agree that we need this, and we need to streamline development, but I'm not sure docker is the path forward. A one-stop ansible command for a demo project is more in line with how we're deploying competitions right now. The project is probably too small to maintain multiple deployment avenues and keep them all up to date, so unless moving to docker in production is on the docket (ha), this may get subjected to bitrot.

This should probably become a real-time discussion, when we all get a chance. My high-level takeaways so far are:

1) This is something we would do to make development easier. It wouldn't affect our recommended production deployment method.

2) Therefore, could the container essentially be a ready-to-go Debian machine that the developer spins up and then points our usual ansible scripts at? (I'm leaving it deliberately vague whether "ready-to-go" includes having Mediawiki already present or not; I don't think that's the most important question here, though it's one we'd have to decide eventually).

Both Mike and Yaxel have faced difficulty getting Torque set up the first time. If we say that the way to streamline development is to make a one-stop ansible command, that leaves unanswered the question of what destination the developer is supposed to point the ansible command at.

Mike's solution has been to stand up an AWS server, which I think is both a perfectly reasonable solution under the circumstances and simultaneously kind of ridiculous: why in the heck would a developer need to activate a remote machine just to work on Torque? A container doesn't vanish when your network goes down; you never experience netlag working with a container.

On the other hand, then Mike could in theory have grabbed any available standard Debian container image and used that (as I describe above) for development, and he chose not to do that. There's probably a reason he made that choice, and this suggests that maybe I don't understand all the constraints here. Hey, at least I'm not reluctant to look ignorant in a GitHub issue ticket -- I don't know how I'd do my job if I had an allergy to that! Anyway, let's talk about this soon. I think it's a problem for Torque that the developer onboarding experience seems to be so steep. We need to solve that, but our solution to that problem doesn't necessarily need to have any implications for production deployment.

@kfogel

> This is something we would do to make development easier. It wouldn't affect our recommended production deployment method.

I'd recommend if we move to using docker in development, that we use it in production as well. I'd see no reason not to at that point.

> Therefore, could the container essentially be a ready-to-go Debian machine that the developer spins up and then points our usual ansible scripts at? (I'm leaving it deliberately vague whether "ready-to-go" includes having Mediawiki already present or not; I don't think that's the most important question here, though it's one we'd have to decide eventually).

I'd recommend that if we switch to docker, we make the investment to create our build system using docker as well, rather than trying to shove our current ansible setup into a container. I'm not sure who our production hosting provider is, or if we maintain our own hosting infrastructure, but almost every single hosting provider out there provides managed container hosting such as Amazon ECS.

@frankduncan

> The goal is to apt update/upgrade and do the entire system for all security updates, agnostic of what's on the system. We won't meet that goal fully, but our hope is to get as close as possible.

The thing with using docker is that we won't be running these commands manually ourselves, which I'd argue is a benefit. Our container image depends on mediawiki, which depends on php-apache, which depends on debian-buster. If a security update is pushed to debian or any application in between, updating is simply rebuilding the image. Most managed services (such as ECS) even have a "refresh" button to do this and handle the re-deploy for you.

> On the other hand, then Mike could in theory have grabbed any available standard Debian container image and used that (as I describe above) for development, and he chose not to do that.

For what it's worth, I considered doing this. However, converting all of our build configurations into docker as my first task seemed pretty ambitious, and with our need to show progress being pretty clear, I didn't want to spend too much time on something we could not show to the client. I also attempted running a Debian VM locally and simply treating it as a local server, but this ended up eating a lot of resources on my laptop, so given the trouble I was having, I figured I would follow the existing methods of setting up torque used by other team members.

All in all, I think using containers could streamline development, make onboarding new team members easier, and would have little to no extra cost on existing operations. However, the effort of switching to containers is significant, we would have to re-write our build system, there would be a learning curve as we learn how they work, and we would have to switch over our production environment to use them. I'm not really in a position to advise as to whether that cost is worth the benefits containers would bring to us but I do think there's a real evidenced benefit this technology could bring to our project.

> If a security update is pushed to debian or any application in between, updating is simply rebuilding the image. Most managed services (such as ECS) even have a "refresh" button to do this and handle the re-deploy for you.

Hm, I'm thinking bigger picture. Right now OTS hosts N services across M projects. The goal is to separate system administration of the packages underlying those services from the application development and deployment. So we want to iterate toward a single production management strategy to make that person's job somewhat standardized. Last time this was discussed, the route we were going was to use full machines (by which I mean a machine you have a user account on, as opposed to a cluster that runs various docker instances) with Debian and its package management system for system updates, and ansible for application deploy. To add a unique package management solution through docker for this project alone may not fold into that overall vision, and to move the entire infrastructure to a docker based one may not be feasible for various reasons.

> This is something we would do to make development easier. It wouldn't affect our recommended production deployment method.

I think that maintaining two different deployment mechanisms for a project this small is going to be a challenge. I know that I, for one, will always mirror production in order to be able to find issues that may show up in production. Which is why I use Debian now instead of gentoo for my OTS work. If production deploy != development deploy, then developers will need to target both deployment options for changes, which increases moving parts and testing load.

> Mike's solution has been to stand up an AWS server, which I think is both a perfectly reasonable solution under the circumstances and simultaneously kind of ridiculous: why in the heck would a developer need to activate a remote machine just to work on Torque?

My experience with software is that if your development isn't mirroring production, you're going to eventually run into problems. So it's not at all ridiculous to provision a machine that mirrors the system in production. There's no requirement that that's a remote machine, but it's a quick and easy way to get a debian server up and running if you don't want to run a VM on your local machine, and you don't want to buy new hardware.

And this discussion is all just torque. We haven't even gotten into what it would take to dockerize a competition install.

One (very early in the consuming all of this phase) question / comment:

I understand that a primary argument against containerization is the additional level of abstraction. There is likely going to be some layer of abstraction anyway for the developer / maintainer since (1) developers who aren't running on Debian need to be doing some backflips to access their development machines and (2) even if a new developer does run Debian, this system seems very opinionated around things like software versions / etc and I know I personally have some unease about running Ansible scripts on my personal machine.

In my case I'm going to set up a VM in the meantime (though in many ways that seems like a more bloated container).

Just food for thought!

OpenTechStrategies / torque

Containerizing Torque #37
