jleach commented 4 years ago

The Problem

As part of the OCP Security Project the Platform Services (PS) team recognized that many of the images in the openshift namespace are not actively maintained. This is problematic because they are used by other teams leading to the propagation of security vulnerabilities; this will cause widespread security issues that will not be manageable when Aqua is put in play.

The Solution

Too greatly improve our security posture while not making our customers lives overly complicated or negatively impacting the platforms usability will require thinking about image management differently. Rather than one vast dumping ground for images, curated or not, we’ll use the following:

1. Sample Images

The sample images shipped in the openshift namespace are a convenient way to spin up an image for testing or for actual production use and because of teams’ history with OCP3.11; many deployment manifests will be configured to pull images from this namespace. Going forward, the openshift namespace will be highly curated. This will help disuade teams for implementing bad practices and stream line images (separate the wheat form the chaff).

The openshift namespace will start empty, and contain no images;
Once we determine that an image should be added to this namespace it will be manually added;
Once manually added it will be automatically maintained by OCP so that updates are automatically imported.

The strategy above will be maintained on all clusters.

2. Bespoke Images

When teams “run a build” the output is an image, stored in an image stream, that can be run on the cluster. Typically teams will store images in their tools namespace and tag them with dev, test, or prod to trigger a deployment as per a BuildConfig. While there are variations on this strategy, the main take away is that teams will be encouraged to store images on the cluster and within their namespaces for use.

When Artifactory becomes available we require all images run in a *-prod to originate from Artifactory because a centralized image repos is essential for multi-cluster deployments.

The workflow will be:

Teams will build images in their *-tools namespace;
Teams will manage how images are promoted (copy to dev, test, prod or just reference tools);
Without Artifactory, teams will need to propagate these image to other clusters.
With Artifactory, teams will need to propagate images to Artifactory for production deployments.

Open Questions:

Can we configure other clusters to import teams images every n-minutes?

3. Curated Images

Platform services will create a bcgov repository on all clusters where images curated by the PS team will be stored. These will include, for example, any builder images or common services like the container backup images or our preferred configuration for Patroni. The images will be built in a single location (lab cluster) and propagated to the same image repository on all other clusters.

There will be a few conventions for all images in this namespace including:

All images need to have a primary PS contact / owner;
All images must be periodically rebuilt as part of a scheduled CI/CD pipeline;
Images follow a semantic versioning guideline.

Community contributions will be welcome here providing they meet the same guidelines and are actively supported. It will be important for community contributions to have a PS champion because community team members come and go. The PS team member will need to be able to rebuild community around abandon images or make the decision to deprecate the image and remove the image.

Image Security

Aqua will be used to scan images and enforce security policies set by the PS team in consultation with other CITZ security teams. It will be too resource intensive for Aqua to scan all images periodically. To manage security and resource usage the following process will be implemented:

Teams will be given the ability to scan images produced as part of their build;
Aqua will scan all images run in *-prod namespaces and block new images with excessive security vulnerabilities from being run.

UPDATE as of Oct 22, 2021: We have two builds that are kept up to date and are available in all clusters for whoever wants to use them:

patroni postgres 12.4 mongodb 3.6 assessment-app from https://github.com/bcgov/AppAssessment/tree/main/build

Depending on the level of interest, we could look at providing a newer version of mongodb. These builds get automatically built once a month and so are kept up to date.

In each cluster, they are available as:

image-registry.openshift-image-registry.svc:5000/bcgov/patroni-postgres:12.4-latest image-registry.openshift-image-registry.svc:5000/bcgov/mongodb-36-ha:1

The images are built and copied to our team’s Docker Hub account, whence CCM copies them to each cluster’s bcgov namespace. We can recommend that teams use these images instead of maintaining their own, if they’re okay with an image that is updated automatically (good for security).

jleach commented 4 years ago

Notes from Meeting on 2020-09-11

Need to figure out how to integrate CICD with Aqua; no Aqua on Day 1.
Image versioning sample: patroni-postgres-96:v1.1; We will, in general, follow RHs name and version format (ie https://catalog.redhat.com/software/containers/rhel8/postgresql-96/5ba0ad1f5a134643ef2eeb9d).
Jeff / Cailey: Req. meeting to hammer out life custom image life cycle:
- How long will we keep a version around? Clecio suggests max 3.
- How do we engage the community to take over image management if they want additional features such as update to patronie. We'll maintain CICD pipeline and distribution.
Should not advertise that we have imagestreamtag sync enabled. We may require teams to use Artifactory for pulling in external images.
Need to look into how supporting images are used for builds (ie S2I) so the cluster can build if the external images are unavailable. (ref. Image: registry.redhat.io/openshift3/ose-docker-builder:v3.11.232).
Images in openshift namespace required for the cluster:

cli cli-artifacts installer installer-artifacts must-gather oauth-proxy tests tools

Do we need to do anything for Jenkins to work?
We would like to optimize templates so that they play nice on the cluster.
Remove Jenkins from catalog and make ours avail from bcgov which may or may not be changed; the template will be for sure.
Our existing pipeline: https://jenkins-bcgov-tools.pathfinder.gov.bc.ca/

jleach commented 4 years ago

Notes from Meeting on 2020-09-18 Cailey, Jason, Jeff, Justin, Steven.

Add why use ephemeral or pers storage - why and when;
Existing Jenkins will be OK for teams to use;
Look into randomized UUID Q asked in RC. Need a FAQ to put common asked questions. <-- OCP4;
Making a commonly used image saves many teams having to do the same work. P.S. should keep this to a minimum;
- - Document release cadence(sp?);
- - They need to accommodate our release schedule. We will deprecate images that teams may not be happy with.
- - Telegraph we're just building an image. We don't support deployments of a particular image.
- - Message needs to be clear WRT expectations. Teams can use our template/Dockerfile if they don't like it.
Why not remove all templates so that if teams don't use poorly configured templates (i.e should not use Postgres but should be using Patroni). The templates don't have very good configuration and we don't want teams to use oc new-app because that's not production ready (will cause more problems than it's worth).
- - Remove them for early access. Will let us figure out the developer experience in an agile way. Then, based on feedback we will add them in.
On-boarding we need to be more discipline at pushing back teams to solve issues. <-- OCP4.
We won't have Techton (?sp?) on OCP4; RH does not have a new OCP4 pipeline avail to them. Will be in GA later, we'll probably offer it once in GA, teams can go test it on a non-gov cluster if they want to learn it.
Jenkins (an off-cluster CICD like GitHub Actions) are available for Day 1.
Argo Workflow / Argo Actions will be available; we need to decide if we will make this available to everyone as a supported service. Need to be clear about SLA.
Patroni a good first candidate for an image because we want teams to have HA db; important on OCP4 because of the way the cluster is upgraded (no control on how nodes drain).
Backup images another good candidate because teams wont' backup if its "hard".

Action Items:

Jason to setup initial image CICD for Caddy S2I. Others will comment on it and when happy it we'll create a Patroni and Backup container image.
Jason to update ticket indicating all templates to be removed from catalog. We'll start with Zero, add in Jenkins (custom template form us) and if teams request additional templates add them in, again with sane defaults.

jleach commented 3 years ago

Meeting minutes from todays meeting w/ Olena, Justin, Cailey, and Jason.

Check In
Summary
Agenda

Agenda

Will the platform restart if it crashed right now?
Who owns this problem?
Can we go enterprise?

Minutes

Justin / Jeff talked about (in a previous meeting today): a) Using our own team credentials; b) Copying our images to Artifactory private repo.
Down side to (b) Artifactory not at enterprise level.
Private repos will be a little while because number one priority is to have a complete pipeline (we do not ATM).
Private repo blockers: Operator need validation; Have not looked at quotas; We need to decide about the service offering.
We have a platform services ops repo at the moment. This will help ensure the platform will always recover.
Artifactory runs on the platform so if the platform is down, artifactory is down.
Artifactory is in klab (test instance);
Cailey and Afan (Arctiq) already both working on pipeline. Advise keeping both focused ATM once design work is done (this sprint).
Call the org on DockerHub bcgov, if not then use bcdevops;
Need a naming convention for accounts: -- bcdevops_artifactory (Artifactory Pull Through Cache) -- bcdevops_cluster (Added to each ops namespace as part of CCM) -- bcdevops_ps (Shared with the PS team as needed) -- bcdevops_admin (Owns w/ Full Access)

Action Items

[ ] Justin / William to use private credentials on all nodes to ensure all our images will pull as needed;
[x] Jason to write a blurb justifying why buy an org for Docker Hub;
[x] Olena to get approval for a $25/mo;

jefkel commented 3 years ago

When the rate limiting was first announced, a decision that the only "support" work the platform services team was responsible for was to create education documentation for teams on how to configure an individual docker credential to be used in a namespace. This is covered by the documentation efforts above, as well as the (eat your own dogfood) having platform services namespaces use a private credential where needed (looks like a shared credential for platform needs has been identified above)

Artifactory future caching access

how does the docker rate limit affect our plans to use artifactory as a caching pull-through?
What changes to Artifactory are needed re: caching services to assist? (image storage allocation, cached image expiry, etc?)
Can we start gathering metrics on how well the caching will reduce the dockerhub pulls now? (what do we need to gather?)
If we are providing authenticated access through a managed account:
- What service paperwork needs to be done? (STRA? RBAC Access models? Service definition/limitations/etc?)
- Do we need a paid account? (200 unique image pulls every 6 hours "might" work?)
- What specific services are to be available through the managed access? (paid accounts have a lot more services than just pulling) (if just pulling through artifactory, does the "pro" account fit?)

jleach commented 3 years ago

@jefkel Can't answer all questions but:

how does the docker rate limit affect our plans to use artifactory as a caching pull-through?

I think the plan didn't change.

What changes to Artifactory are needed re: caching services to assist? (image storage allocation, cached image expiry, etc?)

Not much has changed w/ Artifactory per se. I think we touched on revisiting this solution when Artifactory goes HA / enterprise ready.

Can we start gathering metrics on how well the caching will reduce the dockerhub pulls now? (what do we need to gather?)

Probably cost more than what we'll pay for Docker Hub.

What service paperwork needs to be done? (STRA? RBAC Access models? Service definition/limitations/etc?)

Good point. We're using DH now so this question is applicable regardless of if we pay or not.

Do we need a paid account? (200 unique image pulls every 6 hours "might" work?)

We touched on this point. I think the consensus was $25USD/mo was insignificant for an enterprise solution so fretting about "might" wasn't worth it. Peace of mind. Its month-2-month so we can see if they provide any metrics to answer the question.

What specific services are to be available through the managed access? (paid accounts have a lot more services than just pulling) (if just pulling through Artifactory, does the "pro" account fit?)

We identified two key features: First, was the erasing limits on pulls for when we go GA. When we go GA will up the load on the free Artifactory account might be an issue. Also, its a bit hokey for staff to create faux user accounts critical infrastructure ; if we had to reboot all nodes (like we've done a few times in the least year) then each one will pull a minimum of three (Apo, Aqua, Sisdig) images. With 30 nodes we'd use 90 of our 200 pulls then we have multiple clusters. Becomes a headache. The second was a place to store private infra images until Artifactory is HA.

jefkel commented 3 years ago

Thanks Jason, was just trying to make sure the question's were out there (not requiring answers all at once). The key message I was trying to get out there is that this doesn't have to be a "fix immediately!" issue. We have a perfectly acceptable solution, and all the extra work can be done via the Artifactory Service design work without rushing anything.

re: use of DockerHub now (anonymous or bring your own account) is just consumption of a cloud service. The platform team is NOT providing/gating the service, and each team is responsible for it's own use/consumption of the cloud service.

future (use a shared platform account for hosted applications) is the platform team providing access to a cloud service (the platform team takes on additional responsibility in the provisioning of that service) - yes, paid or unpaid.. although paid usually moves toward having additional requirements forced on the platform from the gov policy.

re: place to push/store private infra images - overkill? The team already has multiple registries that can hold infra images while the artifactory service is developed. I don't think pushing images to dockerhub was ever part of the platform service description either... (this usage should trigger additional STRA and other paperwork requirements)

jleach commented 3 years ago

@jefkel yes, mostly. We now have an account so the nodes can pull when rebooted. Probably need something in place just before we go into GA on Dec 2 for the pull through cache. But nothing critical ATM.

now (anonymous or bring your own account) is just consumption of a cloud service. The platform team is NOT providing/gating the service, and each team is responsible for it's own use/consumption of the cloud service.

Yup. And I think this will be in the future also. But, if we're offering Artifactory as a pull through service we want to make sure it doesn't hit the wall. Other than that, paying for Docker Hub is just an us thing. Its not meant for anyone outside of P.S.

re: place to push/store private infra images - overkill?

Maybe? I don't do much at this level so it might be better to ask JP where he's keeping manufactured infra images.

StevenBarre commented 3 years ago

For infra stuff, we can just put them in the local registry https://docs.openshift.com/container-platform/4.5/openshift_images/image-streams-manage.html#images-imagestreams-import_image-streams-managing

Then change Aporeto, Sysdig, etc to just pull from the local repo.

jleach commented 3 years ago

@sbarre-esit What if the clusters down? Didn’t we kill it off last year with a rogue Ansible play book?

StevenBarre commented 3 years ago

Ugh, why you gotta remind me of my greatest embarrassment 🤦

If we are using imagePullPolicy: IfNotPresent then most of the time the image will be cached locally on nodes when they come back up. Once the registry is restored then any straggler nodes can pull their image and finish starting up.

But that is a good argument for just using off-cluster registry for critical things. https://quay.io/plans/ is another option to dockerhub. That's what hosts a lot of the OCP images.

BCDevOps / OpenShift4-RollOut

Image Management Plan #362

The Problem

The Solution

1. Sample Images

2. Bespoke Images

3. Curated Images

Image Security

Agenda

Minutes

Action Items

Artifactory future caching access