BCDevOps / OpenShift4-RollOut

This is the primary board for all activities related to the roll out of OpenShift 4

Image Management Plan #362

Closed jleach closed 3 years ago

jleach commented 4 years ago

The Problem

As part of the OCP Security Project, the Platform Services (PS) team recognized that many of the images in the openshift namespace are not actively maintained. This is problematic because they are used by other teams, leading to the propagation of security vulnerabilities; this will cause widespread security issues that will not be manageable once Aqua is put into play.

The Solution

To greatly improve our security posture without making our customers' lives overly complicated or negatively impacting the platform's usability, we will need to think about image management differently. Rather than one vast dumping ground for images, curated or not, we’ll use the following:

1. Sample Images

The sample images shipped in the openshift namespace are a convenient way to spin up an image for testing or for actual production use, and because of teams’ history with OCP 3.11, many deployment manifests will be configured to pull images from this namespace. Going forward, the openshift namespace will be highly curated. This will help dissuade teams from implementing bad practices and streamline the image set (separate the wheat from the chaff).

The strategy above will be maintained on all clusters.

2. Bespoke Images

When teams “run a build” the output is an image, stored in an image stream, that can be run on the cluster. Typically, teams will store images in their tools namespace and tag them with dev, test, or prod to trigger a deployment as per a BuildConfig (a sketch of this pattern follows below). While there are variations on this strategy, the main takeaway is that teams will be encouraged to store images on the cluster and within their own namespaces for use.
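A minimal sketch of that pattern, assuming a hypothetical app called `myapp` in an equally hypothetical `abc123-tools` namespace (the names and Git URL are placeholders, not a prescribed setup):

```yaml
# Hypothetical BuildConfig in a team's *-tools namespace; the build
# output lands in an image stream tag within the same namespace.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: myapp
  namespace: abc123-tools
spec:
  source:
    git:
      uri: https://github.com/bcgov/myapp.git   # placeholder repo
  strategy:
    dockerStrategy: {}
  output:
    to:
      kind: ImageStreamTag
      name: myapp:latest
# Promotion is then a matter of re-tagging myapp:latest as myapp:dev,
# myapp:test, or myapp:prod to trigger the corresponding deployment.
```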

When Artifactory becomes available, we will require all images run in a *-prod namespace to originate from Artifactory, because a centralized image repository is essential for multi-cluster deployments.

The workflow will be:

Open Questions:

3. Curated Images

Platform services will create a bcgov repository on all clusters where images curated by the PS team will be stored. These will include, for example, any builder images or common services like the container backup images or our preferred configuration for Patroni. The images will be built in a single location (lab cluster) and propagated to the same image repository on all other clusters.

There will be a few conventions for all images in this namespace including:

Community contributions will be welcome here provided they meet the same guidelines and are actively supported. It will be important for community contributions to have a PS champion because community team members come and go. The PS team member will need to be able to rebuild a community around abandoned images or make the decision to deprecate and remove them.

Image Security

Aqua will be used to scan images and enforce security policies set by the PS team in consultation with other CITZ security teams. It would be too resource-intensive for Aqua to scan all images periodically, so to manage security and resource usage the following process will be implemented:

UPDATE as of Oct 22, 2021: We have two builds that are kept up to date and are available in all clusters for whoever wants to use them:

- patroni postgres 12.4
- mongodb 3.6

Both builds come from the assessment-app repo: https://github.com/bcgov/AppAssessment/tree/main/build

Depending on the level of interest, we could look at providing a newer version of mongodb. These builds are rebuilt automatically once a month, so they are kept up to date.

In each cluster, they are available as:

image-registry.openshift-image-registry.svc:5000/bcgov/patroni-postgres:12.4-latest
image-registry.openshift-image-registry.svc:5000/bcgov/mongodb-36-ha:1

The images are built and copied to our team’s Docker Hub account, whence CCM copies them to each cluster’s bcgov namespace. We can recommend that teams use these images instead of maintaining their own, if they’re okay with an image that is updated automatically (good for security).
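If teams take that recommendation, consuming the shared image is just a matter of pointing a workload at the bcgov namespace in the internal registry. A minimal sketch, with names and sizing invented for illustration:

```yaml
# Hypothetical StatefulSet snippet pulling the shared Patroni image
# from the cluster-local bcgov namespace rather than a team-built copy.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: patroni
spec:
  serviceName: patroni
  selector:
    matchLabels:
      app: patroni
  template:
    metadata:
      labels:
        app: patroni
    spec:
      containers:
        - name: patroni
          image: image-registry.openshift-image-registry.svc:5000/bcgov/patroni-postgres:12.4-latest
          # The tag is rebuilt monthly upstream, so restarted pods pick
          # up patched images without the team maintaining a build.
```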

jleach commented 4 years ago

Notes from Meeting on 2020-09-11

- cli
- cli-artifacts
- installer
- installer-artifacts
- must-gather
- oauth-proxy
- tests
- tools

jleach commented 4 years ago

Notes from Meeting on 2020-09-18
Attendees: Cailey, Jason, Jeff, Justin, Steven.

Action Items:

jleach commented 3 years ago

Meeting minutes from today's meeting w/ Olena, Justin, Cailey, and Jason.

Agenda

Minutes

Action Items

jefkel commented 3 years ago

When the rate limiting was first announced, a decision was made that the only "support" work the platform services team was responsible for was creating education documentation for teams on how to configure an individual docker credential to be used in a namespace. This is covered by the documentation efforts above, as well as by having platform services namespaces use a private credential where needed (eating our own dog food; it looks like a shared credential for platform needs has been identified above).
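For reference, a minimal sketch of that per-namespace credential, with placeholder secret and account values (the education documentation would be the authoritative version):

```yaml
# Hypothetical Docker Hub pull secret; the auth value is a placeholder.
apiVersion: v1
kind: Secret
metadata:
  name: dockerhub-pull
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {"auths": {"https://index.docker.io/v1/": {"auth": "<base64 of username:token>"}}}
---
# Attaching the secret to the default service account makes workloads
# in the namespace use it for image pulls.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
imagePullSecrets:
  - name: dockerhub-pull
```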

Artifactory future caching access

jleach commented 3 years ago

@jefkel Can't answer all questions but:

how does the docker rate limit affect our plans to use artifactory as a caching pull-through?

I think the plan didn't change.

What changes to Artifactory are needed re: caching services to assist? (image storage allocation, cached image expiry, etc?)

Not much has changed w/ Artifactory per se. I think we touched on revisiting this solution when Artifactory goes HA / enterprise ready.

Can we start gathering metrics on how well the caching will reduce the dockerhub pulls now? (what do we need to gather?)

Gathering those metrics would probably cost more than what we'll pay for Docker Hub.

What service paperwork needs to be done? (STRA? RBAC Access models? Service definition/limitations/etc?)

Good point. We're using DH now, so this question is applicable regardless of whether we pay or not.

Do we need a paid account? (200 unique image pulls every 6 hours "might" work?)

We touched on this point. I think the consensus was that $25 USD/mo is insignificant for an enterprise solution, so fretting about "might" wasn't worth it. Peace of mind. It's month-to-month, so we can see if they provide any metrics to answer the question.

What specific services are to be available through the managed access? (paid accounts have a lot more services than just pulling) (if just pulling through Artifactory, does the "pro" account fit?)

We identified two key features. The first was erasing the limits on pulls for when we go GA; going GA will up the load, and the free account backing Artifactory might become an issue. Also, it's a bit hokey for staff to create faux user accounts for critical infrastructure; if we had to reboot all nodes (like we've done a few times in the last year) then each one will pull a minimum of three images (Aporeto, Aqua, Sysdig). With 30 nodes we'd use 90 of our 200 pulls, and then we have multiple clusters; it becomes a headache. The second was a place to store private infra images until Artifactory is HA.

jefkel commented 3 years ago

Thanks Jason, was just trying to make sure the questions were out there (not requiring answers all at once). The key message I was trying to get across is that this doesn't have to be a "fix immediately!" issue. We have a perfectly acceptable solution, and all the extra work can be done via the Artifactory service design work without rushing anything.

re: use of DockerHub now (anonymous or bring your own account) is just consumption of a cloud service. The platform team is NOT providing/gating the service, and each team is responsible for its own use/consumption of the cloud service.

In future (use of a shared platform account for hosted applications), the platform team is providing access to a cloud service and takes on additional responsibility in provisioning that service; yes, paid or unpaid, although paid usually moves toward having additional requirements forced on the platform by gov policy.

re: a place to push/store private infra images: overkill? The team already has multiple registries that can hold infra images while the Artifactory service is developed. I don't think pushing images to DockerHub was ever part of the platform service description either... (this usage should trigger additional STRA and other paperwork requirements)

jleach commented 3 years ago

@jefkel yes, mostly. We now have an account so the nodes can pull when rebooted. We'll probably need something in place for the pull-through cache just before we go GA on Dec 2, but nothing is critical ATM.

now (anonymous or bring your own account) is just consumption of a cloud service. The platform team is NOT providing/gating the service, and each team is responsible for its own use/consumption of the cloud service.

Yup. And I think this will be the case in the future also. But if we're offering Artifactory as a pull-through service, we want to make sure it doesn't hit the wall. Other than that, paying for Docker Hub is just an us thing. It's not meant for anyone outside of PS.

re: place to push/store private infra images - overkill?

Maybe? I don't do much at this level so it might be better to ask JP where he's keeping manufactured infra images.

StevenBarre commented 3 years ago

For infra stuff, we can just put them in the local registry https://docs.openshift.com/container-platform/4.5/openshift_images/image-streams-manage.html#images-imagestreams-import_image-streams-managing

Then change Aporeto, Sysdig, etc to just pull from the local repo.
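A sketch of that approach, assuming sysdig/agent as the upstream image (the name, tag, and import schedule are illustrative, not the team's actual config):

```yaml
# Hypothetical image stream that mirrors an upstream image into the
# cluster's internal registry so infra workloads can pull locally.
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: sysdig-agent
  namespace: bcgov
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: docker.io/sysdig/agent:latest   # placeholder upstream
      importPolicy:
        scheduled: true      # periodically re-import to stay current
      referencePolicy:
        type: Local          # serve pulls from the internal registry
```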

jleach commented 3 years ago

@sbarre-esit What if the cluster's down? Didn't we kill it off last year with a rogue Ansible playbook?

StevenBarre commented 3 years ago

Ugh, why you gotta remind me of my greatest embarrassment 🤦

If we are using imagePullPolicy: IfNotPresent then most of the time the image will be cached locally on nodes when they come back up. Once the registry is restored then any straggler nodes can pull their image and finish starting up.
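For reference, that policy is just a per-container setting; a minimal sketch with placeholder names:

```yaml
# Hypothetical pod spec: with IfNotPresent, a node that still has the
# image cached can restart the pod even while the registry is down.
apiVersion: v1
kind: Pod
metadata:
  name: infra-agent
spec:
  containers:
    - name: agent
      image: image-registry.openshift-image-registry.svc:5000/bcgov/infra-agent:latest
      imagePullPolicy: IfNotPresent
```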

But that is a good argument for just using an off-cluster registry for critical things. https://quay.io/plans/ is another option to Docker Hub. That's what hosts a lot of the OCP images.