PHACDataHub / sci-portal


Data Science Portal

This repository contains a reference implementation of a web app that provides self-service capabilities for users to provision infrastructure in Google Cloud. The contract specified the use of a GitOps approach initiated from Backstage templates, with changes reconciled using Config Sync. The infrastructure is defined using a combination of Crossplane, Config Connector, and Terraform.

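The reference implementation is split between two repositories: this repository (PHACDataHub/sci-portal) and PHACDataHub/sci-portal-users, which receives the Pull Requests created from Backstage templates.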


Features

Authentication

Users sign in with PHAC's designated Google authentication methods.

[!IMPORTANT] Users must be added to the Backstage Catalog before they can log in. This is a known limitation documented in User Management.

User Self-Service

Users can browse the available templates on the Create... page. Each template prompts the user for their team, administrative details, and cloud resource configuration.

To provision a resource, select a template, fill out the required information, and submit the form. The user is then given a link to the Pull Request that was created.

GitOps Approach

In a GitOps approach, the repository serves as the source of truth. After bootstrapping the cluster to start Config Sync, we use GitOps to define the desired state of the cluster. Config Sync reconciles the current state of the cluster with the desired state in the repositories.

When a user creates a templated deployment from Backstage, Backstage opens a Pull Request in the PHACDataHub/sci-portal-users repository.
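To see which templated deployments are waiting for review, a reviewer could list the open Pull Requests from the command line. This is a minimal sketch, assuming the GitHub CLI (gh) is installed and authenticated with read access to the repository; the function name list_deployment_prs is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list open Pull Requests in the repository that
# receives templated deployments from Backstage.
# Assumes the GitHub CLI (gh) is installed and authenticated.
list_deployment_prs() {
  gh pr list --repo PHACDataHub/sci-portal-users --state open
}

# Usage: list_deployment_prs
```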

Deployment Status

When the Pull Request is merged, each templated deployment instance appears in the Backstage Catalog and can provide helpful links. For example, the RAD Lab Data Science template shows a link to the Managed Notebooks in the Vertex AI Workbench.

Monitoring the current deployment status was not a prioritized feature. The next steps for development are documented in the Extensibility Report.

Deployment Isolation

The templated deployments provision resources in a new isolated GCP project in alignment with the agency’s micro-segmentation security architecture.

Cost Visibility

Users can see the project Cost and % Budget in the Backstage Catalog. These values are updated daily.

Budget Management

Each project is actively monitored for consumption. Budget alert emails are sent when the budget reaches 25%, 50%, 75%, 90%, 95%, and 100%. Over-budget alert emails are sent for each percent between 100% and 120%.
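The alert schedule above can be expressed as a short list of thresholds. This sketch only illustrates the schedule; it is not the code of the budget-alerts Cloud Function, and it assumes that "each percent between 100% and 120%" means 101% through 120% inclusive. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the budget alert schedule:
# 25, 50, 75, 90, 95, 100, then every percent from 101 to 120 (assumed inclusive).
budget_thresholds() {
  printf '%s\n' 25 50 75 90 95 100
  seq 101 120
}

# Print every threshold (in percent) that has been reached for a given spend
# and budget. Integer arithmetic: threshold t is reached when
# spend * 100 >= budget * t.
crossed_thresholds() {
  local spend=$1 budget=$2 t
  for t in $(budget_thresholds); do
    if (( spend * 100 >= budget * t )); then
      echo "$t"
    fi
  done
}

# Usage: crossed_thresholds 1030 1000
# prints 25 50 75 90 95 100 101 102 103, one per line
```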

Budget Reporting

A Looker Studio dashboard has been embedded on the Cost Dashboard page. It offers a flexible starting point that the team can refine to build meaningful FinOps reports that meet their needs. Each project is labeled with the cost centre and display name to support reporting grouped by cost centre.

The billing data is exported daily to BigQuery for additional analysis.
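As an example of additional analysis, the cost centre labels can be used to total spending per cost centre in BigQuery. This is a sketch under stated assumptions: the table name and the label key (cost-centre) are placeholders for whatever the billing export and project labels actually use, the query sums the raw cost column and ignores credits, and the function name cost_by_label_sql is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a BigQuery Standard SQL query that totals cost
# per value of one project label in the billing export table.
# Both arguments are placeholders; use the real export table and label key.
cost_by_label_sql() {
  local table=$1      # e.g. my-project.billing.gcp_billing_export_v1_XXXXXX (placeholder)
  local label_key=$2  # e.g. cost-centre (assumed label key)
  cat <<SQL
SELECT
  (SELECT value FROM UNNEST(project.labels) WHERE key = '${label_key}') AS cost_centre,
  ROUND(SUM(cost), 2) AS total_cost
FROM \`${table}\`
GROUP BY cost_centre
ORDER BY total_cost DESC
SQL
}

# Usage (requires the bq CLI and read access to the billing dataset):
# bq query --use_legacy_sql=false "$(cost_by_label_sql 'my-project.billing.gcp_billing_export_v1_XXXXXX' 'cost-centre')"
```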

Extensibility

The overall solution is extensible. It supports adding Software Templates, displaying custom information and links in the Catalog, adding custom actions to the Catalog, extending the permissions model, and much more. This is documented in the Extensibility Report.

Contributing

Directories

This repository contains the following directories:

| Directory | Description |
| --- | --- |
| .devbox | The Devbox configuration to install gcloud in an isolated shell for development. |
| backstage | Backstage, including custom plugins and template definitions. |
| bootstrap | The scripts and infrastructure definitions to deploy Google Kubernetes Engine (GKE), Crossplane, and the Crossplane providers. |
| budget-alerts | A Cloud Function that sends budget alert emails with GC Notify. |
| root-sync | Kubernetes manifests and Kustomizations reconciled by Config Sync. |
| taskfiles | Task definitions used by Task. |
| templates | Terraform modules managed and modified by the Data Science Portal team. |
| tests | chainsaw tests that verify the Crossplane Compositions create the expected Kubernetes resources. |

Environments

At this time there is only one non-production environment. The reference implementation is not intended for production. Some concerns for moving to production have been documented in the Extensibility Report.

We encourage creating at least one more non-production environment to develop and maintain the system with confidence.

Initial Setup (Bootstrapping)

The cluster must be deployed with Config Sync for GitOps, Crossplane for the control plane, and additional infrastructure before we can build and run Backstage. This is only required the first time the cluster starts. The process is documented in bootstrap/README.md.

Prerequisites

Taskfile

Install Task by following its documentation. To install it globally with Yarn, run:

yarn global add @go-task/cli

To verify the installation and list the available tasks, run:

task --list

Node.js

This project uses more than one version of Node.js (v18 and v20). We recommend using a tool that manages multiple versions of Node.js like nvm (Node Version Manager) or NVM for Windows.

To use the version of Node.js defined in the .nvmrc file, run:

nvm use

Yarn v1 (Classic)

Backstage uses Yarn v1. It can be activated with Corepack:

corepack enable
corepack prepare yarn@1.22.22 --activate

or installed globally with npm:

npm install --global yarn@1.22.22

Verify the installation by checking the Yarn version:

yarn -v

Guidelines

Regions

Deploy GCP infrastructure in Canadian regions. Use northamerica-northeast1 (Montreal) or northamerica-northeast2 (Toronto).
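A script that creates infrastructure could enforce this guideline before doing any work. This is a minimal sketch of such a pre-flight check; the function name require_canadian_region and the REGION variable in the usage line are hypothetical, and the check accepts region names only (not zones such as northamerica-northeast1-a).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: accept only the two Canadian regions named
# in the guideline and fail for anything else.
require_canadian_region() {
  case "$1" in
    northamerica-northeast1|northamerica-northeast2)
      return 0
      ;;
    *)
      echo "error: region '$1' is not a Canadian region" >&2
      return 1
      ;;
  esac
}

# Usage: fail fast before running terraform or gcloud commands.
# require_canadian_region "${REGION:-northamerica-northeast1}" || exit 1
```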

Crossplane Providers

Follow these principles to define infrastructure using Config Connector or Terraform as managed resources in your Crossplane compositions:

If Terraform modules need to be modified, copy them to the templates/ directory. The RAD Lab Data Science and Gen AI modules have been copied and modified there.

Versioning

We recommend the following approach to modifying CompositeResourceDefinitions and Compositions that are used in production:

This approach enables the team to test changes to manifests with confidence, then progressively roll out and upgrade the remaining resources.

Test The Infrastructure

Define infrastructure with confidence using tests. There are corresponding chainsaw tests in the tests/templates/ directory that can be used to apply manifests and assert the expected cluster resources are provisioned.
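A minimal sketch of running those tests is shown below. It assumes the chainsaw CLI is installed and that the current kubectl context points at a non-production cluster with Crossplane and the Compositions installed. The function name run_template_tests is hypothetical, and the --test-dir flag should be confirmed against chainsaw test --help for the installed version.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the chainsaw tests against the cluster in the
# current kubectl context. The default directory matches tests/templates/.
run_template_tests() {
  local dir=${1:-tests/templates/}
  if ! command -v chainsaw >/dev/null 2>&1; then
    echo "error: chainsaw is not installed" >&2
    return 1
  fi
  # Print the target context so tests are not run against the wrong cluster.
  echo "Running chainsaw tests in ${dir} (context: $(kubectl config current-context))"
  chainsaw test --test-dir "$dir"
}

# Usage: run_template_tests
```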

Keeping Tools Updated

Backstage

Follow the documentation to use the backstage-cli to update Backstage.

The Backstage installation was created from a template and then modified. To keep up to date with changes to the template, follow the documentation and use the Backstage Upgrade Helper.

Config Sync

Review the Release Notes and documentation to upgrade Config Sync.

Crossplane

Review the release notes on docs.crossplane.io or GitHub, and follow the documentation to upgrade Crossplane.

Crossplane Provider for Kubernetes

Review the release notes and README to manually upgrade provider-kubernetes in root-sync/base/crossplane/project/kubernetes.yaml.
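Before and after editing a provider manifest, it helps to confirm which package version the cluster is actually running. This sketch assumes kubectl is pointed at the cluster; the function name list_provider_packages is hypothetical. The same check applies to provider-terraform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the installed Crossplane providers and the
# package reference (including the version tag) each one uses.
list_provider_packages() {
  kubectl get providers.pkg.crossplane.io \
    -o custom-columns=NAME:.metadata.name,PACKAGE:.spec.package
}

# Usage: list_provider_packages
```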

Crossplane Provider for Terraform

Review the release notes and documentation to manually upgrade provider-terraform in bootstrap/crossplane/templates/terrafrom/provider.yaml.

[!WARNING]
The RAD Lab Terraform modules require the gcloud CLI tool to be available where terraform is run. We configure provider-terraform to use a custom runtime image defined in bootstrap/crossplane/templates/terrafrom/build.

Update PROVIDER_TERRAFORM_VERSION in the Dockerfile.
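The version bump can be scripted. This sketch assumes the version is declared as an ARG PROVIDER_TERRAFORM_VERSION=... line and that the file is named Dockerfile inside the build directory; check the Dockerfile first and edit by hand if it is declared differently. The function name set_dockerfile_arg is hypothetical, and the same helper could update CLOUD_SDK_VERSION if it is declared the same way.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set "ARG NAME=value" in a Dockerfile.
# Assumes the argument is declared on its own line as ARG NAME=...
set_dockerfile_arg() {
  local file=$1 name=$2 value=$3
  if ! grep -q "^ARG ${name}=" "$file"; then
    echo "error: no 'ARG ${name}=' line in ${file}" >&2
    return 1
  fi
  # Write to a temporary file and move it into place, which avoids the
  # differences between GNU and BSD sed -i.
  sed "s|^ARG ${name}=.*|ARG ${name}=${value}|" "$file" > "${file}.tmp" \
    && mv "${file}.tmp" "$file"
}

# Usage (the Dockerfile name and version are placeholders):
# set_dockerfile_arg bootstrap/crossplane/templates/terrafrom/build/Dockerfile PROVIDER_TERRAFORM_VERSION vX.Y.Z
```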

gcloud

[!WARNING]
The RAD Lab Terraform modules require the gcloud CLI tool to be available where terraform is run. We configure provider-terraform to use a custom runtime image defined in bootstrap/crossplane/templates/terrafrom/build.

Review the release notes and update CLOUD_SDK_VERSION in the Dockerfile.

Config Connector

Review the release notes and documentation to upgrade Config Connector.

Troubleshooting

Config Sync

If the Config Sync status appears stuck in the Google Cloud console, check the root-reconciler Pod logs in the config-management-system namespace.
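A minimal sketch of that check with kubectl, assuming the default RootSync whose reconciler Deployment is named root-reconciler. The function name config_sync_logs is hypothetical; other containers in the Pod (for example git-sync) can be inspected by changing the -c argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: inspect the root reconciler when the sync status
# appears stuck. Assumes kubectl is pointed at the cluster.
config_sync_logs() {
  local ns=config-management-system
  # Show the Pods in the namespace and whether they are running or restarting.
  kubectl get pods -n "$ns"
  # Print the most recent log lines from the reconciler container.
  kubectl logs -n "$ns" deployment/root-reconciler -c reconciler --tail=200
}

# Usage: config_sync_logs
```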

Crossplane

These references will help troubleshoot Crossplane:

Terraform Provider for Crossplane

To view the logs when the Terraform Provider runs plan and apply:
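One way to follow the provider's output is to tail its Pod logs. This sketch assumes Crossplane is installed in the crossplane-system namespace and that the provider Pod name contains "terraform"; adjust both to match the cluster. The function name terraform_provider_logs is hypothetical, and the plan and apply details may only appear when the provider runs with debug logging enabled (for example a --debug argument in its runtime configuration).

```shell
#!/usr/bin/env bash
# Hypothetical helper: follow the logs of the first Pod whose name contains
# "terraform" in the given namespace (default: crossplane-system).
terraform_provider_logs() {
  local ns=${1:-crossplane-system} pod
  pod=$(kubectl get pods -n "$ns" -o name | grep terraform | head -n 1)
  if [ -z "$pod" ]; then
    echo "error: no Terraform provider Pod found in ${ns}" >&2
    return 1
  fi
  kubectl logs -n "$ns" "$pod" --follow
}

# Usage: terraform_provider_logs
```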

Kubernetes Provider for Crossplane

The resources in a Crossplane Composition must be managed resources. Crossplane will not create plain Kubernetes resources directly; to create one, wrap it in an Object managed resource provided by provider-kubernetes.
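The sketch below prints an Object that wraps a plain ConfigMap, showing where the Kubernetes manifest sits (spec.forProvider.manifest). In a Composition, the same Object structure would go in the resource definition. The function name object_manifest, the ConfigMap contents, and the ProviderConfig name kubernetes-provider are hypothetical, and older provider-kubernetes releases serve this kind as kubernetes.crossplane.io/v1alpha1 rather than v1alpha2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an Object managed resource that wraps a ConfigMap.
# The ProviderConfig name and the apiVersion are assumptions; match them to
# the installed provider-kubernetes.
object_manifest() {
  local name=$1 namespace=$2
  cat <<YAML
apiVersion: kubernetes.crossplane.io/v1alpha2
kind: Object
metadata:
  name: ${name}
spec:
  forProvider:
    manifest:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ${name}
        namespace: ${namespace}
      data:
        example: "value"
  providerConfigRef:
    name: kubernetes-provider
YAML
}

# Usage: object_manifest example-config default | kubectl apply -f -
```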