
NaaVRE-dev-environment

Integrated development environment for NaaVRE.


Getting started

Run these steps once, when setting up the environment.

Git setup

To integrate the different components of NaaVRE, we use Git submodules:

git clone --recurse-submodules git@github.com:QCDIS/NaaVRE-dev-environment.git

If you get an error:

Cloning into 'NaaVRE-dev-environment'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

then you need to add your ssh key to your GitHub account. Follow the instructions here.

Check out the Git Submodules documentation.
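If you cloned the repository without --recurse-submodules, or the submodule directories are empty, the standard Git commands below fetch them (a generic Git sketch, not specific to this repository):

git submodule update --init --recursive   # clone/check out all submodules at the recorded commits
git submodule status                      # verify each submodule points at a commit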

Conda environment

Install Conda from these instructions: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Set up a new conda environment and install dependencies:

cd NaaVRE-dev-environment
conda env create -n naavre-dev --file environment.yml
conda activate naavre-dev

Pre-commit hooks

To install and enable pre-commit hooks, run:

conda activate naavre-dev
pre-commit install
ggshield auth login
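To check that the hooks work before your first commit, you can optionally run them against the whole repository:

pre-commit run --all-files   # runs all configured hooks once, outside of a commit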

Helm dependencies

During the initial setup, and after updating the VREPaaS-helm-charts submodule, run:

helm dependency build services/vrepaas/submodules/VREPaaS-helm-charts
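To confirm that the chart dependencies were fetched, you can optionally list them:

helm dependency list services/vrepaas/submodules/VREPaaS-helm-charts   # each dependency should show status "ok"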

GitHub repository for building cells

To containerize cells from this dev environment, you need to set up a personal GitHub repository. It will be used to commit the cell code and to build and publish the container images:

  1. Create your repository from the QCDIS/NaaVRE-cells template, and follow instructions from its README file to generate an access token.
  2. Set the values of CELL_GITHUB and CELL_GITHUB_TOKEN in ./services/naavre/helm/values-integration.yaml and ./services/naavre-dev/helm/values-dev.yaml (see the sketch below).
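For illustration only, the entries to fill in might look like the following; treat this as a hypothetical sketch, since the exact nesting and key names inside values-integration.yaml and values-dev.yaml may differ:

CELL_GITHUB: "https://github.com/<your-user>/NaaVRE-cells"              # repository created from the QCDIS/NaaVRE-cells template
CELL_GITHUB_TOKEN: "<access token generated per the template's README>"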

Minikube cluster

The NaaVRE components are deployed by Tilt to a Minikube cluster. There are two options for running Minikube: using a pre-configured NaaVRE-dev-vm, or using a self-managed Minikube cluster.

Using a pre-configured VM (NaaVRE-dev-vm)

If you are provided with a development VM, follow these instructions: Using the VM (for developers).

Using a self-managed Minikube cluster

We use ingress-dns to access the resources deployed on the Minikube cluster. To configure it, start Minikube (minikube start) and follow step 3 of the minikube ingress-dns setup guide, choosing your operating system.

For Linux, pick the configuration matching your distribution's DNS setup. To find out which DNS setup your system uses, run head /etc/resolv.conf:
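For example, on a distribution using systemd-resolved, the output typically looks like this (your values may differ):

head /etc/resolv.conf
# nameserver 127.0.0.53
# options edns0 trust-ad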

Run the dev environment

Run these steps every time you want to start the dev environment.

Start minikube

(Skip this step if you are using a pre-configured VM (NaaVRE-dev-vm).)

minikube start --addons=ingress,ingress-dns
# Optional:
minikube dashboard --url

Nginx Ingress Monitoring

To enable metrics exporting from the ingress controller, patch the ingress-nginx-controller deployment and service with the following commands:

kubectl patch deployment  ingress-nginx-controller -n ingress-nginx --patch-file services/kube-prometheus-stack/patches/ingress-nginx-controller-deployment-patch.yaml
kubectl patch service ingress-nginx-controller  -n ingress-nginx --patch-file services/kube-prometheus-stack/patches/ingress-nginx-controller-service-patch.yaml

These patches are based on the following guide.

To check that metrics are being exported, find the nodePort mapped to port 10254 (one way to look it up is sketched below). If, for example, the nodePort is 30361, you can access the metrics at http://naavre-dev.minikube.test:30361/metrics.
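One way to look up that nodePort, assuming the service patch exposes the metrics port as 10254, is with kubectl (adjust the port number if the patch uses a different one):

# Find the nodePort mapped to the metrics port on the patched service
kubectl -n ingress-nginx get service ingress-nginx-controller \
  -o jsonpath='{.spec.ports[?(@.port==10254)].nodePort}'
# Fetch the metrics, substituting the nodePort returned above
curl http://naavre-dev.minikube.test:30361/metrics

The output should be similar to the following: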

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.971e-05
go_gc_duration_seconds{quantile="0.25"} 2.8325e-05
go_gc_duration_seconds{quantile="0.5"} 5.6258e-05
go_gc_duration_seconds{quantile="0.75"} 0.000102628
go_gc_duration_seconds{quantile="1"} 0.000131488
go_gc_duration_seconds_sum 0.001229649
go_gc_duration_seconds_count 20
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 99
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.20.5"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 6.398712e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 8.6640184e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.473744e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 638149

Start the services needed by NaaVRE

tilt up

This will open the Tilt dashboard in your browser, and deploy the services needed by NaaVRE.

After starting Tilt, we need to configure the connection between Argo and the VREPaaS. To that end, open a terminal and run:

token=$(kubectl get secret vre-api.service-account-token -o jsonpath='{.data.token}' | base64 -d); echo "Bearer $token"

Next, edit services/vrepaas/helm/values.yaml and add the output of the previous command (Bearer ey.....) to global.argo.token.
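The resulting entry in services/vrepaas/helm/values.yaml sits under global.argo.token; a minimal sketch, with all surrounding keys omitted:

global:
  argo:
    token: "Bearer ey....."   # output of the kubectl command above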

After updating the helm values, open the Tilt web interface, wait for the Tiltfile resource to update, and trigger a manual update on vrepaas-vreapi.

Start NaaVRE

There are three options for starting NaaVRE:

Option 1: Run NaaVRE locally

Run the NaaVRE dev server locally (e.g. from a separate clone of the repository).

To that end, follow the instructions from https://github.com/QCDIS/NaaVRE/blob/main/README.md#development, creating the file export_VARS containing:

export API_ENDPOINT="https://naavre-dev.minikube.test/vre-api-test"
export ARGO_WF_SPEC_SERVICEACCOUNT="executor"
export CELL_GITHUB="<the repository you created above>"
export CELL_GITHUB_TOKEN="<the token to access this repo>"
export JUPYTERHUB_SINGLEUSER_APP="jupyter_server.serverapp.ServerApp"
export JUPYTERHUB_USER="user"
export MODULE_MAPPING_URL="https://raw.githubusercontent.com/QCDIS/NaaVRE-conf/main/module_mapping.json"
export NAAVRE_API_TOKEN="token_vreapi"
export PROJ_LIB="/venv/share/proj"
export SEARCH_API_ENDPOINT=""
export SEARCH_API_TOKEN=""
export VLAB_SLUG="n-a-a-vre"
export VRE_API_VERIFY_SSL="false"

(Fill in your values for CELL_GITHUB and CELL_GITHUB_TOKEN.)

This option is recommended when developing NaaVRE Jupyter lab extensions, because it provides the fastest reloading on code changes.

Note: when using this option, NaaVRE should be accessed through the direct link. Launching virtual labs from the VREPaaS UI will not work.

Option 2: Run NaaVRE with Tilt (Jupyter Lab only)

Run a dev image of NaaVRE with Tilt, built from ./services/naavre/submodules/NaaVRE. This option deploys NaaVRE as a standalone Jupyter Lab service.

To that end, run the following command (make sure tilt up is running):

tilt enable n-a-a-vre-dev

This option is recommended when jointly developing NaaVRE and the VREPaaS, if you don’t need to test the integration between NaaVRE and Jupyter Hub or Keycloak.

Note: when using this option, NaaVRE should be accessed through the direct link. Launching virtual labs from the VREPaaS UI will not work.

Option 3: Run NaaVRE with Tilt (Jupyter Hub integration)

Similar to option 2, but NaaVRE is deployed through Jupyter Hub.

To that end, run the following command (make sure tilt up is running):

tilt enable n-a-a-vre-dev hub proxy user-placeholder user-scheduler

This option is recommended to test integration of NaaVRE with Jupyter Hub or Keycloak.

Start extra services (optional)

To test the integration of extra services, run:

tilt enable [n-a-a-vre-dev hub proxy user-placeholder user-scheduler] minio traefik square-root-v3 square-root-v2 kube-prometheus-stack-server kube-prometheus-stack-alertmanager

Resetting the dev environment

To reset the services, exit Tilt and run tilt down. To fully reset the Minikube cluster, run minikube delete.
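For reference, the two reset commands side by side:

tilt down          # reset the services deployed by Tilt
minikube delete    # fully reset the Minikube cluster (removes everything deployed on it)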

Access the services

The dev services are only accessible locally, using the domain name naavre-dev.minikube.test (provided minikube ingress-dns was set up). That allows us to use insecure credentials to log in to the services.

Keycloak

https://naavre-dev.minikube.test/auth/

| Account | Username | Password |
| --- | --- | --- |
| Superuser (master realm) | admin | admin |
| User (vre realm) | user | user |

Argo

https://naavre-dev.minikube.test/argowf/

Log in through Keycloak.

| Account | Token |
| --- | --- |
| vre-api service account | Dynamic |

K8s-secret-creator

https://naavre-dev.minikube.test/k8s-secret-creator/1.0.0/ui/

| Account | Token |
| --- | --- |
| Token authentication | token_ksc |

VREPaaS

UI: https://naavre-dev.minikube.test/vreapp

Log in through Keycloak.

Admin interface: https://naavre-dev.minikube.test/vre-api-test/admin/

| Account | Username | Password | Token |
| --- | --- | --- | --- |
| Administrator | admin | admin | |
| API user | user | user | token_vreapi |

NaaVRE-dev

https://naavre-dev.minikube.test/n-a-a-vre-dev

No authentication.

This version of NaaVRE runs Jupyter Lab alone (i.e. without Jupyter Hub), and updates automatically when the NaaVRE code is changed. It is suited for testing NaaVRE features, but not for testing integration (in that case, see NaaVRE section below).

NaaVRE-integration

https://naavre-dev.minikube.test/n-a-a-vre-integration/

Log in through Keycloak.

This version of NaaVRE is controlled by Jupyter Hub, and is closer to the actual deployed version. However, it will not update automatically.

To show changes to the NaaVRE component in Tilt:

This is necessary because the Jupyter Lab pod is started dynamically by Jupyter Hub, which prevents Tilt from detecting when it should reload it. It is usually not necessary to reload the NaaVRE/hub and proxy resources, even if Tilt reports that they have changes.

Canary example

http://naavre-dev.minikube.test/square-root/

This is a simple example showing how to do canary deployments. To test it, enable both the square-root-v2 and square-root-v3 services:

tilt enable square-root-v3 square-root-v2

To see that 50% of the requests go to each version, open your browser at http://naavre-dev.minikube.test/square-root/4 and send 10 requests (press F5 10 times).

Open the Tilt dashboard and check the logs of the 'canary-example' resources to see that the requests are being distributed between the two versions.

To change the percentage of requests going to each version, edit services/canary-example/canary-example-canary.yaml: in the Ingress, look for the nginx.ingress.kubernetes.io/canary-weight annotation and set it to the desired value (see the sketch below).
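For illustration, the relevant part of the Ingress might look like this (a hypothetical excerpt; the actual layout of services/canary-example/canary-example-canary.yaml may differ):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "50"   # percentage of requests routed to this version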

Grafana & Prometheus

Enable the metrics server:

minikube addons enable metrics-server

Grafana UI: https://naavre-dev.minikube.test/grafana/

Prometheus UI: https://naavre-dev.minikube.test/prometheus/

| Account | Username | Password | Token |
| --- | --- | --- | --- |
| Administrator | admin | prom-operator | |

If you have enabled the Nginx ingress monitoring, you can check the 'ingress-nginx-endpoints' target in the Prometheus dashboard: https://naavre-dev.minikube.test/prometheus/targets

You can also import the Grafana dashboard to monitor the ingress controller.

Flagger

Kubernetes meshProvider

You can enable canary deployments with Flagger. The instructions below are adapted from the guide here.

Open the values.yaml file in the services/flagger/helm folder and make sure that the meshProvider is set to kubernetes:

meshProvider: kubernetes

Install the podinfo application:

helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values.yaml --create-namespace -n test

Create the ServiceMonitor, MetricTemplate and the Canary resources by running:

kubectl apply -f services/podinfo/k8s-provider/

Check the podinfo tag version at https://naavre-dev.minikube.test/podinfo/ (it should be 6.6.3) and the metrics at http://naavre-dev.minikube.test/podinfo/metrics.

If the canary deployment is deployed correctly, you should see three services in the test namespace (one way to list them is sketched below):
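A quick way to check is to list the services in the test namespace:

kubectl get services -n test
# Expect to see podinfo, podinfo-primary and podinfo-canary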

The podinfo service is the main service used to access the podinfo application. The podinfo and podinfo-primary services are backed by the podinfo-primary pod, while the podinfo-canary service is not backed by any pod.

Flagger determines whether the canary deployment is healthy and, if so, promotes it to the primary deployment. To test this, change the image tag:

helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values-update.yaml --create-namespace -n test

To watch the progress of the Canary resources, run:

watch kubectl get canaries --all-namespaces

To run everything in one command:

helm uninstall podinfo -n test ; kubectl delete -f services/podinfo/k8s-provider/ ; sleep 20 ; helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values.yaml --create-namespace -n test && kubectl apply -f services/podinfo/k8s-provider/; sleep 35s ;  helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values-update.yaml --create-namespace -n test

To check the events of the podinfo canary:

kubectl describe canary podinfo -n test

If the canary deployment is successful, you should see events similar to this:

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  7m28s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Normal   Synced  6m58s (x2 over 7m28s)  flagger  all the metrics providers are available!
  Normal   Synced  6m58s                  flagger  Initialization done! podinfo.test
  Normal   Synced  6m28s                  flagger  New revision detected! Scaling up podinfo.test
  Normal   Synced  5m58s                  flagger  Starting canary analysis for podinfo.test
  Normal   Synced  5m58s                  flagger  Advance podinfo.test canary iteration 1/10
  Normal   Synced  5m28s                  flagger  Advance podinfo.test canary iteration 2/10
  Normal   Synced  4m58s                  flagger  Advance podinfo.test canary iteration 3/10
  Normal   Synced  4m28s                  flagger  Advance podinfo.test canary iteration 4/10
  Normal   Synced  3m58s                  flagger  Advance podinfo.test canary iteration 5/10
  Normal   Synced  28s (x6 over 3m28s)    flagger  (combined from similar events): Copying podinfo.test template spec to podinfo-primary.test

Finally, you can check the version of the podinfo application at https://naavre-dev.minikube.test/podinfo/ (it should be 6.7.0).

Nginx meshProvider

Before installing Flagger, make sure you have enabled Nginx monitoring as described in the Nginx Ingress Monitoring section above.

Open the values.yaml file in the services/flagger/helm folder and make sure that the meshProvider is set to nginx:

meshProvider: nginx

Run everything in one command:

helm uninstall podinfo -n test ; kubectl delete -f services/podinfo/nginx-provider/ ; sleep 20 ; helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values.yaml --create-namespace -n test && kubectl apply -f services/podinfo/nginx-provider/; sleep 35; helm upgrade -i podinfo podinfo/podinfo -f services/podinfo/helm/values-update.yaml --create-namespace -n test

If successful, you should see the following events:

Events:
  Type     Reason  Age                   From     Message
  ----     ------  ----                  ----     -------
  Warning  Synced  3m15s                 flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Normal   Synced  3m5s (x2 over 3m15s)  flagger  all the metrics providers are available!
  Normal   Synced  3m5s                  flagger  Initialization done! podinfo.test
  Normal   Synced  2m35s                 flagger  New revision detected! Scaling up podinfo.test
  Normal   Synced  2m25s                 flagger  Starting canary analysis for podinfo.test
  Normal   Synced  2m25s                 flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  2m25s                 flagger  Advance podinfo.test canary weight 5
  Warning  Synced  105s (x4 over 2m15s)  flagger  Halt advancement no values found for nginx metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found
  Normal   Synced  95s                   flagger  Advance podinfo.test canary weight 10
  Normal   Synced  85s                   flagger  Advance podinfo.test canary weight 15
  Normal   Synced  75s                   flagger  Advance podinfo.test canary weight 20
  Normal   Synced  5s (x7 over 65s)      flagger  (combined from similar events): Copying podinfo.test template spec to podinfo-primary.test

Development cycle

The different components of NaaVRE have their own Git repositories, which are included as submodules of the NaaVRE-dev-environment repository. In the context of the dev repo, these submodules are references to a commit in the component repo. When in the root directory of this repo, git commands apply to the NaaVRE-dev-environment repo; when in a submodule directory, they apply to that submodule's repo.

For any development task, follow this cycle:

  1. On GitHub, create an issue in the appropriate component repository

  2. Create a branch linked to the issue (e.g. nnn-my-branch)

  3. Checkout this branch in the submodule:

    cd services/component/submodules/COMPONENT
    git fetch origin
    git checkout nnn-my-branch
  4. Edit code in the submodule while checking the changes with Tilt

    Note: During development, running git status in the NaaVRE-dev-environment root directory will show unstaged changes to the submodule, such as modified: submodule/COMPONENT (untracked content) or (new commits).

  5. Commit and push changes from the submodule directory

  6. On GitHub, create a pull request in the submodule repo

  7. Once it is merged:

    • In the submodule directory, switch back to the main branch and pull the latest changes
    • In the NaaVRE-dev-environment directory, stage and commit the changes to the submodule, as sketched below. An appropriate commit message would be “update COMPONENT ref merging COMPONENT/nnn-my-branch”
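A sketch of step 7 as shell commands, using the same placeholder paths and branch names as the steps above:

cd services/component/submodules/COMPONENT
git checkout main
git pull
cd ../../../..   # back to the NaaVRE-dev-environment root
git add services/component/submodules/COMPONENT
git commit -m "update COMPONENT ref merging COMPONENT/nnn-my-branch"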

Troubleshooting

Context deadline exceeded when pulling NaaVRE image

If you get an error similar to Failed to pull image "qcdis/n-a-a-vre-laserfarm:v2.0-beta": rpc error: code = Unknown desc = context deadline exceeded in the continuous-image-puller logs:

Extra services

Minio

Admin interface: http://127.0.0.1:9001/

| Account | Username | Password | Token |
| --- | --- | --- | --- |
| Administrator | admin | password | |

Velero

Before installing Velero, you need to create an access key and a bucket. To create the access key and bucket, access the Minio UI (http://127.0.0.1:9001/).

Access key

Create an access key with the following id and secret:

aws_access_key_id = minio
aws_secret_access_key = minio123

After creating the key you need to specify its Access Key Policy to allow Velero to access the bucket naavre-dev.minikube.test:

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads"
   ],
   "Resource": [
    "arn:aws:s3:::naavre-dev.minikube.test"
   ]
  },
  {
   "Effect": "Allow",
   "Action": [
    "s3:AbortMultipartUpload",
    "s3:DeleteObject",
    "s3:GetObject",
    "s3:ListMultipartUploadParts",
    "s3:PutObject"
   ],
   "Resource": [
    "arn:aws:s3:::naavre-dev.minikube.test/*"
   ]
  }
 ]
}

Bucket

Create a bucket named naavre-dev.minikube.test and add the following access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "minio"
                ]
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::naavre-dev.minikube.test"
            ]
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "minio"
                ]
            },
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::naavre-dev.minikube.test/*"
            ]
        }
    ]
}

Install Velero

Follow the instructions here to install Velero.

velero install --provider aws --use-node-agent --plugins velero/velero-plugin-for-aws:v1.2.1 \
--bucket naavre-dev.minikube.test --secret-file ./services/velero/credentials-velero --backup-location-config \
region=minio,s3ForcePathStyle="true",s3Url=http://host.minikube.internal:9000
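Optionally, check that the backup storage location created by the install is available:

velero backup-location get   # should report the location as Available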

Backup and restore

To backup a namespace:

velero backup create default-ns-backup --default-volumes-to-fs-backup --include-namespaces default --wait

For Velero to use File System Backup (FSB), you must apply an annotation to every pod that contains volumes to back up. For Keycloak, we have annotated postgresql in helm_config/keycloak/values.yaml:

postgresql:
  enabled: true
  auth:
    postgresPassword: fake_postgres_password
    password: fake_password
  annotations:
    backup.velero.io/backup-volumes: pvc-volume,emptydir-volume

or, for the Jupyter Hub single-user pods:

singleuser:
  extraAnnotations:
    backup.velero.io/backup-volumes: pvc-volume,emptydir-volume
  cmd: ['/usr/local/bin/start-jupyter-venv.sh']

To restore a namespace:

velero restore create --from-backup default-ns-backup

Simulate a disaster

Find the container running Minikube:

docker ps | grep k8s-provider-minikube

Access the container running Minikube:

docker exec -it <container-id> /bin/bash

Delete the Keycloak postgresql data directory:

rm -r /tmp/hostpath-provisioner/default/data-keycloak-postgresql-0/

Go to https://naavre-dev.minikube.test/auth/. If the DB is missing, you won't be able to log in or you will get an error message:

Unexpected Application Error!
Network response was not OK.

NetworkError@https://naavre-dev.minikube.test/auth/resources/9qowb/admin/keycloak.v2/assets/index-d73da1a7.js:67:43535
fetchWithError@https://naavre-dev.minikube.test/auth/resources/9qowb/admin/keycloak.v2/assets/index-d73da1a7.js:67:43710

Restore the namespace:

velero restore create --from-backup default-ns-backup --wait

Try again to log in to Keycloak.