StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Notebook Security Scanning: Investigate cleaning remote repository cache #978

Closed. Jose-Matsuda closed this issue 2 years ago.

Jose-Matsuda commented 2 years ago

EPIC: https://github.com/StatCan/daaas/issues/461

INVESTIGATE UPDATING XRAY OURSELVES

CURRENT STATUS 22/08/2022

More info on the steps relating to XRAY

See the last comment for pictures of what's going on in the PVC, i.e. which folders are the big ones.


Reasoning:

Since we proxy the ACR via a remote repository, any downloaded artifacts are of course kept in the cache. This cache is limited, so it would be nice to delete an image from the cache after it has been downloaded (and scanned).

[screenshot]

This should take place after each pull has completed, before going on to the next pull (Artifactory must also be configured to BLOCK downloads of unscanned / critical artifacts).

Some related info: https://github.com/StatCan/daaas/issues/960#issuecomment-1080965832
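
For context, a minimal sketch of what evicting a scanned image from the remote repository cache could look like. It assumes the remote repository is reachable through Artifactory's usual "<repo-key>-cache" convention and uses the generic Delete Item REST call; the repo key and image path below are placeholders, not values from this issue.

#!/bin/bash
# Hypothetical sketch: evict a Docker image from an Artifactory remote
# repository cache once it has been pulled and scanned.
ARTIFACTORY_URL="https://jfrog.aaw.cloud.statcan.ca/artifactory"
REPO_KEY="acr-remote"            # placeholder remote repository key
IMAGE_PATH="jupyterlab-cpu/v1"   # placeholder <image>/<tag> path in the cache

# Remote repositories expose their cache as "<repo-key>-cache"; deleting the
# path removes the cached copy without touching the upstream ACR.
curl -u "$JFROG_USERNAME:$APIKEY" -X DELETE \
  "$ARTIFACTORY_URL/$REPO_KEY-cache/$IMAGE_PATH"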

Jose-Matsuda commented 2 years ago

It doesn't seem like this is the exact spot we need to be concerned about; the worry might instead be our PVC named data-volume-jfrog-platform-xray-0 in the jfrog-system namespace, which has a claim to a 2000Gi volume. We have had to resize the PVC to be even larger twice in the past 2 months (from 1.6 around mid-June to 2 in July).
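
For reference, a hedged sketch of how that PVC could be inspected and expanded with kubectl (expansion only works if the StorageClass allows it):

# Check the current size of the Xray data PVC
kubectl -n jfrog-system get pvc data-volume-jfrog-platform-xray-0

# Expand it (requires a StorageClass with allowVolumeExpansion: true)
kubectl -n jfrog-system patch pvc data-volume-jfrog-platform-xray-0 \
  --type merge -p '{"spec":{"resources":{"requests":{"storage":"2000Gi"}}}}'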

While it is worth noting that the 'notebook cleanup' job has been running daily on dev for the past 4 months, looking at the logs it almost seems as if it is not uniq'ing properly, possibly causing the same image to be pulled multiple times. See the following:

Comparing and outputting a list of vulnerable images in the cluster------------
jupyterlab-cpu;v1
jupyterlab-cpu;v1

Where this list is supposed to already be uniq'd. I will test this (run the script locally against dev) and see what results I get; if I get multiple jupyterlab-cpu entries then something is clearly wrong.
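
For reference, a rough sketch (not the actual cleanup job) of how a unique list of notebook images could be pulled from the cluster; the jsonpath expression and the image-name filter are assumptions, not the job's real code.

#!/bin/bash
# Rough sketch: list every container image running in the cluster, filter to
# notebook-style images, and de-duplicate the result.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | grep -E 'jupyterlab|remote-desktop|sas' \
  | sort -u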

Jose-Matsuda commented 2 years ago

OK, running that script I get [screenshot] (so it should be working just fine).

Never mind, I was looking at the wrong part of the log; the downloaded log does have:

List of uniqe notebook images present in cluster-------------------------------
k8scc01covidacr.azurecr.io/jupyterlab-cpu:16b01881
k8scc01covidacr.azurecr.io/jupyterlab-cpu:v1
k8scc01covidacr.azurecr.io/remote-desktop:v1
k8scc01covidacr.azurecr.io/sas:latest
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:a60a0260
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:edb8ab7c

Jose-Matsuda commented 2 years ago

Reading into the specs of Artifactory / XRAY

Trying to glean any information that may be relevant to our disk filling up; will update this. The Artifactory / Xray requirements do not appear to apply to this case, as we don't work with a large amount of stored artifacts: according to Monitoring -> Storage we have around 116GB of total artifacts.

Looking at the JFrog Xray service, it notes these 5 microservices; they are all found as containers of the jfrog-platform-xray-0 pod.

They all mount data-volume. [screenshot]

All of them mount it at /var/opt/jfrog/xray (rw), with the exception of Router, which mounts it at /var/opt/jfrog/router (rw). This /var/opt/jfrog/xray path is the important bit that fills up.
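
A quick way to confirm those mounts on the running pod (a sketch; the namespace and pod name are taken from above):

# Show each container's name and its mount paths for the pod
kubectl -n jfrog-system get pod jfrog-platform-xray-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.volumeMounts[*].mountPath}{"\n"}{end}'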

According to this article by jfrog, "Xray needs two databases to store its scan history and vulnerabilities."

Jose-Matsuda commented 2 years ago

Garbage Collection of XRAY

According to the bottom of this page

"Starting from Xray 3.26.1, Xray's Garbage Collector (GC) feature enables you to avoid race conditions between delete/create events sent by Artifactory mainly when moving Artifacts and promoting images. This feature is active by default and is configurable in the Xray System YAML deleteMode (‘gc’/‘eager’) parameter.

You can manage the Garbage Collector through a set of REST APIs, such as getting the GC status or forcing GC to run. For more information, see Garbage Collector (GC) REST APIs."

Jose-Matsuda commented 2 years ago

Running the following on 17/08/2022 11:48AM

URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/gc/status"

APIKEY=NO
JFROG_USERNAME=WAY

curl -u $JFROG_USERNAME:$APIKEY $URL # > output.txt
{"is_running":false,"last_time_started":"2022-08-17T15:00:00Z","last_time_ended":"2022-08-17T15:00:00Z","last_successful_run":"2022-08-17T15:00:00Z","last_state":"succeeded"}

And then running the configuration get...

#!/bin/bash

URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/configuration/gc"

# generate a new one
APIKEY=
JFROG_USERNAME=

curl -u $JFROG_USERNAME:$APIKEY $URL > gc-configuration.txt
{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}

From these two calls, which actually return information on the garbage collection, it runs successfully and semi-often. Of note: right now the storage logs read Total disk space: 1967.6GB, available disk space: 232.0GB.
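
A small wrapper that takes both readings at once could look like this (same two endpoints as above; jq is only assumed for pretty-printing):

#!/bin/bash
# Sketch of a daily check: GC status and GC configuration from the two
# endpoints used above.
BASE="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1"

curl -s -u "$JFROG_USERNAME:$APIKEY" "$BASE/gc/status" | jq .
curl -s -u "$JFROG_USERNAME:$APIKEY" "$BASE/configuration/gc" | jq .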

Running the next day 18/08/2022

jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh 

{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}

jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh 

{"is_running":false,"last_time_started":"2022-08-18T14:00:00Z","last_time_ended":"2022-08-18T14:00:00Z","last_successful_run":"2022-08-18T14:00:00Z","last_state":"succeeded"}jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ 

And according to xray-indexer we are now at Total disk space: 1967.6GB, available disk space: 214.6GB, almost 20GB down from yesterday. Note that on 17/08/2022 I had also removed the ACR from being indexed. [screenshot]

Running again today at 8:30, 19/08/2022

{"is_running":false,"last_time_started":"2022-08-19T12:00:00Z","last_time_ended":"2022-08-19T12:00:00Z","last_successful_run":"2022-08-19T12:00:00Z","last_state":"succeeded"}

Today the xray-indexer is at Total disk space: 1967.6GB, available disk space: 214.5GB. Note that yesterday I changed the image scanning job to run weekly instead of nightly.

Checking again after the weekend (22/08/2022)

xray-indexer now reports available disk space of 179.6GB

Checking AGAIN (24/08/2022)

xray-indexer now reports available disk space of 162.1GB
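
The "total / available disk space" figures reported above can also be cross-checked from inside the pod, e.g. (the container name here is an assumption based on the microservice list earlier):

# Cross-check the indexer's reported disk figures against the mounted volume
kubectl -n jfrog-system exec jfrog-platform-xray-0 -c xray-indexer -- \
  df -h /var/opt/jfrog/xray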

Possible Next Step

Look at the next comment, "XRAY Indexing of Repos"; we may be able to configure a retention period specifically for the remote docker image repo and make it shorter than the default 90d.

Jose-Matsuda commented 2 years ago

XRAY Indexing of Repos

This "Set a Retention Period" might be useful, if by default it is 90d if we can change that to say 30d maybe it will clean up some of the volume.

There is a REST option

HOWEVER, this requires XRAY 3.41.4

We are on 3.26.1
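
To double-check which version we are actually running before relying on that API, something like the following should work (the system/version endpoint is part of the Xray REST API, but treat the exact path as an assumption):

# Confirm the running Xray version (the retention REST option needs 3.41.4+)
curl -s -u "$JFROG_USERNAME:$APIKEY" \
  "https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/system/version"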

Jose-Matsuda commented 2 years ago

Screenshots (will update): [screenshots]

Finally some divergence: [screenshot]

vulnerability

ls'ing this directory shows a bunch of zip and __vuln files.

component

ls'ing this directory shows a bunch of zip files that follow some pattern [screenshot], and if these are truly timestamps, it appears there is no "clearing" of this data of any sort; 90 days before today (25/08/2022) is (27/05/2022).

The files are of varying sizes [screenshot]; the max size sits at around 443M and the lowest at just 3.3.
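
For reference, a sketch of how the folder sizes and the age of these zips could be checked from inside the pod; the exact directory under /var/opt/jfrog/xray/data is an assumption based on the screenshots.

# Which update folders are the big ones, and which zips are older than 90 days
du -sh /var/opt/jfrog/xray/data/updates/*
find /var/opt/jfrog/xray/data/updates/component -name '*.zip' -mtime +90 \
  -exec ls -lh {} \;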

Jose-Matsuda commented 2 years ago

Concerning the age of these zip files, there appear to be some files from as early as October 2021. [screenshot]

The contents appear to be JSON files. [screenshot]

I cannot seem to unzip the contents of the JSON file. [screenshot]

Tried on my personal machine to see if unzip -p works, and it should. [screenshot]

The files have the same permissions as well. [screenshot]
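
For what it's worth, the comparison was along these lines (the archive name is a placeholder):

# List the archive's contents, then try streaming the JSON inside it to stdout
unzip -l some_component_update.zip
unzip -p some_component_update.zip | head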

Jose-Matsuda commented 2 years ago

Found a source suggesting that xray/data/updates/component might just be information on, say, package A and the vulnerabilities affecting it. <-- actually a slightly different path than the one above.

Jose-Matsuda commented 2 years ago

After a bit of [screenshot]

as a temporary fix we seem to have freed up a lot of space. [screenshot]