It doesn't seem like this is the exact spot we need to be concerned about; instead, the worry might be our PVC named `data-volume-jfrog-platform-xray-0` in the `jfrog-system` namespace, which has a claim to a 2000Gi volume. Recently we have had to resize the PVC to be even larger twice in the past 2 months (from 1.6 around mid-June to 2 in July).
It is worth noting that the 'notebook cleanup' job has been running daily on dev for the past 4 months; looking at the logs, it almost seems as if it is not `uniq`'ing properly, possibly causing the same image to be pulled multiple times.
See the following:
```
Comparing and outputting a list of vulnerable images in the cluster------------
jupyterlab-cpu;v1
jupyterlab-cpu;v1
```
This is supposed to be a `uniq`'d list. I will test this script locally in dev and see what results I get; if I get the multiple jupyterlab-cpu entries then something is clearly wrong.
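For reference, the dedupe step the job relies on boils down to something like this (a sketch only, not the actual cleanup script; the `kubectl` query is my assumption):

```bash
#!/bin/bash
# Sketch: list every container image currently running in the cluster and
# collapse duplicates. `sort -u` is equivalent to `sort | uniq`.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | sort -u
```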
OK, running that script locally it looks like it should be working just fine. NVM, I was looking at the wrong part of the log; the downloaded log does in fact have:
```
List of uniqe notebook images present in cluster-------------------------------
k8scc01covidacr.azurecr.io/jupyterlab-cpu:16b01881
k8scc01covidacr.azurecr.io/jupyterlab-cpu:v1
k8scc01covidacr.azurecr.io/remote-desktop:v1
k8scc01covidacr.azurecr.io/sas:latest
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:a60a0260
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:edb8ab7c
```
Trying to glean any information that may be relevant to our disk getting full; will update this.
Artifactory / Xray requirements: these do not appear to apply to this case, as we don't work with a large amount of stored artifacts. According to Monitoring -> Storage we have around 116GB of total artifacts.
Looking at the JFrog Xray service, it notes these 5 microservices; they are all found as containers of the `jfrog-platform-xray-0` pod and they all mount `data-volume`. Each mounts `/var/opt/jfrog/xray from data-volume (rw)`, with the exception of Router, which mounts `/var/opt/jfrog/router from data-volume (rw)`. This `/var/opt/jfrog/xray` is the important bit that fills up.
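A quick way to confirm this from the cluster (a sketch; pod and namespace names are taken from above, and the exact output formatting may vary by kubectl version):

```bash
# Print, for each container in the Xray pod, where it mounts data-volume.
kubectl -n jfrog-system get pod jfrog-platform-xray-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.volumeMounts[?(@.name=="data-volume")].mountPath}{"\n"}{end}'
```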
According to this article by JFrog, "Xray needs two databases to store its scan history and vulnerabilities."
According to the bottom of this page:
> Starting from Xray 3.26.1, Xray's Garbage Collector (GC) feature enables you to avoid race conditions between delete/create events sent by Artifactory mainly when moving Artifacts and promoting images. This feature is active by default and is configurable in the Xray System YAML deleteMode (‘gc’/‘eager’) parameter.
> You can manage the Garbage Collector through a set of REST APIs, such as getting the GC status or forcing GC to run. For more information, see Garbage Collector (GC) REST APIs.
Running the following on 17/08/2022 11:48AM:
```bash
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/gc/status"
APIKEY=NO
JFROG_USERNAME=WAY
curl -u $JFROG_USERNAME:$APIKEY $URL # > output.txt
```
returns:
```json
{"is_running":false,"last_time_started":"2022-08-17T15:00:00Z","last_time_ended":"2022-08-17T15:00:00Z","last_successful_run":"2022-08-17T15:00:00Z","last_state":"succeeded"}
```
And then running the configuration get...
```bash
#!/bin/bash
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/configuration/gc"
# generate a new one
APIKEY=
JFROG_USERNAME=
curl -u $JFROG_USERNAME:$APIKEY $URL > gc-configuration.txt
```
returns:
```json
{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}
```
From these two calls, which actually return information on garbage collection, GC runs successfully and fairly often (scheduled every 120 minutes with a max duration of 180 seconds). Of note is that right now the storage logs read `Total disk space: 1967.6GB, available disk space: 232.0GB`.
Re-running both requests the next day (18/08/2022):
```console
jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh
{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}
jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh
{"is_running":false,"last_time_started":"2022-08-18T14:00:00Z","last_time_ended":"2022-08-18T14:00:00Z","last_successful_run":"2022-08-18T14:00:00Z","last_state":"succeeded"}
```
And according to the xray-indexer we are now at `Total disk space: 1967.6GB, available disk space: 214.6GB`, almost 20GB down from yesterday. Note that on 17/08/2022 I had also removed the ACR from being indexed.
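The disk numbers above come from the indexer logs; a quick way to pull the latest one (a sketch; the log wording is taken from the snippets here and the container name from the pod above):

```bash
# Grab the most recent "available disk space" line from the xray-indexer container.
kubectl -n jfrog-system logs jfrog-platform-xray-0 -c xray-indexer --tail=2000 \
  | grep "available disk space" | tail -n 1
```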
{"is_running":false,"last_time_started":"2022-08-19T12:00:00Z","last_time_ended":"2022-08-19T12:00:00Z","last_successful_run":"2022-08-19T12:00:00Z","last_state":"succeeded"}
Today the xray-indexer is at `Total disk space: 1967.6GB, available disk space: 214.5GB`; note that yesterday I changed the image scanning job to run weekly instead of nightly.
xray-indexer now at 179.6GB available.
xray-indexer now at 162.1GB available.
Look at the next comment, "XRAY Indexing of Repos"; we may be able to configure this specifically for the remote docker image repo and make it shorter than the default 90d. This "Set a Retention Period" might be useful: if the default is 90d and we can change that to, say, 30d, maybe it will clean up some of the volume. There is a REST option; see the sketch below. We are on 3.26.1.
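If the REST option pans out, it would presumably look something like the following (a sketch only; the `repos_config` endpoint, the `retention_in_days` field, and the repo name are my assumptions and may require the newer Xray version mentioned below):

```bash
#!/bin/bash
# Sketch: shorten the retention period on the remote docker repo via the Xray REST API.
# Endpoint, field name, and repo name are assumptions; verify against our Xray version.
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/repos_config"
APIKEY=           # redacted
JFROG_USERNAME=   # redacted
curl -u "$JFROG_USERNAME:$APIKEY" -X PUT -H "Content-Type: application/json" \
  -d '{"repo_name": "<acr-remote-repo>", "repo_config": {"retention_in_days": 30}}' \
  "$URL"
```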
- `vulnerability`: `ls`'ing this directory shows a bunch of zip and `__vuln` files.
- `component`: `ls`'ing this directory shows a bunch of zip files that follow some pattern.
If these are truly timestamps, it appears there is no "clearing" of any sort of this data; 90 days before today (25/08/2022) is 27/05/2022. The files are of varying sizes; the largest sits at around 443M and the smallest at just 3.3. Concerning the age of these zip files, there appear to be some from as early as October 2021. The contents appear to be JSON files.
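For reference, the kind of check used to size and age these directories (a sketch; the exact path under the data volume and the container to exec into are assumptions on my part, see the note on paths further down):

```bash
# Size the update directories and list the oldest zips inside the Xray pod.
kubectl -n jfrog-system exec jfrog-platform-xray-0 -c xray-indexer -- sh -c '
  du -sh /var/opt/jfrog/xray/data/updates/*;
  ls -lhtr /var/opt/jfrog/xray/data/updates/component | head -n 10
'
```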
I cannot seem to unzip the JSON contents of these files. I tried `unzip -p` on my personal machine and it should work; the files have the same permissions as well.
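What was attempted, roughly (a sketch; the file name is a placeholder):

```bash
# Peek at the embedded JSON without extracting the archive to disk.
unzip -p <some-component-file>.zip | head -c 500
```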
Found a source suggesting that `xray/data/updates/component` might just be information on, say, package A and the vulnerabilities on it. <-- actually a slightly different path from the one above.
After a bit of that, as a temporary fix, we seem to have freed up a lot of space.
EPIC: https://github.com/StatCan/daaas/issues/461
INVESTIGATE UPDATING XRAY OURSELVES
CURRENT STATUS 22/08/2022
- The remote proxy of the ACR was removed from the index on 17/08/2022. This did not appear to have a big effect on the xray-indexer storage, as the next day (18/08/2022) the available space still went down ~20GB.
- The scanning of images on dev was updated (18/08/2022) to run on Saturdays instead of every day. This seems to have had an impact: checking on 19/08/2022, the available disk space has not gone down much, at most 200MB.
- After we upgrade to Xray 3.41.4 we might be able to mitigate this by setting a retention period; expanded on in this comment w/ relevant links.
- [x] To check: see if on Monday 22/08/2022 the available disk space according to the xray-indexer logs has gone down by ~20GB.

I have changed it to be twice a year. I will note that I find it odd that we only encountered this now, when the notebook scanning job has been running for much longer.
There does not appear to be much for us to do unless we upgrade; we can wait for it to clear itself out a bit first. There might be some huge temporary files that need to be cleaned up. For more info on the steps relating to Xray, see the last comment for pictures of what's going on in the PVC in terms of which folders are the big ones.
Reasoning:
With us proxying the ACR via a remote repository, any downloaded artifacts are of course kept in the cache. This cache is limited, and it would be nice to delete the image from the cache after it has been downloaded (and scanned). This should take place after the pull has completed and before going on to the next pull (Artifactory must also be configured to BLOCK downloads of unscanned / critical artifacts); see the sketch below.
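A rough sketch of what that cleanup step could look like (assumptions on my part: the `-cache` suffix naming for the remote repo's cache and the repo/image placeholders; this uses Artifactory's standard delete-item call and is not something we have wired up):

```bash
#!/bin/bash
# Sketch: after an image has been pulled and scanned, drop it from the remote
# repository's cache so it stops consuming space. Repo and image path are placeholders.
JFROG_BASE="https://jfrog.aaw.cloud.statcan.ca/artifactory"
curl -u "$JFROG_USERNAME:$APIKEY" -X DELETE \
  "$JFROG_BASE/<acr-remote-repo>-cache/<image-path>/<tag>"
```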
Some info https://github.com/StatCan/daaas/issues/960#issuecomment-1080965832