It doesn't seem like this is the exact spot we need to be concerned about; instead, the worry might be our PVC named `data-volume-jfrog-platform-xray-0` in the `jfrog-system` namespace, which has a claim to a 2000Gi volume. Recently we have had to resize the PVC to be even larger twice in the past 2 months (from 1.6 around mid-June to 2 in July).
It is worth noting that the 'notebook cleanup' job has been running daily on dev for the past 4 months; looking at the logs, it almost seems as if it is not `uniq`'ing properly, possibly causing the same image to be pulled multiple times.
See the following:
```
Comparing and outputting a list of vulnerable images in the cluster------------
jupyterlab-cpu;v1
jupyterlab-cpu;v1
```
This is supposed to be a `uniq`'d list. I will test this script locally in dev and see what results I get; if I get the multiple jupyterlab-cpu entries then something is clearly wrong.
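For reference, the dedupe step the job relies on boils down to something like this (a sketch only, not the actual cleanup script; the `kubectl` query is my assumption):

```bash
#!/bin/bash
# Sketch: list every container image currently running in the cluster and
# collapse duplicates. `sort -u` is equivalent to `sort | uniq`.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | sort -u
```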
OK, running that script locally it looks like it should be working just fine. NVM, I was looking at the wrong part of the log; the downloaded log does in fact have:
```
List of uniqe notebook images present in cluster-------------------------------
k8scc01covidacr.azurecr.io/jupyterlab-cpu:16b01881
k8scc01covidacr.azurecr.io/jupyterlab-cpu:v1
k8scc01covidacr.azurecr.io/remote-desktop:v1
k8scc01covidacr.azurecr.io/sas:latest
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:a60a0260
k8scc01covidacrdev.azurecr.io/jupyterlab-cpu:edb8ab7c
```
Trying to glean any information that may be relevant to our disk getting full; will update this.
Artifactory / Xray requirements: these do not appear to apply to this case, as we don't work with a large amount of stored artifacts. According to Monitoring -> Storage we have around 116GB of total artifacts.
Looking at the JFrog Xray service, it notes these 5 microservices; they are all found as containers of the `jfrog-platform-xray-0` pod and they all mount `data-volume`. Each mounts `/var/opt/jfrog/xray from data-volume (rw)`, with the exception of Router, which mounts `/var/opt/jfrog/router from data-volume (rw)`. This `/var/opt/jfrog/xray` is the important bit that fills up.
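A quick way to confirm this from the cluster (a sketch; pod and namespace names are taken from above, and the exact output formatting may vary by kubectl version):

```bash
# Print, for each container in the Xray pod, where it mounts data-volume.
kubectl -n jfrog-system get pod jfrog-platform-xray-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.volumeMounts[?(@.name=="data-volume")].mountPath}{"\n"}{end}'
```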
According to this article by JFrog, "Xray needs two databases to store its scan history and vulnerabilities."
According to the bottom of this page:
> Starting from Xray 3.26.1, Xray's Garbage Collector (GC) feature enables you to avoid race conditions between delete/create events sent by Artifactory mainly when moving Artifacts and promoting images. This feature is active by default and is configurable in the Xray System YAML deleteMode (‘gc’/‘eager’) parameter.
> You can manage the Garbage Collector through a set of REST APIs, such as getting the GC status or forcing GC to run. For more information, see Garbage Collector (GC) REST APIs.
Running the following on 17/08/2022 11:48AM:
```bash
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/gc/status"
APIKEY=NO
JFROG_USERNAME=WAY
curl -u $JFROG_USERNAME:$APIKEY $URL # > output.txt
```
returns:
```json
{"is_running":false,"last_time_started":"2022-08-17T15:00:00Z","last_time_ended":"2022-08-17T15:00:00Z","last_successful_run":"2022-08-17T15:00:00Z","last_state":"succeeded"}
```
And then running the configuration get...
```bash
#!/bin/bash
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/configuration/gc"
# generate a new one
APIKEY=
JFROG_USERNAME=
curl -u $JFROG_USERNAME:$APIKEY $URL > gc-configuration.txt
```
returns:
```json
{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}
```
From these two calls, which actually return information on garbage collection, GC runs successfully and fairly often (scheduled every 120 minutes with a max duration of 180 seconds). Of note is that right now the storage logs read `Total disk space: 1967.6GB, available disk space: 232.0GB`.
Re-running both requests the next day (18/08/2022):
```console
jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh
{"scheduler_enabled":true,"scheduler_period_minutes":120,"max_duration_seconds":180,"max_retry_count":3,"idle_listener_enabled":true,"idle_listener_gc_duration_seconds":10,"idle_listener_sampling_rate_seconds":5}
jose@w-matsujo-1:~/Documents/Work/xray-api-tests$ ./request.sh
{"is_running":false,"last_time_started":"2022-08-18T14:00:00Z","last_time_ended":"2022-08-18T14:00:00Z","last_successful_run":"2022-08-18T14:00:00Z","last_state":"succeeded"}
```
And according to the xray-indexer we are now at `Total disk space: 1967.6GB, available disk space: 214.6GB`, almost 20GB down from yesterday. Note that on 17/08/2022 I had also removed the ACR from being indexed.
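The disk numbers above come from the indexer logs; a quick way to pull the latest one (a sketch; the log wording is taken from the snippets here and the container name from the pod above):

```bash
# Grab the most recent "available disk space" line from the xray-indexer container.
kubectl -n jfrog-system logs jfrog-platform-xray-0 -c xray-indexer --tail=2000 \
  | grep "available disk space" | tail -n 1
```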
{"is_running":false,"last_time_started":"2022-08-19T12:00:00Z","last_time_ended":"2022-08-19T12:00:00Z","last_successful_run":"2022-08-19T12:00:00Z","last_state":"succeeded"}
Today the xray-indexer is at `Total disk space: 1967.6GB, available disk space: 214.5GB`; note that yesterday I changed the image scanning job to run weekly instead of nightly.
xray-indexer now at 179.6GB available.
xray-indexer now at 162.1GB available.
Look at the next comment, "XRAY Indexing of Repos"; we may be able to configure this specifically for the remote docker image repo and make it shorter than the default 90d. This "Set a Retention Period" might be useful: if the default is 90d and we can change that to, say, 30d, maybe it will clean up some of the volume. There is a REST option; see the sketch below. We are on 3.26.1.
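If the REST option pans out, it would presumably look something like the following (a sketch only; the `repos_config` endpoint, the `retention_in_days` field, and the repo name are my assumptions and may require the newer Xray version mentioned below):

```bash
#!/bin/bash
# Sketch: shorten the retention period on the remote docker repo via the Xray REST API.
# Endpoint, field name, and repo name are assumptions; verify against our Xray version.
URL="https://jfrog.aaw.cloud.statcan.ca/xray/api/v1/repos_config"
APIKEY=           # redacted
JFROG_USERNAME=   # redacted
curl -u "$JFROG_USERNAME:$APIKEY" -X PUT -H "Content-Type: application/json" \
  -d '{"repo_name": "<acr-remote-repo>", "repo_config": {"retention_in_days": 30}}' \
  "$URL"
```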
- `vulnerability`: `ls`'ing this directory shows a bunch of zip and `__vuln` files.
- `component`: `ls`'ing this directory shows a bunch of zip files that follow some pattern.
If these are truly timestamps, it appears there is no "clearing" of any sort of this data; 90 days before today (25/08/2022) is 27/05/2022. The files are of varying sizes; the largest sits at around 443M and the smallest at just 3.3. Concerning the age of these zip files, there appear to be some from as early as October 2021. The contents appear to be JSON files.
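For reference, the kind of check used to size and age these directories (a sketch; the exact path under the data volume and the container to exec into are assumptions on my part, see the note on paths further down):

```bash
# Size the update directories and list the oldest zips inside the Xray pod.
kubectl -n jfrog-system exec jfrog-platform-xray-0 -c xray-indexer -- sh -c '
  du -sh /var/opt/jfrog/xray/data/updates/*;
  ls -lhtr /var/opt/jfrog/xray/data/updates/component | head -n 10
'
```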
I cannot seem to unzip the JSON contents of these files. I tried `unzip -p` on my personal machine and it should work; the files have the same permissions as well.
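What was attempted, roughly (a sketch; the file name is a placeholder):

```bash
# Peek at the embedded JSON without extracting the archive to disk.
unzip -p <some-component-file>.zip | head -c 500
```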
Found a source suggesting that `xray/data/updates/component` might just be information on, say, package A and the vulnerabilities on it. <-- actually a slightly different path from the one above.
After a bit of that, as a temporary fix, we seem to have freed up a lot of space.
EPIC: https://github.com/StatCan/daaas/issues/461
INVESTIGATE UPDATING XRAY OURSELVES
CURRENT STATUS 22/08/2022
- The remote proxy of the ACR was removed from the index on 17/08/2022. This did not appear to have a big effect on the xray-indexer storage, as the next day (18/08/2022) the available space still went down ~20GB.
- The scanning of images on dev was updated (18/08/2022) to run on Saturdays instead of every day. This seems to have had an impact: checking on 19/08/2022, the available disk space has not gone down much, at most 200MB.
- After we upgrade to Xray 3.41.4 we might be able to mitigate this by setting a retention period; expanded on in this comment w/ relevant links.
- [x] To check: see if on Monday 22/08/2022 the available disk space according to the xray-indexer logs has gone down by ~20GB.

I have changed it to be twice a year. I will note that I find it odd that we only encountered this now, when the notebook scanning job has been running for much longer.
There does not appear to be much for us to do unless we upgrade; we can wait for it to clear itself out a bit first. There might be some huge temporary files that need to be cleaned up. For more info on the steps relating to Xray, see the last comment for pictures of what's going on in the PVC in terms of which folders are the big ones.
Reasoning:
With us proxying the ACR via a remote repository, any downloaded artifacts are of course kept in the cache. This cache is limited, and it would be nice to delete the image from the cache after it has been downloaded (and scanned). This should take place after the pull has completed and before going on to the next pull (Artifactory must also be configured to BLOCK downloads of unscanned / critical artifacts); see the sketch below.
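A rough sketch of what that cleanup step could look like (assumptions on my part: the `-cache` suffix naming for the remote repo's cache and the repo/image placeholders; this uses Artifactory's standard delete-item call and is not something we have wired up):

```bash
#!/bin/bash
# Sketch: after an image has been pulled and scanned, drop it from the remote
# repository's cache so it stops consuming space. Repo and image path are placeholders.
JFROG_BASE="https://jfrog.aaw.cloud.statcan.ca/artifactory"
curl -u "$JFROG_USERNAME:$APIKEY" -X DELETE \
  "$JFROG_BASE/<acr-remote-repo>-cache/<image-path>/<tag>"
```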
Some info https://github.com/StatCan/daaas/issues/960#issuecomment-1080965832