hh commented 1 year ago

First you will need the most recent successful job build_id for each job.

Here is a curl + jq shell pipeline to get you a list of 1376 of them:

curl -L https://prow.k8s.io/data.js | jq '.[] | select(.state | contains("success"))' | jq -s 'sort_by(.started)' |  jq 'unique_by(.job)' | jq length

If these are loaded into a table, for each you should be able to call:

JOB=test-infra-cfl-prune
PROW_JOB=62e6a6dc-b4e4-11ed-8f17-f698593abaaa
curl $JOB.yaml -o -L https://prow.k8s.io/prowjob?prowjob=62e6a6dc-b4e4-11ed-8f17-f698593abaaa
This yaml can be loaded per job name, though in practice the build_id is part of a reason / trigger this job was run and prow_job is a particular pod / set of job results.

With the yaml you can retrieve the .status.build_id which can be used to retrieve the spyglass web url for the logs/artifacts:

JOB_NAME=$(curl -s -L https://prow.k8s.io/prowjob?prowjob=$PROW_JOB | yq '.metadata.annotations["prow.k8s.io/job"]')
BUILD_ID=`curl -s -L https://prow.k8s.io/prowjob?prowjob=$PROW_JOB | yq .status.build_id`
curl -L https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/$JOB_NAME/$BUILD_ID

Which is a mirror of the s3 buckets, which we could possibly just retrieve via an aws s3 for direct loading.

hh commented 1 year ago

TLDR; Run this command and load the contents into a table, where the job and build_id and logfile path are keys and the contents are indexed text:

Run this first and explore the data set, some folders may be huge, let's focus on grabbing at least interesting logs.

hh commented 1 year ago

It's taking a while to run but here are the initial results, we could also just generate a list of URLS and be picky about what we import. We could update the mirror command to just output a list of all urls that need to be retrieved rather than actually retrieving.

du -sh gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/*/*/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/apiserver-network-proxy-push-images/1629276023138816000/
4.0K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/apisnoop-conformance-gate/1629257708022534144/
8.0K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/auto-refreshing-official-cve-feed/1629240343373287424/
8.0K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/auto-refreshing-official-cve-feed/1629270795232481280/
264K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/build-win-soak-test-cluster/1629647486446473216/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-csi-driver-host-path-push-images/1628533468071727104/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-csi-driver-iscsi-push-images/1629203853511495680/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-csi-driver-nfs-push-images/1628580780437409792/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-csi-driver-smb-push-images/1627807478244708352/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-csi-test-push-images/1627807478299234304/
1.3M    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-e2e-gce-cloud-provider-disabled/1629434833551757312/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-external-attacher-push-images/1627807478378926080/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-external-health-monitor-push-images/1627807478538309632/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-external-provisioner-push-images/1627807478672527360/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-external-resizer-push-images/1627807478748024832/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-external-snapshotter-push-images/1627807478840299520/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-lib-volume-populator-push-images/1628539759376732160/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-livenessprobe-push-images/1627807478940962816/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-nfs-ganesha-server-and-external-provisioner-push-images/1628224984025403392/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-nfs-subdir-external-provisioner-push-images/1628138413477597184/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-node-driver-registrar-push-images/1627807479045820416/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-sig-storage-local-static-provisioner-push-images/1627807479146483712/
 12K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/canary-volume-data-source-validator-push-images/1628539759427063808/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-1-23/1629585326815055872/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-1-24/1629367891923046400/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-1-25/1629334213750689792/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-1-26/1629409668369485824/
320K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-master/1629424012239048704/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-23/1629585326861193216/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-24/1629367891977572352/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-25/1629334213796827136/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-26/1629409668415623168/
320K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-disk-vmss-master/1629279602905976832/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-1-23/1629585326706003968/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-1-24/1629367640499687424/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-1-25/1629334213650026496/
304K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-1-26/1629409668252045312/
320K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-master/1629424012188717056/
296K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-vmss-1-23/1629585326760529920/
296K    gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/capz-azure-file-vmss-1-24/1629367891881103360/

zachmandeville commented 1 year ago

Sounds good. I will update the command to get a list of urls, or see if i can get them from the data we hae loaded so far, and then do some data exploration of some sample url's downloading their contents and poking around.

The goal, ultimately, is to see if we can find usages of the old registry in these logs, that are not in the job's definition yaml? This would tell us that we need to look at the logs like this to ensure we are not missing anything.

Are there any other checks I should be doing to validate that we are going on the right path?

zachmandeville commented 1 year ago

Made some alright progress today.

I was thinking that one of the main things we want is the build log that shows the what's being downloaded during the process. We can figure out the url for this log using the url key available for each object in that data.js output.

each object has a url that points to their spyglass, and in that is the url that points to the raw build log. Essentially, if the url is: prow.k8s.io/view/gs/somepath/$job/$build you can switch it to: storage.googleapis.com/somepath/$job/$build/build-log.txt

If we just swapped out the url in the given jq invocation, we'll end up with the build logs in some deeply nested files. And we'll still have to use some additionall logic to connect that log to the specific job and id, mainly taking the path of the file location and parsing the job id and build from it...so that we could connect it to the appropriate job in the rest of our tables.

I explored doing this in the existing prowfetch.clj work. I can filter the successful jobs, then build up a map of {:job :build_id :url :log} and push that to json. We'd then have a json file with all the stuff we need that could be dumped into a table and joined with other tables for some good querying.

At the moment, I am doing it with some chunking, as even just getting the job logs, if we tried to put it into a single json it'd be several gigabytes in size. So i am chunking the work, creating multiple jsons, and will then copy them into or job_lob table one by one.

I got to the point of batch creating these json files, but not yet into loading it into sql or solidifying it into an init script.

zachmandeville commented 1 year ago

Expanded prowfetch to grab the artifact urls and then loaded them into the new prow artifact table with this loading script

hh commented 1 year ago

There are heaps of logs, let's query and load them selectively into a new table that has indexed columns this:

TXT job name
TXT build id
TXT artifact / log path (the path to each google storage [gcs] object underneath the buildI id in the url ie: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-k8s-infra-build-cluster-prow-build-trusted/1630658000534376448/ -> build-log.txt,prow job.json,sidecar-logs.json )
JSONB data : for yaml and json objects, load them as indexed json for easy searching of the document
TXT content: for text context indexed so we can look at non structured data

Use this as an example URL to pull from https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-k8s-infra-build-cluster-prow-build-trusted/1630658000534376448/

To load that table let's progress through four artefact types:

[ ] build-log.txt for every prow job / latest successful build id. (content column)
[ ] load the json files at top level (json data column)
[ ] yaml files (json data column, not even sure there are any, but convert them into json before loading)
[ ] other log files (in particular looking for log files that are part of a cluster coming up / pulling images)

hh commented 1 year ago

Some examples across the types:

build-log.txt

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-network-ipvs-dual/1630904124696432640/build-log.txt

Json

hh commented 1 year ago

[ ] Create a standalone function that will take a URL w/ folder and return all non-folder URLS
[ ] Create a pmap to parallel call that function for all gcsweb.urls and load all non-folder URLs into the tables as defined above ^^

zachmandeville commented 1 year ago

After morning discussion, we came up with this alternate design

Design
- Postgres
- Side-App
Running
Second day, third day, etc.

Design

Postgres

DB and tables already set up, and running.
We also have a sql function load_job(job,build-id,jsonb)

Side-App

we load data.js and filter to successful jobs
we now have a list of jobs and their url’s, and we know that this would be recursive.
the application takes the first url,
- fetches it,
- converts to a json string,
- ???
- connects to db
- calls the load_job function

Running

both are set up and running and we leave be

Second day, third day, etc.

side app function to check latest jobs and load them
side app calls flush_old_job function

hh commented 1 year ago

Had a good meeting with @kubermatic, but we had some failures loading everything. They'd like an option to load one fully populated and another for fresh. Probably just a drop down for image selection on the coder template OR a command that injects the data rather than processing it manually at run time.

cncf-infra / infrasnoop

Add latest successful prow job logs to infrasnoop #3

build-log.txt

Json

yaml

Other Text Files

After morning discussion, we came up with this alternate design

Design

Postgres

Side-App

Running

Second day, third day, etc.