aicoe-aiops / ocp-ci-analysis

Developing AI tools for developers by leveraging the data made openly available by OpenShift and Kubernetes CI platforms.
https://old.operate-first.cloud/data-science/ai4ci/
GNU General Public License v3.0

Alternative way of retrieving testgrid and logs data #198

Open Shreyanand opened 3 years ago

Shreyanand commented 3 years ago

Is your feature request related to a problem? Please describe. As of now, we collect data by scraping https://testgrid.k8s.io/ for testgrid data and https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{tab name}/ for extracting logs. A better alternative would be to get this data directly from Google Cloud Storage through its API. Since all of this is public data, reading it should be possible without any credentials, but that does not appear to be the case.

@MichaelClifford and I tried two things:

1) Something similar to this with api_endpoint="https://storage.googleapis.com", project='openshift-gce-devel-ci' but we got this error -> "ValueError: Anonymous credentials cannot be refreshed."

2) We looked at how testgrid collects the data on its backend (relevant files: 1, 2), but it seems even they use some authorization credentials.

Collecting from Google Cloud Storage seems the better option, as it removes the dependency on the website being scraped. The credentials part needs to be looked into more.
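For reference, public objects in the bucket can also be listed with no Google client library at all, by calling the GCS JSON API anonymously over plain HTTPS. This is only a minimal sketch: the bucket name origin-ci-test comes from the gcsweb links above, and the logs/ prefix is just an illustration.

```python
import json
from urllib.request import urlopen

def list_public_objects(bucket, prefix, max_results=5):
    """List object names in a public GCS bucket via the anonymous JSON API."""
    url = (
        f"https://storage.googleapis.com/storage/v1/b/{bucket}/o"
        f"?prefix={prefix}&maxResults={max_results}"
    )
    with urlopen(url) as resp:
        payload = json.load(resp)
    # Each item is an object's metadata dict; "name" is its full path in the bucket.
    return [item["name"] for item in payload.get("items", [])]

names = list_public_objects("origin-ci-test", "logs/", max_results=3)
print(names)
```

No credentials are involved here, which suggests the "Anonymous credentials cannot be refreshed" error comes from how the client is constructed rather than from the bucket's permissions.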

chauhankaranraj commented 3 years ago

IIUC the goal here is to read logs and artifacts (for example, the ones over here) directly from GCS, right? I'm able to download the build logs using the following snippet without any errors:

from google.cloud import storage

def download_public_file(bucket_name, source_blob_name, destination_file_name):
    """Downloads a public blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client.create_anonymous_client()

    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Downloaded public blob {} from bucket {} to {}.".format(
            source_blob_name, bucket.name, destination_file_name
        )
    )

download_public_file(
    'origin-ci-test',
    'logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade/1377477249040650240/build-log.txt',
    'buildlogs.txt',
)

The downloaded file: (screenshot attached, 2021-04-01)

Would this solve the issue you guys are running into? Or have I misunderstood the problem at hand?

MichaelClifford commented 3 years ago

@Shreyanand, I looked into these examples; using create_anonymous_client() seems to work here for accessing info about the bucket.

storage_client = storage.Client.create_anonymous_client()
Shreyanand commented 3 years ago

@MichaelClifford This looks good! Should we update this in the recent PR? Also @chauhankaranraj, can we get the testgrid pass/fail data through this process too?