abcxyz / github-metrics-aggregator


(Leech) Logs file type in gcs does not match file extension. #141

Open pdewilde opened 9 months ago

pdewilde commented 9 months ago

TL;DR

Logs are being saved in GCS with the file extension .tar.gz, but the archives are actually zip files. Either the file extension should be changed to .zip, or the archives should be written in .tar.gz format and the existing files re-compressed.

Observed behavior

$ gunzip  logs.tar.gz 
gzip: logs.tar.gz has more than one entry -- unchanged

$ file logs.tar.gz 
logs.tar.gz: Zip archive data, at least v2.0 to extract, compression method=deflate

$ unzip logs.tar.gz
Archive:  logs.tar.gz

pdewilde commented 9 months ago

https://github.com/abcxyz/github-metrics-aggregator/blob/a875b5c48b6915df6c8d4e7acd4427376e8e72ca/pkg/leech/ingest_logs.go#L143

It seems we hardcoded the GCS filename, so we ignore the extension of the log file we download from GitHub.
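One way to avoid trusting a hardcoded extension would be to look at the first bytes of the downloaded body before naming the object. The following is only a sketch: `archiveExt` and `objectName` are hypothetical helpers, not functions from this repository, and the magic numbers cover only non-empty zip archives and gzip streams.

```go
package main

import (
	"bytes"
	"fmt"
)

// Magic numbers found at the start of each format.
var (
	zipMagic  = []byte{'P', 'K', 0x03, 0x04} // local file header of a non-empty zip
	gzipMagic = []byte{0x1f, 0x8b}           // gzip member header (RFC 1952)
)

// archiveExt guesses a file extension from the leading bytes of a body.
// Hypothetical helper: it cannot tell whether a gzip stream wraps a tar
// archive, so it reports ".gz" rather than ".tar.gz".
func archiveExt(head []byte) string {
	switch {
	case bytes.HasPrefix(head, zipMagic):
		return ".zip"
	case bytes.HasPrefix(head, gzipMagic):
		return ".gz"
	default:
		return "" // unknown format: add no extension
	}
}

// objectName appends the detected extension to a base object name.
func objectName(base string, head []byte) string {
	return base + archiveExt(head)
}

func main() {
	fmt.Println(objectName("logs", []byte("PK\x03\x04...")))   // logs.zip
	fmt.Println(objectName("logs", []byte{0x1f, 0x8b, 0x08})) // logs.gz
}
```

Sniffing the content keeps the stored name honest even if the upstream format changes, at the cost of reading a few bytes before the upload starts.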

pdewilde commented 8 months ago

This seems more complicated than I thought. We say we will accept "application/vnd.github+json", but my understanding is that, unless we explicitly request gzip encoding, the HTTP client should transparently decompress the body.

There are a few possibilities I need to look into:

  1. Specifying the content type makes the Go HTTP library assume that whatever comes back is what I asked for, even if the response headers specify a zip transport encoding.
  2. We are somehow compressing to zip during the upload process.
  3. GCS is compressing for us, but not transparently.

I'll need to get some GitHub credentials so I can reproduce the actual HTTP requests locally and figure out exactly what is going on.

pdewilde commented 8 months ago

https://superuser.blog/golang-http-gzip-compression

TODO: read that

sethvargo commented 8 months ago

GCS will apply gzip compression for transit if the client accepts it. Here's an example of writing a tgz object to GCS.

pdewilde commented 8 months ago

OK, then I suspect that the body from the GitHub API is zipped and, for some reason, is not being unzipped by the HTTP client. I'll have to take a closer look.

I wouldn't expect that, since the content type we said we accept is a JSON type, not application/zip.

sethvargo commented 8 months ago

It's pretty nuanced, but see https://cloud.google.com/storage/docs/transcoding. Content-Encoding is probably more relevant here. Similarly, Accept and Accept-Encoding.