catalyst-cooperative / pudl-usage-metrics

A dagster ETL for collecting and cleaning PUDL usage metrics.

Revitalize the collection of PUDL usage metrics #128

Open · opened by jdangerx 4 months ago

jdangerx commented 4 months ago

Overview

In order to better track the development of PUDL, the success of our outreach efforts, and the effects of our new Superset instance, we need to revitalize the pudl-usage-metrics repository and collect usage metrics from the sources detailed in the sections below (S3, Datasette, Superset, Zenodo, Kaggle, and GitHub).

We're interested in the following types of metrics to start:

As a first step, we should be able to ETL the logs and metrics from each of these data sources and get a weekly summary that we can look at. As a second step, we want to hook up our metrics to a private Superset dataset and build some dashboards for easy interpretation.

Out of scope

Infrastructure

The pudl-usage-metrics repository hasn't been maintained for a while. We'll need to get it up to speed to support this development work.

### Infrastructure tasks
- [x] Update packages https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/119
- [x] The repo was created with an older version of the cheshire template. Update the packaging methodology and pre-commit hooks: replace setup.py with pyproject.toml and swap conda for mamba. https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/119
- [x] Update Python version to 3.12 https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/119
- [ ] https://github.com/catalyst-cooperative/pudl/pull/3841
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/175

S3 Logs

S3 is our main programmatic access method. S3 logs are currently mirrored to a GCS bucket, and each request produces one log entry.
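For reference, pulling the fields we care about out of one of these entries is roughly this simple. This is a minimal sketch assuming the standard S3 server access log layout; the pattern and field names are illustrative, not taken from our actual logs:

```python
import re
from datetime import datetime

# S3 server access logs are space-delimited, with the timestamp in square
# brackets and the request URI in quotes. This pattern keeps only the fields
# we care about for usage metrics: time, remote IP, operation, and key.
S3_LOG_PATTERN = re.compile(
    r'^\S+ (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'\S+ \S+ (?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)"'
)


def parse_s3_log_line(line: str) -> dict | None:
    """Parse one S3 access log line into a timestamp/IP/key record."""
    match = S3_LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 06/Feb/2024:00:38:00 +0000
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record
```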

### S3 Usage Metrics
- [x] If the config to mirror the S3 bucket to GCS already exists in S3, start by getting the existing code to run locally on the latest S3 logs
- [x] Decide: put this in devtools and run it weekly via a GHA, or update the usage metrics repo?
- [x] Clean up `pudl-usage-metrics` repo: see https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/118
- [x] Configure maximum run concurrency
- [x] Update existing code and add to `pudl-usage-metrics` repo
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/140
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/147

Datasette

While we're planning to retire Datasette, it'd still be helpful to understand the history of usage and to see how usage changes during the transition to Superset. The log ETL that exists in pudl-usage-metrics hasn't worked since the transition to fly.io.

fly.io currently doesn't retain logs for long, so we need to use fly-log-shipper (https://github.com/superfly/fly-log-shipper) to ship logs to S3.

It also doesn't log the IP address of Datasette requests; the IP currently logged is presumably the load balancer's. Load balancers usually pass the original client IP along in a header (e.g. X-Forwarded-For), so we should be able to extract it from there. It seems we can't configure Datasette's access logs directly, so we'll need to run it behind something we can configure, like NGINX.
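A minimal sketch of what that parsing could look like, assuming we put Datasette behind NGINX with a custom log_format that appends the X-Forwarded-For value (the format, field names, and fallback logic here are assumptions, not our current setup):

```python
import re
from datetime import datetime

# Assumes an NGINX log_format roughly like:
#   '$remote_addr - $remote_user [$time_local] "$request" $status '
#   '$body_bytes_sent "$http_x_forwarded_for"'
# where $http_x_forwarded_for carries the original client IP forwarded by
# the fly.io load balancer.
NGINX_LOG_PATTERN = re.compile(
    r'^(?P<remote_addr>\S+) - \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) [^"]*" (?P<status>\d+) \d+ '
    r'"(?P<forwarded_for>[^"]*)"'
)


def parse_datasette_log_line(line: str) -> dict | None:
    """Turn one access log line into a timestamp/IP/resource record."""
    match = NGINX_LOG_PATTERN.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Prefer the forwarded client IP over the load balancer address.
    forwarded = rec["forwarded_for"].split(",")[0].strip()
    rec["client_ip"] = forwarded or rec["remote_addr"]
    return rec
```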

### Datasette Tasks
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/148
- [ ] Write a script to parse logs into timestamp/IP/resource records
- [ ] Compare and contrast with the old Datasette logs
- [ ] Update the Datasette ETL to use assets rather than ops, and to conform to the format of the S3 processing

Superset

We're slowly deploying a new data visualization tool! It'll give us a lot of usage information, which we should capture and process. See https://engineering.hometogo.com/monitor-superset-usage-via-superset-c7f9fba79525 for a template.

### Superset Tasks
- [ ] See https://github.com/hometogo/hometogo-data-code-snippets/tree/main/superset/usage_dashboard

Zenodo

Zenodo API calls return stats on views and downloads for a record at a particular point in time. We should periodically (weekly?) collect stats on all of our archives on Zenodo and archive them for later processing.
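A minimal sketch of what a weekly snapshot could look like, hitting the public records endpoint; the `stats` field names are whatever the API returns, and the record IDs below are placeholders:

```python
from datetime import datetime, timezone

import requests

# Placeholder record IDs; in practice we'd enumerate all of our Zenodo archives.
RECORD_IDS = ["123456", "654321"]


def snapshot_zenodo_stats(record_id: str) -> dict:
    """Fetch the view/download stats Zenodo reports for a record right now."""
    resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
    resp.raise_for_status()
    stats = resp.json().get("stats", {})
    # These are point-in-time totals, so stamp them with the query time and
    # archive the snapshots to build a time series later.
    return {
        "record_id": record_id,
        "snapshot_time": datetime.now(timezone.utc).isoformat(),
        **stats,
    }


snapshots = [snapshot_zenodo_stats(rid) for rid in RECORD_IDS]
```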

### Zenodo Tasks
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/176
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/181

Kaggle

Kaggle collects data on views and downloads through its dataset metadata JSON, accessible through the api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET) call from the KaggleApi. Like Zenodo, this is data reported at the time of query, so we'll need to archive these metrics to see changes over time.
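A rough sketch of archiving those snapshots (the metadata_get call is the one mentioned above; the owner/dataset slugs are placeholders, and exactly where the view/download counts live in the returned metadata is something to verify):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

from kaggle.api.kaggle_api_extended import KaggleApi

KAGGLE_OWNER = "catalystcooperative"  # placeholder owner slug
KAGGLE_DATASET = "pudl-project"  # placeholder dataset slug

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json or env vars

# The metadata is reported as of query time, so write each response out with
# a timestamp and diff the snapshots later to see changes over time.
metadata = api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
snapshot = {"snapshot_time": stamp, "metadata": metadata}
Path(f"kaggle_metadata_{stamp}.json").write_text(json.dumps(snapshot, default=str))
```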

### Kaggle Tasks
- [ ] #162
- [ ] #167
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/168
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/180

GitHub

Migrate our GitHub metrics archiving from the business repository and add it to our ETL.
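For reference, the archiving piece boils down to calls like this against the GitHub traffic API, which only returns the trailing two weeks of data (hence archiving on a schedule); the token and repo list are placeholders:

```python
import os

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # placeholder; needs push access to the repo
REPOS = ["catalyst-cooperative/pudl"]  # placeholder list of repos to track


def fetch_traffic(repo: str, kind: str) -> dict:
    """Fetch views or clones traffic; GitHub keeps only ~14 days of history."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/traffic/{kind}",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


traffic = {
    repo: {kind: fetch_traffic(repo, kind) for kind in ("views", "clones")}
    for repo in REPOS
}
```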

### GitHub Tasks
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/161
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/pull/168

Reporting and Visualization

Once the data is processed, we'll need to analyze and report on metrics of interest in order to interpret changes in usage and highlight trends of interest.

Some interesting references for Superset usage dashboards are linked in the Superset section above.

### Reporting and Summaries
- [ ] Define usage metrics of interest across data sources
- [ ] Write a script to output a weekly usage summary (see the sketch after this list)
- [ ] Write a GHA to produce and upload this weekly summary to GCS every week
- [ ] Hook up the database to Superset
- [ ] https://github.com/catalyst-cooperative/pudl-usage-metrics/issues/184
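A minimal sketch of the weekly summary step, assuming the cleaned logs from each source have been normalized into a shared tabular format (the column names here are assumptions):

```python
import pandas as pd


def weekly_summary(logs: pd.DataFrame) -> pd.DataFrame:
    """Summarize request counts and unique IPs per source per week.

    Assumes `logs` has at least `timestamp`, `source`, and `client_ip` columns.
    """
    logs = logs.assign(week=logs["timestamp"].dt.to_period("W").dt.start_time)
    return (
        logs.groupby(["source", "week"])
        .agg(requests=("timestamp", "size"), unique_ips=("client_ip", "nunique"))
        .reset_index()
    )
```
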
jdangerx commented 4 months ago

We should timebox this to 5h and prioritize getting S3 parquet logs because of the possibility of replacing datasette altogether.

bendnorman commented 3 months ago

I think revamping the pudl-usage-metrics repo will take some work. Maybe we can simplify the task by "disabling" the current metrics in the ETL and integrating just the S3 logs, since those are the highest value / most relevant right now. I opened a PR with my janky S3 log download script and notebook.

The ETL generally works like this:

  1. Pull some logs from GCS
  2. Clean them up with pandas and Dagster
  3. Load the cleaned logs into Cloud SQL (Postgres)
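As a sketch of those same three steps expressed as Dagster assets (the bucket path, column names, and connection string are placeholders, not the real pipeline):

```python
import pandas as pd
from dagster import Definitions, asset
from sqlalchemy import create_engine


@asset
def raw_s3_logs() -> pd.DataFrame:
    # 1. Pull some logs from GCS (placeholder bucket/prefix; needs gcsfs).
    return pd.read_json("gs://example-bucket/s3-logs/latest.jsonl", lines=True)


@asset
def clean_s3_logs(raw_s3_logs: pd.DataFrame) -> pd.DataFrame:
    # 2. Do some cleaning with pandas: parse timestamps, drop incomplete rows, etc.
    cleaned = raw_s3_logs.dropna(subset=["remote_ip"])
    return cleaned.assign(time=pd.to_datetime(cleaned["time"], utc=True))


@asset
def s3_logs_table(clean_s3_logs: pd.DataFrame) -> None:
    # 3. Load the cleaned logs into the warehouse (Cloud SQL Postgres today);
    #    the connection string is a placeholder.
    engine = create_engine("postgresql://user:pass@host/usage_metrics")
    clean_s3_logs.to_sql("s3_logs", engine, if_exists="append", index=False)


defs = Definitions(assets=[raw_s3_logs, clean_s3_logs, s3_logs_table])
```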

I have a GitHub Action that processes the latest logs and loads them into Cloud SQL. Cloud SQL is kind of expensive, so it might make more sense to use BigQuery.
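If we go the BigQuery route, loading a cleaned DataFrame is a short job with the BigQuery client library; a hedged sketch, with a toy frame and a placeholder table ID, assuming application default credentials are available to the action:

```python
import pandas as pd
from google.cloud import bigquery

# Toy frame standing in for the cleaned logs; the table ID is a placeholder.
df = pd.DataFrame({"time": pd.to_datetime(["2024-07-01"]), "remote_ip": ["203.0.113.5"]})

client = bigquery.Client()
job = client.load_table_from_dataframe(df, "my-project.usage_metrics.s3_logs")
job.result()  # block until the load job completes
```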

I think it makes sense to create a quick design doc for the usage metrics revamp, given there is a lot we could do.

e-belfer commented 2 months ago

I've updated this issue to be an epic reflecting all our logs and possible workflows, and have tried to break the work into smaller steps in the task lists.

zaneselvans commented 1 month ago

@e-belfer was this issue supposed to get closed by #162?

e-belfer commented 1 month ago

Definitely not!