jdangerx opened 4 months ago
We should timebox this to 5h and prioritize getting the S3 Parquet access logs, because of the possibility of replacing Datasette altogether.
I think revamping the pudl-usage-metrics repo will take some work. Maybe we can simplify the task by "disabling" the current metrics in the ETL and integrating just the S3 logs, since those are the highest value / most relevant right now. I opened a PR with my janky S3 log download script and notebook.
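For reference, a minimal sketch of the kind of thing that download script does, assuming the access logs are delivered as small text objects to a logging bucket (the bucket and prefix names here are placeholders, not the real ones from the PR):

```python
import pathlib

import boto3  # assumes AWS credentials are configured locally

# Placeholder names -- the real bucket/prefix live in the PR's script.
LOG_BUCKET = "pudl-s3-logs"
LOG_PREFIX = "logs/"

s3 = boto3.client("s3")
out_dir = pathlib.Path("s3_logs")
out_dir.mkdir(exist_ok=True)

# Each S3 server access log is a small text object; page through and download.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=LOG_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.download_file(LOG_BUCKET, key, str(out_dir / key.replace("/", "_")))
```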
The ETL generally works like this: I have a GitHub Action that processes the latest logs and loads them into Cloud SQL. Cloud SQL is kind of expensive, so it might make more sense to use BQ.
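If we do move off Cloud SQL, the BQ load step could be as simple as this sketch (the project/dataset/table names are made up, and it assumes the parser writes a Parquet file):

```python
import pandas as pd
from google.cloud import bigquery

# Hypothetical table ID -- adjust to whatever project/dataset we settle on.
TABLE_ID = "catalyst-cooperative-pudl.usage_metrics.s3_logs"

client = bigquery.Client()
df = pd.read_parquet("processed_s3_logs.parquet")  # output of the log parser

# WRITE_APPEND adds the new batch of logs to the existing table.
job = client.load_table_from_dataframe(
    df,
    TABLE_ID,
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
)
job.result()  # wait for the load to finish
```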
I think it makes sense to create a quick design doc for the usage metrics revamp, given there is a lot we could do.
I've updated this issue to be an epic reflecting all our logs and possible workflows, and have tried to break out smaller steps in the task lists.
@e-belfer was this issue supposed to get closed by #162?
Definitely not!
Overview
In order to better trace the development of PUDL, the success of our outreach efforts, and the effects of our new Superset instance, we need to revitalize the pudl-usage-metrics repository and collect usage metrics from the sources described below. We're interested in the following types of metrics to start:
As a first step, we should be able to ETL the logs and metrics from each of these data sources and get a weekly summary that we can look at. As a second step, we want to hook up our metrics to a private Superset dataset and build some dashboards for easy interpretation.
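As a rough illustration of the first step, a weekly rollup of parsed request logs might look like this (the column names here are hypothetical, just to show the shape of the aggregation):

```python
import pandas as pd

# Hypothetical schema: one row per request, from whichever source's ETL.
logs = pd.read_parquet("parsed_logs.parquet")
logs["timestamp"] = pd.to_datetime(logs["timestamp"])

# Weekly request counts and unique client IPs.
weekly = logs.groupby(pd.Grouper(key="timestamp", freq="W")).agg(
    requests=("request_id", "count"),
    unique_ips=("remote_ip", "nunique"),
)
print(weekly.tail())
```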
Out of scope
Infrastructure
The pudl-usage-metrics repository hasn't been maintained for a while. We'll need to get it up to speed to support this development work.
S3 Logs
Our main programmatic access method. S3 logs are currently mirrored to a GCS bucket. Each request produces one log record.
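Each record follows the standard S3 server access log format: space-delimited, with the timestamp in brackets and the request line quoted. A minimal parser for the fields we're likely to care about:

```python
import re

# Pattern for the leading fields of an S3 server access log record:
# bucket owner, bucket, time, remote IP, requester, request ID,
# operation, key, request URI, HTTP status.
LOG_PATTERN = re.compile(
    r'^(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)" '
    r'(?P<http_status>\S+)'
)

def parse_log_line(line: str) -> dict | None:
    """Parse one access log record into a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```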
Datasette
While we're planning to retire Datasette, it'd still be helpful to understand the history of usage and to see how usage changes during the transition to Superset. The log ETL that exists in pudl-usage-metrics hasn't worked since the transition to fly.io. fly.io currently doesn't retain logs for long, so we need to use the fly log shipper (https://github.com/superfly/fly-log-shipper) to send logs to S3.
It also doesn't log the IP address of the Datasette requests; the IP currently logged is presumably the load balancer's. Usually the load balancer includes some sort of "forwarded this request from original IP" information in the headers, so we should be able to extract that somehow. It seems we can't configure the Datasette access logs, so we'll need to set Datasette up behind something we can configure, like NGINX.
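Once we're logging headers (e.g. via NGINX in front of Datasette), recovering the client IP should be a matter of reading the forwarding header. A sketch of the idea:

```python
def client_ip(headers: dict[str, str], remote_addr: str) -> str:
    """Prefer the original client IP from X-Forwarded-For over the LB's IP.

    X-Forwarded-For is a comma-separated chain of proxies; the leftmost
    entry is the original client, assuming we trust our own load balancer
    to set the header honestly.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return remote_addr
```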
Superset
We're slowly deploying a new data visualization tool! It'll give us a lot of usage information, which we should collect and process. See https://engineering.hometogo.com/monitor-superset-usage-via-superset-c7f9fba79525 for a template.
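The approach in that post boils down to querying Superset's own metadata database, which records user actions in a logs table. A hedged sketch of what that might look like (the connection string is a placeholder, and the exact table/column names should be verified against our Superset version):

```python
import pandas as pd
import sqlalchemy as sa

# Placeholder DSN -- point this at the Superset metadata database.
engine = sa.create_engine("postgresql://superset:***@localhost:5432/superset")

# Superset records user actions (dashboard views, chart queries, ...) in its
# metadata DB; the schema varies by version, so check these names first.
query = """
    SELECT action, user_id, dashboard_id, dttm
    FROM logs
    WHERE dttm >= NOW() - INTERVAL '7 days'
"""
actions = pd.read_sql(query, engine)
print(actions.groupby("action").size())
```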
Zenodo
Zenodo API calls return stats on views and downloads for a record at a particular point in time. We should periodically (weekly?) collect stats on all of our archives on Zenodo and archive them for later processing.
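A sketch of what one point-in-time collection could look like, assuming the Zenodo records API's stats block (the record ID is a placeholder; in practice we'd loop over all of our archives):

```python
import datetime
import json

import requests

RECORD_ID = 123456  # placeholder -- we'd iterate over all PUDL archive records

resp = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
resp.raise_for_status()
stats = resp.json()["stats"]  # views, downloads, etc. as of right now

# Snapshot the point-in-time stats so we can diff them later.
snapshot = {
    "record": RECORD_ID,
    "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    **stats,
}
with open(f"zenodo_stats_{RECORD_ID}.json", "w") as f:
    json.dump(snapshot, f)
```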
Kaggle
Kaggle collects data on views and downloads through its dataset metadata JSON, accessible via the `api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET)` call on the `KaggleApi`. Like Zenodo, this is data reported at the time of query, so we'll need to archive these metrics to see changes over time.
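A sketch of the point-in-time collection using the call mentioned above, assuming it returns a JSON-serializable dict (the owner/dataset slugs are placeholders):

```python
import datetime
import json

from kaggle.api.kaggle_api_extended import KaggleApi

KAGGLE_OWNER = "catalystcooperative"  # placeholder -- use the real owner slug
KAGGLE_DATASET = "pudl-project"       # placeholder -- use the real dataset slug

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

metadata = api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET)

# Archive the point-in-time metadata so we can compute deltas over time.
stamp = datetime.date.today().isoformat()
with open(f"kaggle_metadata_{stamp}.json", "w") as f:
    json.dump(metadata, f)
```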
Github
Migrate our GitHub metrics archiving from the business repository, and add it to our ETL.
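For reference, the GitHub side can be collected from the repo traffic endpoints, which require a token with push access and only retain about 14 days of data, hence the need to archive on a schedule. A sketch:

```python
import os

import requests

OWNER, REPO = "catalyst-cooperative", "pudl"
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

# GitHub only retains ~14 days of traffic data, so this must run regularly.
views = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/traffic/views",
    headers=headers,
    timeout=30,
).json()
clones = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/traffic/clones",
    headers=headers,
    timeout=30,
).json()
print(views["count"], clones["count"])
```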
Reporting and Visualization
Once the data is processed, we'll need to analyze and report on metrics of interest in order to interpret changes in usage and highlight trends.
Some interesting references for Superset usage dashboards can be found here.