catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
479 stars 110 forks source link

Prototype PUDL use of existing DB documentation tools #3746

Closed zaneselvans closed 1 month ago

zaneselvans commented 3 months ago

Overview

We currently produce our own data dictionary for PUDL using Python, some Jinja templates, and the Sphinx documentation build. With 200+ tables and 3000+ columns, the output is overwhelming and harder to navigate than it should be.

Documenting a database schema seems like a common task that there should already be tools for. Find some free and/or open opions and explore them for potential use replacing our not particularly ergonomic data dictionary.

Searching around a bit, the tooling seems pretty sparse, at least on the free-and-open side of things. Some options I've found include:

The DBMS Tools site has an almost bewildering catalog of database tools, including:

They let you filter results by the DB type, but they don't have DuckDB as an option, and many of the tools are older and have been discontinued.

### Next steps
- [x] Create [example PUDL deployment](https://dbdocs.io/zaneselvans/pudl) using DB Docs
- [x] Make sure that table & column comments are included in DuckDB schema
- [ ] Create example PUDL DB docs using SchemaSpy via Docker
zaneselvans commented 3 months ago
jdangerx commented 3 months ago

The idea here is that this would dovetail with rolling out Superset to the masses? Some sort of "now we have a new tool, and there's a nice way to understand the data for once" combo?

zaneselvans commented 3 months ago

@jdangerx I think this meshes both with the Google Season of Docs work that @Nancy9ice is working on, and the Superset rollout -- the data dictionary page is quite unwieldy at this point, and folks need a better searchable / hierarchical system for understanding what all is available. This has come up in the docs overhaul planning, but I don't think we want to invest a bunch of time in building our own bespoke database schema explorer / docs pages that we need to make sure get updated and maintained.

For Superset it'd be great if it was built in to the platform, but also we probably want this kind of documentation to exist alongside the normal docs. It seems strange to me that there's no off-the-shelf open source database schema to documentation website conversion tool.

jdangerx commented 1 month ago

We did some research and are going to move forward with some lightweight client-side stuff for now - see https://github.com/orgs/catalyst-cooperative/discussions/3851