catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Deploy superset #3703

Closed: bendnorman closed this 1 month ago

bendnorman commented 3 months ago
### Required
- [x] Mount duck.db file as volume to cloud run
- [x] Figure out why terraform is out of sync since adding GCS as a volume in the UI
- [x] Figure out what logs we capture and how to store them
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3804
- [x] Rework the connection string to use sa.URL and pg8000
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3739
- [x] Add row limits to SQL Lab (the default 100,000 is probably sufficient)
- [x] Convert pudl.sqlite to .duckdb file. We can just use our existing script for now.
- [x] Create superset docker image and deploy it to Cloud Run or fly.io
- [x] Set up Cloud SQL database for superset backend
- [x] Iron out superset permissions. Users should not be able to add new data sources or create dashboards (for now), but should have access to SQL Lab
- [x] Use postgres as backend for local development
- [x] Specify superset dockerfile version?
### Nice to haves
- [ ] CSV export crashes the site when trying to export 100,000 rows
- [x] Set up auth0. How can we associate auth0 accounts with admin and gamma permissions? Catalyst accounts should have the Admin permission and everyone else should have our custom "Public" role.
- [ ] Figure out how to set a default schema in SQL Lab
- [x] Publish to explore.catalyst.coop (we can use the autogenerated url for now)
- [ ] Figure out how to populate a default query in SQL Lab
- [x] Create infra resources with terraform
- [x] Rename auth0 project
- [ ] Figure out why browser automatically uses previous credentials when logging back into superset
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3802
bendnorman commented 2 months ago

Also, in the com dev meeting we decided doing superset demos during the required interviews doesn't make a ton of sense, so our July 17th deadline is moot. Do we still want to aim to have a public beta by the end of the month?

bendnorman commented 2 months ago

### Creating superset user roles

Superset provides a few predefined roles: Admin, Alpha, Gamma, Public, and sql_lab. We want people who register to assume the Gamma role because it has the fewest permissions. By default, Gamma users don't have access to databases or SQL Lab, so we need to create a new role. To do this I exported the existing roles as JSON and created a new role called GammaSQLLab with the Gamma and sql_lab permissions. I also needed to add the all_database_access permission so it could access the database. This role can query the duckdb database and create charts and dashboards, but can't edit or add data sources.

You can export and import roles using these commands in the superset container:

```sh
superset fab export-roles --path {path to write roles to}.json
superset fab import-roles --path {path to edited roles}.json
```

Once the GammaSQLLab role is created, we can set it as the default registration role in config_superset.py:

```python
AUTH_USER_REGISTRATION_ROLE = "GammaSQLLab"
```
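For context, here's a minimal sketch of how that setting might sit alongside the other self-registration settings, assuming the standard Flask-AppBuilder OAuth setup (the surrounding values are illustrative, not necessarily our exact config):

```python
# Sketch of the self-registration settings in the superset config file.
# Only AUTH_USER_REGISTRATION_ROLE is confirmed above; the rest is the
# standard Flask-AppBuilder OAuth setup and may differ from our actual config.
from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH                       # log in via auth0
AUTH_USER_REGISTRATION = True                # create a superset user on first login
AUTH_USER_REGISTRATION_ROLE = "GammaSQLLab"  # role assigned to self-registered users
```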

### Mapping auth0 accounts to roles

This is our desired registration and permissions workflow:

The correct way to do this is to assign users to oauth groups and map those groups to superset roles. I can't figure out how to create and manage groups using auth0. I tried using this Authorization extension but didn't get very far. I'm sure there is a way to make this work but it's above my pay grade.
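For reference, if we ever get auth0 to include a group claim in the user info it returns, Flask-AppBuilder can map provider groups to superset roles directly; a rough sketch of what that config could look like (the group names here are hypothetical):

```python
# Hypothetical sketch: map auth0 group claims to superset roles.
# This assumes auth0 returns a groups/roles claim for each user,
# which is the part I couldn't get working.
AUTH_ROLES_SYNC_AT_LOGIN = True      # re-sync roles on every login
AUTH_ROLES_MAPPING = {
    "catalyst-staff": ["Admin"],     # hypothetical auth0 group for Catalyst members
    "public": ["GammaSQLLab"],       # hypothetical group for everyone else
}
```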

I figured out a workaround for now. We can create a superset-admin@catalyst.coop user in the auth0 User Management tab. Then, we can create a new superset admin user inside the container with this command:

```sh
superset fab create-admin \
    --username 'auth0_auth0|{user_id generated by auth0}' \
    --firstname Superset \
    --lastname Admin \
    --email {email} \
    --password {password created in auth0}
```

Now we can log into superset with this admin account and give catalyst accounts admin or alpha permissions.

bendnorman commented 2 months ago

Now that the auth0 and permissions stuff is mostly working, I'm going to move on to the hosting infrastructure.

  1. Push the docker image to Artifact Registry and update the Cloud Run instance. We can use the default sqlite superset database for now.
  2. Create a Cloud SQL instance and hook it up to superset running on cloud run
  3. Copy the duckdb and superset config file to GCS bucket
  4. Run superset setup commands
zschira commented 2 months ago

@bendnorman it sounds reasonable to me to extend the timeline. Seems like a good thing to discuss during inframundo sprint planning today.

bendnorman commented 1 month ago

Some notes on usage data we have access to:

The most valuable information tracked in the superset database is the queries and registered users. The query table in the superset database contains all user queries and the user id. We can use this to see what types of queries people are doing and which tables people are accessing. Here is a query to access this information:

```sql
SELECT
    users.first_name,
    users.last_name,
    queries.*
FROM
    "public"."query" AS queries
LEFT JOIN
    "public"."ab_user" AS users
ON
    users.id = queries.user_id
ORDER BY start_time DESC
LIMIT 1000;
```

I'm not sure if it's possible to track how long people spend on the site. Any thoughts @jdangerx?

We should also probably save all the Cloud Run logs to BigQuery or Cloud Storage using a log sink. By default, logs are only retained for 30 days. I think these logs will be helpful for debugging Cloud Run failures and tracking Cloud Run resource use and total downloads.
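Roughly what creating such a sink could look like with the google-cloud-logging client (the project, dataset, and sink names are placeholders, and we'd more likely create this with terraform or gcloud):

```python
# Hypothetical sketch: route Cloud Run logs to a BigQuery dataset via a log sink.
# Project, dataset, and sink names below are placeholders.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")
sink = client.sink(
    "superset-cloud-run-logs",
    filter_='resource.type="cloud_run_revision"',
    destination="bigquery.googleapis.com/projects/my-gcp-project/datasets/superset_logs",
)
if not sink.exists():
    sink.create()
```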

bendnorman commented 1 month ago

Also @jdangerx, the auth0 screen has a warning about using development OAuth keys in production. I wonder if this is related to the HTTP redirect issue we're having.