catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
481 stars 110 forks source link

Decide on data exploration tool needs based on user conversations #3832

Closed bendnorman closed 1 month ago

bendnorman commented 2 months ago

Overview

We had some notion that we wanted to make a tool that would replace Datasette & allow access to all of our data, make charts & graphs easier, and let us gather some limited information about who our users are.

We've learned a lot by talking to many users over the last month - does that change anything about what our actual requirements are?

We've also learned about Superset & its strengths/limitations by prototyping it and testing it out over the last two months. We have a few questions to answer about its capabilities, and need to evaluate whether or not we want to continue trying to push it out to our users.

Success criteria

We also need answers for the following questions about Superset:

- [x] can a user use a pre-built dashboard to filter/subset data without SQL, then download as Excel? what are the limitations?
- [x] can a user use a pre-defined "Dataset" to filter/subset data without SQL, then download as Excel? what are the limitations?
- [x] can a no-SQL user create charts & graphs?
- [x] what are the performance and cost implications of using the `SQL_MAX_ROW` configuration value? how do user downloads affect egress?
- [x] how can we limit CSV downloads independent of the SQL query row limit?
- [x] will I somehow get incomplete data if I am subsetting data through data explorer or a dashboard?
- [x] investigate how to make datasets for a set of tables
- [x] see if you can send a link to a dataset/dashboard to someone who doesn’t have an account
- [ ] figure out how to redirect people directly to the welcome dashboard when they login
bendnorman commented 2 months ago

TL:DR: By creating superset datasets for every table in our database, users can easily create a chart and download it as a CSV or image. However, filtering all the rows and columns without SQL feels a little cumbersome. It's still unclear what the cost and performance repercussions are of increasing the max row size for chart and table creation.

1. Can a user use a pre-built dashboard to filter/subset data without SQL, then download as Excel? what are the limitations?

Short answer is yes but the number of rows they can view in a dashboard is limited. I was able to create a dashboard with a single table chart that can be filtered and downloaded as a CSV or excel file.

This feels like the most user-friendly no-SQL method filtering and downloading a large subset of table rows and columns. There are a couple of drawbacks:

2. can a user use a pre-defined "Dataset" to filter/subset data without SQL, then download as Excel? what are the limitations?

Short answer is yes, but superset Explore UI is more for building charts and not for quickly exploring and downloading data.

When you click on a dataset, it brings to the Explore UI. The left hand side shows a list of columns and any metrics (aggregated metrics defined in superset, we probably don't need these rn). You then have to select the Table chart, add the columns you need, apply any filters, specify the number of rows you want, and click "update chart". You can then download the displayed data as a CSV or excel using the sideways hamburger button in the upper right. There are some limitations:

3. can a no-SQL user create charts & graphs?

Yes! It is the same process for creating a table chart. This felt like the most natural exploration method. Users can quickly add a couple of columns and metrics to a line or bar chart, plot the results, and download it as a PNG. However, I think we run into the same row limit issue as the other workflows. This is the description of the Row Limit option:

Limits the number of the rows that are computed in the query that is the source of the data used for this chart.

This makes me think if your filter returns > the row limit, your chart/table will be missing data. Again, it's unclear what cost and performance repercussions are of increasing the row limit beyond the default max of 50k.

We would also need to programmatically create datasets using the API. This seems pretty doable by creating dataset yaml files for each table and uploading them using the API.

Another advantage of creating datasets for every table is that we'd get table and column descriptions! Yay! You specify the description in the UI or in the yaml file. However, the formatting is limited:

Image

Image

zaneselvans commented 2 months ago

Is it important for users to be able to download the raw data or would being able to quickly create a chart be sufficient?

I think it's important to be able to download the data for local use in Excel. Many of the orgs / individuals we spoke with are very comfortable doing analysis and charting in Excel, and if they have to re-learn how to do that analysis or make charts through a web UI, I got the sense that many would not be interested.

At the same time, if we know that the "filter & download via a web UI" folks are using Excel locally, that puts an upper bound on the size of the data they'll actually be able to work with, and we probably don't need to worry about making it easy to download more than that. What are the hard limits / practical limits for how many rows Excel can handle?

In many cases I think you'd see folks selecting data associated with a single utility or power plant, which would often be less than 100K rows.

Another advantage of creating datasets for every table is that we'd get table and column descriptions! Having this context-dependent documentation is great, but I don't think it's going to be sufficient to make the large and growing universe of PUDL databases, tables, and columns navigable.

At a minimum I think we need:

This is why I started poking around to see what kinds of open source "data catalog" projects exist. It seems like those might (?) be closer to providing some of the services we need, at least for data discoverability. Not sure if any of them allow filter-and-download-CSV though.

zaneselvans commented 2 months ago

We might want to talk to some potential Excel users and see if they would be comfortable using Excel's built-in PowerQuery (Get & Transform) functionality to read data from an API. It seems like somewhat advanced spreadsheet usage, but would have the benefit of being an incremental thing to learn within a tool that the user is already familiar with and committed to, and hopefully it's a piece of functionality that's well documented and supported by Microsoft.

It looks like similar functionality in Google Sheets requires a 3rd party extension.

bendnorman commented 2 months ago

We revisited our user requirements and discussed if superset meets them. We generally agreed Superset doesn't meet all of the requirements perfectly but It should be able to serve most of our users. There are a handful of remaining question to answer before we move forward with user testing.

Will I somehow get incomplete data if I am subsetting data through data explorer or a dashboard?

Good news! Table charts in Dashboards and the Explore UI can filter the entire database! The amount of data returned is limited by these config variables:

# "row limit when requesting chart data"
ROW_LIMIT = 100_000
# "Maximum number of rows returned for any analytical database query"
SQL_MAX_ROW = 100_000

# I think ROW_LIMIT has to be less than or equal to SQL_MAX_ROW

# "row limit for native filters" I'm not sure exactly what this means but I think it generally controls the number of values
# that can appear in the filters. I think we should set this to a high number to accommodate for the number of values in common filter fields (state, utility, plant_id_eia, balancing authority...). 
NATIVE_FILTER_DEFAULT_ROW_LIMIT = 200_000
# "max rows retrieved by filter select auto complete"
FILTER_SELECT_ROW_LIMIT = 200_000

I made sure we can access the entire database by comparing the number of distinct values available in the dashboard filters and in SQL Lab:

image image

I think plant_name is off by one because the filter includes a NULL value. When I filter for a value, superset returns the matching rows up to ROW_LIMIT. ROW_LIMIT also controls the number of rows you can download as a CSV. We should set ROW_LIMIT such that downloading ROW_LIMIT rows of our widest table is of an acceptable size given our cost constraints.

One issue with this UX is that superset doesn't produce a warning if a SQL query returns more rows than can be displayed on a chart. For example, I limited the table size to 10, then I filtered the out_eia__yearly_generators table where plant_id_eia = 8 which should return 120 rows and I didn't get an error. Also, if someone filters a table in a way that will return more rows than SQL_MAX_ROW, superset has no way of knowing if there are additional rows that should be returned.

We could mitigate this by only providing filters on columns that will return less rows than ROW_LIMIT.

I also noticed some potentially helpful filter settings:

How can we programmatically create datasets?

I think for this first round of user testing we should just manually create a handful of datasets and dashboards ourselves. Programatically creating datasets, charts and dashboards seems doable. We'll have to create some yaml templates, populate them with our table metadata then import them to superset using the API.

Can I share dashboards with someone who doesn't have an account?

Yes! We can give permissions to the Public role. We can also assign dashboard level access to the Public role. What superset features would we like unregistered users to be able to see? I'm imagining this would be helpful for interesting dashboards we share on socials. I think we'll want to require registration for most features: SQL lab, chart creation, data filtering/downloading.

figure out how to redirect people directly to the welcome dashboard when they login

I'm not sure how to do this but it feels like it should be do able.

bendnorman commented 2 months ago

Would love for people to test out filtering a table chart on a Dashboard before we test this with users!

jdangerx commented 2 months ago

Poked around OpenMetadata.

Pros/Cons Pros: * there's a search bar & it does a nice fuzzy search of the full text. great for the "do you have data that interests me?" part. * better than ctrl-f in that it does real search things, presumably doing stemming & TF-IDF stuff. * it shows sample data! * it also shows column-based metrics: how many of these values are unique? how many are null? etc. * you can download tables directly as CSVs * I guess collaborative commenting on data is cool? Cons: * documentation of many features is totally non-existent - such as the table download functionality... * you can download tables, but only full tables, as CSVs - as far as I can tell, no way to subset beforehand * no visualization tools * It's designed to work with an Airflow instance to ingest a bunch of metadata, so we'd have to learn/mess with that. * looks like we'd need to do some custom work with the Airflow ingestion to add the metadata we want (e.g. a download link for a parquet file, a dataset link for Superset).

Unsurprisingly, it focuses on answering the "do you have data that interests me?" but does a good job of it - giving lots of high level information about a dataset.

Seems like it also does a lot more than we really need. While our metadata is too big and complicated for a human brain to comprehend, it's not at the scale that a lot of these data catalog tools are designed for. The core thing we need is, I think:

Which we could get with Ctrl-F on our existing data dictionary, and replacing the "browse or query this table in Datasette" link with the other ones.

And we could incrementally improve that by injecting some Javascript to allow filtering of the view

I guess we could do the "Superset Meta-dashboard", though I think that doesn't support full-text search either. You'd have to go into the data explorer to be able to filter by "table description ILIKE 'generation'"

bendnorman commented 2 months ago

Thanks for looking into open metadata @jdangerx! I've also been poking around at the metadata tools. Let's create a separate issue or discussion to add our thoughts.

bendnorman commented 1 month ago

Questions for other Inframundo folks

e-belfer commented 1 month ago

We'll need to scream at people about the 100k row limit, probably by enforcing at least one column as a filter on these charts (e.g., plant ID, year or state). And, we'll need to scream about where the download link is. But I think this dashboard does the trick.

Note that selecting "Show all" even for a year of data in this dashboard causes some major slowdowns, so we'll want to encourage fairly small slices.

In terms of chart settings:

Less important but how are people imagining we’ll use dashboards that don’t require registration?

I think this would be mostly a welcome page with some information about the project, links to the metadata etc. We could show a few visualizations without the option to download (if this is possible) to encourage registration, but as a MVP I think this could mostly be pulled from our site and the draft dashboard I'd made some time ago?

zschira commented 1 month ago

@bendnorman I did some playing around with filtering on the dashboard, and overall it feels great! Here's a couple thoughts I had while playing around:

As for the row_limit issues, I do think this could be pretty confusing for users given that there's no warning, and not even any obvious indication of the number of rows after applying filters. At bare minimum, I think we should have a prominent warning in the welcome page or something to inform users that they can't download a csv with more than row_limit records.

I could see it being a kind of big problem if someone downloaded data thinking it was complete, then did analysis on that data and came up with totally erroneous results due to missing data.

jdangerx commented 1 month ago
  • Might be nice to have filter on a numerical column like net_generation_mwh just to get a sense for how that feels

I tried setting this up! You can "Add/Edit filters" on the dashboard and add a "numerical range" filter. This turns into a slider, without any options to type in specific range values. It's a "meh" experience. Fast, but you'll have to set the range slightly wider than you need, then download and re-filter locally, which is annoying.

  • Does the dashboard filtering feel like a satisfactory UX for downloading CSVs without SQL? Check out this dashboard.

I think it's not good but good enough. We can funnel users with greater needs towards more advanced functionality.

Side note: while I believe people don't want to "skill up", I also think people are smart enough to modify the SELECT col1, col2, col3 FROM table WHERE col1 <= 1 AND col2 > 2 template without "learning SQL." And that lets people filter things down to excellable size.

  • Less important but how are people imagining we’ll use dashboards that don’t require registration?

I think we'd mostly be using them to show off our cool data & the tools you can build with it. Then people who are interested in building their own tools can register. I like Ella's point about the landing page being public too!

bendnorman commented 1 month ago

@zaneselvans thoughts from Slack:

I thought that the “Dataset” interface was good enough at CSV downloads to at least give it a shot with some users and see what they thought.

Being able to share links to a particular query output or visualization is really nice for collaboration and social media / embedding. But there’s also an advantage to us for people needing to log in, if it doesn’t mean lots of people bounce. Since then we get to know who they are. So I was imagining that the public would be view-only, no downloads, no fiddling with filters or queries. But I don’t think this is vital at the moment while we’re still experimenting.

bendnorman commented 1 month ago

Great feedback thanks y'all. I'll start creating some new issues for user testing.

I was able to add the row count has a big number to the dashboard and have it turn red if it goes over our row limit which is another simple way to remind folks.

I think ideally we'd have a very clear download button that pops up a warning window if a user is trying to download more rows than our limit.

bendnorman commented 1 month ago

I also added a sample data dictionary to that dashboard.