Decide on data exploration tool needs based on user conversations

bendnorman commented 2 months ago

Overview

We had some notion that we wanted to make a tool that would replace Datasette & allow access to all of our data, make charts & graphs easier, and let us gather some limited information about who our users are.

We've learned a lot by talking to many users over the last month - does that change anything about what our actual requirements are?

We've also learned about Superset & its strengths/limitations by prototyping it and testing it out over the last two months. We have a few questions to answer about its capabilities, and need to evaluate whether or not we want to continue trying to push it out to our users.

Success criteria

[x] we've enumerated what use cases we're trying to serve & their needs
[ ] we've figured out which users' needs are satisfied with Superset
[ ] we've figured out if there is a single tool that satisfies more users' needs
[ ] we've identified other tools for use cases that aren't covered by Superset/the other tool that seems even more promising

We also need answers for the following questions about Superset:

- [x] can a user use a pre-built dashboard to filter/subset data without SQL, then download as Excel? what are the limitations?
- [x] can a user use a pre-defined "Dataset" to filter/subset data without SQL, then download as Excel? what are the limitations?
- [x] can a no-SQL user create charts & graphs?
- [x] what are the performance and cost implications of using the `SQL_MAX_ROW` configuration value? how do user downloads affect egress?
- [x] how can we limit CSV downloads independent of the SQL query row limit?
- [x] will I somehow get incomplete data if I am subsetting data through data explorer or a dashboard?
- [x] investigate how to make datasets for a set of tables
- [x] see if you can send a link to a dataset/dashboard to someone who doesn’t have an account
- [ ] figure out how to redirect people directly to the welcome dashboard when they login

bendnorman commented 2 months ago

TL:DR: By creating superset datasets for every table in our database, users can easily create a chart and download it as a CSV or image. However, filtering all the rows and columns without SQL feels a little cumbersome. It's still unclear what the cost and performance repercussions are of increasing the max row size for chart and table creation.

1. Can a user use a pre-built dashboard to filter/subset data without SQL, then download as Excel? what are the limitations?

Short answer is yes but the number of rows they can view in a dashboard is limited. I was able to create a dashboard with a single table chart that can be filtered and downloaded as a CSV or excel file.

This feels like the most user-friendly no-SQL method filtering and downloading a large subset of table rows and columns. There are a couple of drawbacks:

We would need to create some deployment logic that would automatically create a dataset, chart, and dashboard for every table. Superset allows you to create these elements by importing a yaml file. We could create some yaml templates for dashboards, charts, and datasets and populate them using our metadata classes, then import them using the superset API.
It doesn't seem possible to create a table chart with all of the rows from a table. By default superset only returns 100k rows. You can change this by increasing the SQL_MAX_ROW variable in superset_config.py. This also increases the number of rows you can retrieve via SQL Lab. I'm not sure if this will be a cost issue for us because the duckdb file and our cloud SQL instance should be in the same region/network. I think we would only incur an egress cost if they download the data. I'm also not sure how to limit the size of a CSV download. Also, I'm not sure how increasing the SQL_MAX_ROW size will affect performance. Superset does offer a pagination option for Table charts but you can only download a single page at a time as a CSV. I left a question about these issues in Superset discussions.

2. can a user use a pre-defined "Dataset" to filter/subset data without SQL, then download as Excel? what are the limitations?

Short answer is yes, but superset Explore UI is more for building charts and not for quickly exploring and downloading data.

When you click on a dataset, it brings to the Explore UI. The left hand side shows a list of columns and any metrics (aggregated metrics defined in superset, we probably don't need these rn). You then have to select the Table chart, add the columns you need, apply any filters, specify the number of rows you want, and click "update chart". You can then download the displayed data as a CSV or excel using the sideways hamburger button in the upper right. There are some limitations:

There is a big blue "Save" button next to it which allows you to save the chart in superset. I could see some users thinking it is the button for downloading it as a csv.
There is no select all option when adding columns. You would need to add all columns one at a time which would be a huge pain if a user wanted all 110 columns of our generators table.
This generally feels like a workflow for creating charts and not for quickly filtering and downloading data. The filtering feels more difficult than the experience in CKAN. Is it important for users to be able to download the raw data or would being able to quickly create a chart be sufficient? I could imagine superset being a tools for creating quick visualizations and using s3 to deliver csvs of the full tables.
The same row limit questions and issues apply.

3. can a no-SQL user create charts & graphs?

Yes! It is the same process for creating a table chart. This felt like the most natural exploration method. Users can quickly add a couple of columns and metrics to a line or bar chart, plot the results, and download it as a PNG. However, I think we run into the same row limit issue as the other workflows. This is the description of the Row Limit option:

Limits the number of the rows that are computed in the query that is the source of the data used for this chart.

This makes me think if your filter returns > the row limit, your chart/table will be missing data. Again, it's unclear what cost and performance repercussions are of increasing the row limit beyond the default max of 50k.

We would also need to programmatically create datasets using the API. This seems pretty doable by creating dataset yaml files for each table and uploading them using the API.

Another advantage of creating datasets for every table is that we'd get table and column descriptions! Yay! You specify the description in the UI or in the yaml file. However, the formatting is limited:

zaneselvans commented 2 months ago

Is it important for users to be able to download the raw data or would being able to quickly create a chart be sufficient?

I think it's important to be able to download the data for local use in Excel. Many of the orgs / individuals we spoke with are very comfortable doing analysis and charting in Excel, and if they have to re-learn how to do that analysis or make charts through a web UI, I got the sense that many would not be interested.

At the same time, if we know that the "filter & download via a web UI" folks are using Excel locally, that puts an upper bound on the size of the data they'll actually be able to work with, and we probably don't need to worry about making it easy to download more than that. What are the hard limits / practical limits for how many rows Excel can handle?

In many cases I think you'd see folks selecting data associated with a single utility or power plant, which would often be less than 100K rows.

Another advantage of creating datasets for every table is that we'd get table and column descriptions! Having this context-dependent documentation is great, but I don't think it's going to be sufficient to make the large and growing universe of PUDL databases, tables, and columns navigable.

At a minimum I think we need:

Some kind of table grouping or hierarchical organization
Full-text search that covers table and column names, as well as table and column descriptions, ideally providing real-time updates of the list of matching entities.

This is why I started poking around to see what kinds of open source "data catalog" projects exist. It seems like those might (?) be closer to providing some of the services we need, at least for data discoverability. Not sure if any of them allow filter-and-download-CSV though.

zaneselvans commented 2 months ago

We might want to talk to some potential Excel users and see if they would be comfortable using Excel's built-in PowerQuery (Get & Transform) functionality to read data from an API. It seems like somewhat advanced spreadsheet usage, but would have the benefit of being an incremental thing to learn within a tool that the user is already familiar with and committed to, and hopefully it's a piece of functionality that's well documented and supported by Microsoft.

It looks like similar functionality in Google Sheets requires a 3rd party extension.

bendnorman commented 2 months ago

We revisited our user requirements and discussed if superset meets them. We generally agreed Superset doesn't meet all of the requirements perfectly but It should be able to serve most of our users. There are a handful of remaining question to answer before we move forward with user testing.

Will I somehow get incomplete data if I am subsetting data through data explorer or a dashboard?

Good news! Table charts in Dashboards and the Explore UI can filter the entire database! The amount of data returned is limited by these config variables:

# "row limit when requesting chart data"
ROW_LIMIT = 100_000
# "Maximum number of rows returned for any analytical database query"
SQL_MAX_ROW = 100_000

# I think ROW_LIMIT has to be less than or equal to SQL_MAX_ROW

# "row limit for native filters" I'm not sure exactly what this means but I think it generally controls the number of values
# that can appear in the filters. I think we should set this to a high number to accommodate for the number of values in common filter fields (state, utility, plant_id_eia, balancing authority...). 
NATIVE_FILTER_DEFAULT_ROW_LIMIT = 200_000
# "max rows retrieved by filter select auto complete"
FILTER_SELECT_ROW_LIMIT = 200_000

I made sure we can access the entire database by comparing the number of distinct values available in the dashboard filters and in SQL Lab:

I think plant_name is off by one because the filter includes a NULL value. When I filter for a value, superset returns the matching rows up to ROW_LIMIT. ROW_LIMIT also controls the number of rows you can download as a CSV. We should set ROW_LIMIT such that downloading ROW_LIMIT rows of our widest table is of an acceptable size given our cost constraints.

One issue with this UX is that superset doesn't produce a warning if a SQL query returns more rows than can be displayed on a chart. For example, I limited the table size to 10, then I filtered the out_eia__yearly_generators table where plant_id_eia = 8 which should return 120 rows and I didn't get an error. Also, if someone filters a table in a way that will return more rows than SQL_MAX_ROW, superset has no way of knowing if there are additional rows that should be returned.

We could mitigate this by only providing filters on columns that will return less rows than ROW_LIMIT.

I also noticed some potentially helpful filter settings:

You can require a filter to be set. I could see this as a way to avoid users expecting all the data but only getting ROW_LIMIT number of rows.
You can dynamically load filter values which should improve the initial filter UI load time. Prior to enabling this loading all 16k filter values was a little slow.

How can we programmatically create datasets?

I think for this first round of user testing we should just manually create a handful of datasets and dashboards ourselves. Programatically creating datasets, charts and dashboards seems doable. We'll have to create some yaml templates, populate them with our table metadata then import them to superset using the API.

Can I share dashboards with someone who doesn't have an account?

Yes! We can give permissions to the Public role. We can also assign dashboard level access to the Public role. What superset features would we like unregistered users to be able to see? I'm imagining this would be helpful for interesting dashboards we share on socials. I think we'll want to require registration for most features: SQL lab, chart creation, data filtering/downloading.

figure out how to redirect people directly to the welcome dashboard when they login

I'm not sure how to do this but it feels like it should be do able.

bendnorman commented 2 months ago

Would love for people to test out filtering a table chart on a Dashboard before we test this with users!

jdangerx commented 2 months ago

Poked around OpenMetadata.

Pros/Cons

Pros: * there's a search bar & it does a nice fuzzy search of the full text. great for the "do you have data that interests me?" part. * better than ctrl-f in that it does real search things, presumably doing stemming & TF-IDF stuff. * it shows sample data! * it also shows column-based metrics: how many of these values are unique? how many are null? etc. * you can download tables directly as CSVs * I guess collaborative commenting on data is cool? Cons: * documentation of many features is totally non-existent - such as the table download functionality... * you can download tables, but only full tables, as CSVs - as far as I can tell, no way to subset beforehand * no visualization tools * It's designed to work with an Airflow instance to ingest a bunch of metadata, so we'd have to learn/mess with that. * looks like we'd need to do some custom work with the Airflow ingestion to add the metadata we want (e.g. a download link for a parquet file, a dataset link for Superset).

Unsurprisingly, it focuses on answering the "do you have data that interests me?" but does a good job of it - giving lots of high level information about a dataset.

Seems like it also does a lot more than we really need. While our metadata is too big and complicated for a human brain to comprehend, it's not at the scale that a lot of these data catalog tools are designed for. The core thing we need is, I think:

full text search of table schema (and maybe have some tagging functionality too? not sure about that)
display metadata:
- description of table
- column description
- parquet link (stable + nightly/"preliminary")
- superset dataset link

Which we could get with Ctrl-F on our existing data dictionary, and replacing the "browse or query this table in Datasette" link with the other ones.

And we could incrementally improve that by injecting some Javascript to allow filtering of the view

I guess we could do the "Superset Meta-dashboard", though I think that doesn't support full-text search either. You'd have to go into the data explorer to be able to filter by "table description ILIKE 'generation'"

bendnorman commented 2 months ago

Thanks for looking into open metadata @jdangerx! I've also been poking around at the metadata tools. Let's create a separate issue or discussion to add our thoughts.

bendnorman commented 1 month ago

Questions for other Inframundo folks

Does the dashboard filtering feel like a satisfactory UX for downloading CSVs without SQL? Check out this dashboard.
Less important but how are people imagining we’ll use dashboards that don’t require registration?

e-belfer commented 1 month ago

We'll need to scream at people about the 100k row limit, probably by enforcing at least one column as a filter on these charts (e.g., plant ID, year or state). And, we'll need to scream about where the download link is. But I think this dashboard does the trick.

Note that selecting "Show all" even for a year of data in this dashboard causes some major slowdowns, so we'll want to encourage fairly small slices.

In terms of chart settings:

"Allow columns to be rearranged" is a nice one!
Color +/- and cell bars are both unnecessary and could be unchecked
"Search box" is also nice for searching but doesn't propagate to downloads, so we should probably pass on it and push people to use the filtering options

Less important but how are people imagining we’ll use dashboards that don’t require registration?

I think this would be mostly a welcome page with some information about the project, links to the metadata etc. We could show a few visualizations without the option to download (if this is possible) to encourage registration, but as a MVP I think this could mostly be pulled from our site and the draft dashboard I'd made some time ago?

zschira commented 1 month ago

@bendnorman I did some playing around with filtering on the dashboard, and overall it feels great! Here's a couple thoughts I had while playing around:

Might be nice to have filter on a numerical column like net_generation_mwh just to get a sense for how that feels
Given that one of the main superset advantages over datasette is that it can handle larger tables, it might be nice to try and include a big hourly table in user testing. This way we could really stress test performance and see if users can effectively explore larger tables

As for the row_limit issues, I do think this could be pretty confusing for users given that there's no warning, and not even any obvious indication of the number of rows after applying filters. At bare minimum, I think we should have a prominent warning in the welcome page or something to inform users that they can't download a csv with more than row_limit records.

I could see it being a kind of big problem if someone downloaded data thinking it was complete, then did analysis on that data and came up with totally erroneous results due to missing data.

jdangerx commented 1 month ago

Might be nice to have filter on a numerical column like net_generation_mwh just to get a sense for how that feels

I tried setting this up! You can "Add/Edit filters" on the dashboard and add a "numerical range" filter. This turns into a slider, without any options to type in specific range values. It's a "meh" experience. Fast, but you'll have to set the range slightly wider than you need, then download and re-filter locally, which is annoying.

Does the dashboard filtering feel like a satisfactory UX for downloading CSVs without SQL? Check out this dashboard.

I think it's not good but good enough. We can funnel users with greater needs towards more advanced functionality.

Side note: while I believe people don't want to "skill up", I also think people are smart enough to modify the SELECT col1, col2, col3 FROM table WHERE col1 <= 1 AND col2 > 2 template without "learning SQL." And that lets people filter things down to excellable size.

it seems clear that you need to select filters and then click "apply filter" to actually apply them ~ the numerical range thing is just OK.
the row limit thing is sneaky, though we can include a big warning at the top of every dashboard. "if you have more than 500 pages of 200 rows/page it is being truncated!!! please filter!"

Less important but how are people imagining we’ll use dashboards that don’t require registration?

I think we'd mostly be using them to show off our cool data & the tools you can build with it. Then people who are interested in building their own tools can register. I like Ella's point about the landing page being public too!

bendnorman commented 1 month ago

@zaneselvans thoughts from Slack:

I thought that the “Dataset” interface was good enough at CSV downloads to at least give it a shot with some users and see what they thought.

Being able to share links to a particular query output or visualization is really nice for collaboration and social media / embedding. But there’s also an advantage to us for people needing to log in, if it doesn’t mean lots of people bounce. Since then we get to know who they are. So I was imagining that the public would be view-only, no downloads, no fiddling with filters or queries. But I don’t think this is vital at the moment while we’re still experimenting.

bendnorman commented 1 month ago

Great feedback thanks y'all. I'll start creating some new issues for user testing.

I was able to add the row count has a big number to the dashboard and have it turn red if it goes over our row limit which is another simple way to remind folks.

I think ideally we'd have a very clear download button that pops up a warning window if a user is trying to download more rows than our limit.

bendnorman commented 1 month ago

I also added a sample data dictionary to that dashboard.

catalyst-cooperative / pudl