TL;DR: By creating Superset datasets for every table in our database, users can easily create a chart and download it as a CSV or image. However, filtering all the rows and columns without SQL feels a little cumbersome. It's still unclear what the cost and performance repercussions are of increasing the max row size for chart and table creation.
Short answer is yes, but the number of rows users can view in a dashboard is limited. I was able to create a dashboard with a single table chart that can be filtered and downloaded as a CSV or Excel file.
This feels like the most user-friendly no-SQL method for filtering and downloading a large subset of table rows and columns. There are a couple of drawbacks:

- The number of rows you can retrieve is capped by the `SQL_MAX_ROW` variable in `superset_config.py`. Raising it also increases the number of rows you can retrieve via SQL Lab. I'm not sure if this will be a cost issue for us because the duckdb file and our Cloud SQL instance should be in the same region/network; I think we would only incur an egress cost if users download the data. I'm also not sure how to limit the size of a CSV download, or how increasing `SQL_MAX_ROW` will affect performance.
- Superset does offer a pagination option for Table charts, but you can only download a single page at a time as a CSV.

I left a question about these issues in Superset discussions.

The short answer is yes, but the Superset Explore UI is more for building charts than for quickly exploring and downloading data.
When you click on a dataset, it brings you to the Explore UI. The left-hand side shows a list of columns and any metrics (aggregated metrics defined in Superset; we probably don't need these rn). You then have to select the Table chart, add the columns you need, apply any filters, specify the number of rows you want, and click "update chart". You can then download the displayed data as a CSV or Excel file using the sideways hamburger button in the upper right. There are some limitations:
Yes! It's the same process as creating a table chart. This felt like the most natural exploration method. Users can quickly add a couple of columns and metrics to a line or bar chart, plot the results, and download it as a PNG. However, I think we run into the same row limit issue as in the other workflows. This is the description of the Row Limit option:
Limits the number of the rows that are computed in the query that is the source of the data used for this chart.
This makes me think that if your filter returns more rows than the row limit, your chart/table will be silently missing data. Again, it's unclear what the cost and performance repercussions are of increasing the row limit beyond the default max of 50k.
We would also need to programmatically create datasets using the API. This seems pretty doable by creating dataset yaml files for each table and uploading them using the API.
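As a rough sketch of what that could look like (the host, credentials, database ID, and table list below are all placeholders; the endpoints come from Superset's REST API but should be double-checked):

```python
# Sketch: register each PUDL table as a Superset dataset via the REST API.
# Host, credentials, database ID, and table list are placeholders.
import requests

SUPERSET_URL = "https://superset.example.com"
TABLES = ["out_eia__yearly_generators"]  # ...one entry per table

session = requests.Session()

# Log in to get a JWT access token.
resp = session.post(
    f"{SUPERSET_URL}/api/v1/security/login",
    json={"username": "admin", "password": "...", "provider": "db", "refresh": True},
)
session.headers["Authorization"] = f"Bearer {resp.json()['access_token']}"
# Depending on config, POSTs may also need a CSRF token from
# /api/v1/security/csrf_token/.

# Register each table as a dataset against database ID 1.
for table in TABLES:
    session.post(
        f"{SUPERSET_URL}/api/v1/dataset/",
        json={"database": 1, "schema": "main", "table_name": table},
    )
```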
Another advantage of creating datasets for every table is that we'd get table and column descriptions! Yay! You specify the description in the UI or in the yaml file. However, the formatting is limited:
Is it important for users to be able to download the raw data or would being able to quickly create a chart be sufficient?
I think it's important to be able to download the data for local use in Excel. Many of the orgs / individuals we spoke with are very comfortable doing analysis and charting in Excel, and if they have to re-learn how to do that analysis or make charts through a web UI, I got the sense that many would not be interested.
At the same time, if we know that the "filter & download via a web UI" folks are using Excel locally, that puts an upper bound on the size of the data they'll actually be able to work with, and we probably don't need to worry about making it easy to download more than that. What are the hard limits / practical limits for how many rows Excel can handle?
In many cases I think you'd see folks selecting data associated with a single utility or power plant, which would often be less than 100K rows.
Another advantage of creating datasets for every table is that we'd get table and column descriptions! Having this context-dependent documentation is great, but I don't think it's going to be sufficient to make the large and growing universe of PUDL databases, tables, and columns navigable.
At a minimum I think we need:
This is why I started poking around to see what kinds of open source "data catalog" projects exist. It seems like those might (?) be closer to providing some of the services we need, at least for data discoverability. Not sure if any of them allow filter-and-download-CSV though.
We might want to talk to some potential Excel users and see if they would be comfortable using Excel's built-in PowerQuery (Get & Transform) functionality to read data from an API. It seems like somewhat advanced spreadsheet usage, but would have the benefit of being an incremental thing to learn within a tool that the user is already familiar with and committed to, and hopefully it's a piece of functionality that's well documented and supported by Microsoft.
It looks like similar functionality in Google Sheets requires a 3rd party extension.
We revisited our user requirements and discussed whether Superset meets them. We generally agreed Superset doesn't meet all of the requirements perfectly, but it should be able to serve most of our users. There are a handful of remaining questions to answer before we move forward with user testing.
Good news! Table charts in Dashboards and the Explore UI can filter the entire database! The amount of data returned is limited by these config variables:
# "row limit when requesting chart data"
ROW_LIMIT = 100_000
# "Maximum number of rows returned for any analytical database query"
SQL_MAX_ROW = 100_000
# I think ROW_LIMIT has to be less than or equal to SQL_MAX_ROW
# "row limit for native filters" I'm not sure exactly what this means but I think it generally controls the number of values
# that can appear in the filters. I think we should set this to a high number to accommodate for the number of values in common filter fields (state, utility, plant_id_eia, balancing authority...).
NATIVE_FILTER_DEFAULT_ROW_LIMIT = 200_000
# "max rows retrieved by filter select auto complete"
FILTER_SELECT_ROW_LIMIT = 200_000
I made sure we can access the entire database by comparing the number of distinct values available in the dashboard filters and in SQL Lab:
I think `plant_name` is off by one because the filter includes a NULL value. When I filter for a value, Superset returns the matching rows up to `ROW_LIMIT`. `ROW_LIMIT` also controls the number of rows you can download as a CSV. We should set `ROW_LIMIT` such that downloading `ROW_LIMIT` rows of our widest table is an acceptable size given our cost constraints.
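For a rough sense of scale (the column count and bytes-per-cell below are guesses, not measurements of our actual tables):

```python
# Back-of-envelope estimate of a max-size CSV download.
n_rows = 100_000       # ROW_LIMIT
n_cols = 100           # roughly our widest table (a guess)
bytes_per_cell = 10    # average formatted field width, delimiter included (a guess)

size_mb = n_rows * n_cols * bytes_per_cell / 1_000_000
print(f"~{size_mb:.0f} MB per max-size CSV download")  # ~100 MB
```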
One issue with this UX is that Superset doesn't produce a warning if a SQL query returns more rows than can be displayed on a chart. For example, I limited the table size to 10, then filtered the `out_eia__yearly_generators` table where `plant_id_eia = 8`, which should return 120 rows, and I didn't get an error. Also, if someone filters a table in a way that returns more rows than `SQL_MAX_ROW`, Superset has no way of knowing that there are additional rows that should be returned. We could mitigate this by only providing filters on columns whose values return fewer rows than `ROW_LIMIT`; a quick check along these lines is sketched below.
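Here's a minimal sketch of that check, assuming a local duckdb file (the path is a placeholder) and using `state` as the candidate filter column:

```python
# Find filter values whose matching row counts exceed ROW_LIMIT, i.e. values
# that would be silently truncated if exposed as a dashboard filter.
import duckdb

ROW_LIMIT = 100_000  # must match superset_config.py

con = duckdb.connect("pudl.duckdb")  # placeholder path
over_limit = con.execute(
    """
    SELECT state, COUNT(*) AS n_rows
    FROM out_eia__yearly_generators
    GROUP BY state
    HAVING COUNT(*) > ?
    ORDER BY n_rows DESC
    """,
    [ROW_LIMIT],
).fetchall()
print(over_limit)
```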
I also noticed some potentially helpful filter settings:
One of them caps the values shown in a filter at the `ROW_LIMIT` number of rows.

I think for this first round of user testing we should just manually create a handful of datasets and dashboards ourselves. Programmatically creating datasets, charts, and dashboards seems doable. We'll have to create some yaml templates, populate them with our table metadata, then import them to Superset using the API; a rough sketch of the template-population step is below.
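A sketch of what populating those templates could look like (the metadata source and yaml field names here are illustrative, and should be verified against an actual Superset dataset export before importing):

```python
# Sketch: generate one Superset dataset yaml per PUDL table from our
# table metadata. Field names are illustrative; verify against a real export.
import yaml

# Stand-in for PUDL's table metadata; in reality this would come from
# our metadata package.
tables = {
    "out_eia__yearly_generators": {
        "description": "Yearly generator attributes compiled from EIA data.",
        "columns": {"plant_id_eia": "EIA plant ID", "state": "Plant state"},
    },
}

for table_name, meta in tables.items():
    dataset = {
        "table_name": table_name,
        "schema": "main",
        "description": meta["description"],
        "columns": [
            {"column_name": name, "description": desc}
            for name, desc in meta["columns"].items()
        ],
    }
    with open(f"{table_name}.yaml", "w") as f:
        yaml.safe_dump(dataset, f, sort_keys=False)
```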
Yes! We can give permissions to the Public role. We can also assign dashboard-level access to the Public role. What Superset features would we like unregistered users to be able to see? I'm imagining this would be helpful for interesting dashboards we share on socials. I think we'll want to require registration for most features: SQL Lab, chart creation, data filtering/downloading.
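For what it's worth, Superset bootstraps the anonymous Public role's permissions from an existing role in `superset_config.py`; a minimal sketch:

```python
# Give the anonymous Public role the same baseline permissions as the
# built-in Gamma role; individual permissions (e.g. SQL Lab access) can
# then be removed from the Public role in the Security UI.
PUBLIC_ROLE_LIKE = "Gamma"
```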
I'm not sure how to do this, but it feels like it should be doable.
Would love for people to test out filtering a table chart on a Dashboard before we test this with users!
Poked around OpenMetadata.
Unsurprisingly, it focuses on answering the "do you have data that interests me?" question, but it does a good job of it, giving lots of high-level information about a dataset.
Seems like it also does a lot more than we really need. While our metadata is too big and complicated for a human brain to comprehend, it's not at the scale that a lot of these data catalog tools are designed for. The core thing we need is, I think:
Which we could get with Ctrl-F on our existing data dictionary, and replacing the "browse or query this table in Datasette" link with the other ones.
And we could incrementally improve that by injecting some Javascript to allow filtering of the view.
I guess we could do the "Superset Meta-dashboard", though I think that doesn't support full-text search either. You'd have to go into the data explorer to be able to filter by `table description ILIKE 'generation'`.
Thanks for looking into open metadata @jdangerx! I've also been poking around at the metadata tools. Let's create a separate issue or discussion to add our thoughts.
We'll need to scream at people about the 100k row limit, probably by enforcing at least one column as a filter on these charts (e.g., plant ID, year or state). And, we'll need to scream about where the download link is. But I think this dashboard does the trick.
Note that selecting "Show all" even for a year of data in this dashboard causes some major slowdowns, so we'll want to encourage fairly small slices.
In terms of chart settings:
Less important but how are people imagining we’ll use dashboards that don’t require registration?
I think this would be mostly a welcome page with some information about the project, links to the metadata, etc. We could show a few visualizations without the option to download (if this is possible) to encourage registration, but as an MVP I think this could mostly be pulled from our site and the draft dashboard I'd made some time ago?
@bendnorman I did some playing around with filtering on the dashboard, and overall it feels great! Here are a couple of thoughts I had while playing around:
- Might be nice to have a filter on a numerical column like `net_generation_mwh` just to get a sense for how that feels

As for the `row_limit` issues, I do think this could be pretty confusing for users given that there's no warning, and not even any obvious indication of the number of rows after applying filters. At bare minimum, I think we should have a prominent warning on the welcome page or something to inform users that they can't download a CSV with more than `row_limit` records.
I could see it being a kind of big problem if someone downloaded data thinking it was complete, then did analysis on that data and came up with totally erroneous results due to missing data.
- Might be nice to have a filter on a numerical column like `net_generation_mwh` just to get a sense for how that feels
I tried setting this up! You can "Add/Edit filters" on the dashboard and add a "numerical range" filter. This turns into a slider, without any options to type in specific range values. It's a "meh" experience. Fast, but you'll have to set the range slightly wider than you need, then download and re-filter locally, which is annoying.
- Does the dashboard filtering feel like a satisfactory UX for downloading CSVs without SQL? Check out this dashboard.
I think it's not good but good enough. We can funnel users with greater needs towards more advanced functionality.
Side note: while I believe people don't want to "skill up", I also think people are smart enough to modify a `SELECT col1, col2, col3 FROM table WHERE col1 <= 1 AND col2 > 2` template without "learning SQL." And that lets people filter things down to Excel-able size.
- Less important but how are people imagining we’ll use dashboards that don’t require registration?
I think we'd mostly be using them to show off our cool data & the tools you can build with it. Then people who are interested in building their own tools can register. I like Ella's point about the landing page being public too!
@zaneselvans thoughts from Slack:
I thought that the “Dataset” interface was good enough at CSV downloads to at least give it a shot with some users and see what they thought.
Being able to share links to a particular query output or visualization is really nice for collaboration and social media / embedding. But there’s also an advantage to us for people needing to log in, if it doesn’t mean lots of people bounce. Since then we get to know who they are. So I was imagining that the public would be view-only, no downloads, no fiddling with filters or queries. But I don’t think this is vital at the moment while we’re still experimenting.
Great feedback thanks y'all. I'll start creating some new issues for user testing.
I was able to add the row count as a Big Number to the dashboard and have it turn red if it goes over our row limit, which is another simple way to remind folks.
I think ideally we'd have a very clear download button that pops up a warning window if a user is trying to download more rows than our limit.
I also added a sample data dictionary to that dashboard.
Overview
We had some notion that we wanted to make a tool that would replace Datasette & allow access to all of our data, make charts & graphs easier, and let us gather some limited information about who our users are.
We've learned a lot by talking to many users over the last month - does that change anything about what our actual requirements are?
We've also learned about Superset & its strengths/limitations by prototyping it and testing it out over the last two months. We have a few questions to answer about its capabilities, and need to evaluate whether or not we want to continue trying to push it out to our users.
Success criteria
We also need answers for the following questions about Superset: