apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.07k stars 13.61k forks source link

Feature Request: Proposal for Improving Examples Interface, Organization and Storage #7571

Closed rjurney closed 3 years ago

rjurney commented 5 years ago

[SIP] Proposal for Improving Examples Interface, Organization and Storage

The goal of the changes in this proposal is to improve the examples capabilities of Superset so as to foster an ecosystem of examples which will sustain and grow as the platform continues to develop. First I will characterize the existing system of examples and then propose changes to improve the number and quality of examples.

Current Examples

Examples are currently programmatically defined in the superset.data module. An abstract interface summarizing these examples looks like the following:

from abc import ABC

class AbstractSupersetExample(ABC):
    """Defines interface through which superset examples load themselves."""

    def __init__(self, description):
        self.description = description

    def load_data(self):
        # Task 1: Load file and create pandas.DataFrame
        # Task 2: Load data into SQL with pandas.DataFrame.to_sql() 
        # Task 3: Process through ORM to get back workable Table object from whichever data source the table is in
        pass

    def create_metrics(self): 
        # Task 1: Build any TableColumns
        # Task 2: Build Metrics - SQLMetrics
        # Task 3: Store metrics in DB via ORM
        pass

    def create_charts(self, chart):
        # Task 1: Build chart from config/JSON
        # Task 2: Store to DB via - misc_dash_slices.add(slc.slice_name) / merge_slice(slc)
        pass

    def create_dashboards(self, name, config):
        # Task 1: Instantiate Dash via ORM
        # Task 2: Configure Dash via JSON
        # Task 3: Store to DB via ORM
        pass

While this mechanism jump started the collection of Superset examples, defining examples as code will not appeal to most Superset users of growing a community that contributes examples.

Current Dashboard Import/Export

Dashboard example creation could utilize the export feature by adding example oriented fields to the export. Dashboards can be exported via the Dashboard List interface at /dashboard/list/ via its Export action or at the command line via superset export_dashboards. Dashboard and chart export JSON includes everything needed to reproduce a dashboard save the actual data table: dashboard, chart and datasource information.

The datasources. __SqlaTable__.database element will need to be removed when examples are created and recreated when they are loaded to match SQLALCHEMY_EXAMPLES_URI or a —database-uri the user specifies. Each Slice’s datasource_id and datasource_name must be changed.

{
  "dashboards": [
    {
      "__Dashboard__": {
        "created_on": {
          "__datetime__": "2019-04-19T10:29:21"
        },
        "changed_by_fk": null,
        "slug": "world_health",
        "json_metadata": "{\"remote_id\": 1}",
        "description": null,
        "dashboard_title": "World's Bank Data",
        "changed_on": {
          "__datetime__": "2019-05-02T17:55:47"
        },
        "created_by_fk": null,
        "css": null,
        "position_json": {...},
  "id": 1,
  "slices": [
    {
      "__Slice__": {
        "created_by_fk": null,
        "cache_timeout": null,
        "params": "{\"compare_lag\": \"10\", \"compare_suffix\": \"o10Y\", \"country_fieldtype\": \"cca3\", \"date_filter\": false, \"entity\": \"country_code\", \"filter_configs\": [{\"asc\": false, \"clearable\": true, \"column\": \"region\", \"key\": \"2s98dfu\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}, {\"asc\": false, \"clearable\": true, \"column\": \"country_name\", \"key\": \"li3j2lk\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}], \"granularity_sqla\": \"year\", \"groupby\": [], \"limit\": \"25\", \"markup_type\": \"markdown\", \"metric\": \"sum__SP_POP_TOTL\", \"metrics\": [\"sum__SP_POP_TOTL\"], \"row_limit\": 50000, \"secondary_metric\": \"sum__SP_POP_TOTL\", \"show_bubbles\": true, \"since\": \"2014-01-01\", \"time_range\": \"2014-01-01 : 2014-01-02\", \"until\": \"2014-01-02\", \"viz_type\": \"filter_box\", \"where\": \"\", \"remote_id\": 371, \"datasource_name\": \"wb_health_population\", \"schema\": \"wb_health_population\", \"database_name\": \"main\"}",
        "datasource_name": null,
        "datasource_type": "table",
        "slice_name": "Region Filter",
        "changed_on": {
          "__datetime__": "2019-05-02T18:00:39"
        },
        "changed_by_fk": 1,
        "perm": "[main].[wb_health_population](id:2)",
        "description": null,
        "viz_type": "filter_box",
        "datasource_id": 2,
        "id": 371,
        "created_on": {
          "__datetime__": "2019-05-02T17:55:47"
        },
        "owners": []
      }
    }
  ],
  "datasources": [
    {
      "__SqlaTable__": {
        "created_on": {
          "__datetime__": "2019-04-19T10:29:20"
        },
        "sql": null,
        "cache_timeout": null,
        "changed_on": {
          "__datetime__": "2019-05-02T17:55:47"
        },
        "is_sqllab_view": false,
        "params": "{\"remote_id\": 2, \"database_name\": \"main\"}",
        "id": 2,
        "template_params": null,
        "perm": "[main].[wb_health_population](id:2)",
        "description": "<!--\nLicensed to the Apache Software Foundation...",
          "created_by_fk": null,
        "table_name": "wb_health_population",
        "default_endpoint": null,
        "changed_by_fk": null,
        "main_dttm_col": "year",
        "is_featured": false,
        "database_id": 1,
        "filter_select_enabled": true,
        "fetch_values_predicate": null,
        "offset": 0,
        "schema": null,
        "columns": [
          {
            "__TableColumn__": {
              "created_by_fk": null,
              "python_date_format": null,
              "is_dttm": false,
              "description": null,
              "groupby": false,
              "type": "FLOAT",
              "verbose_name": null,
              "id": 4,
              "changed_by_fk": null,
              "created_on": {
                "__datetime__": "2019-04-19T10:29:20"
              },
              "table_id": 2,
              "filterable": false,
              "expression": "",
              "database_expression": null,
              "is_active": true,
              "column_name": "NY_GNP_PCAP_CD",
              "changed_on": {
                "__datetime__": "2019-04-19T10:29:20"
              }
            }
          },
            ...
          ],
        "metrics": [
          {
            "__SqlMetric__": {
              "created_by_fk": null,
              "table_id": 2,
              "d3format": null,
              "description": null,
              "metric_name": "sum__SP_POP_TOTL",
              "changed_on": {
                "__datetime__": "2019-04-19T10:29:20"
              },
              "metric_type": null,
              "changed_by_fk": null,
              "expression": "sum(SP_POP_TOTL)",
              "warning_text": null,
              "is_restricted": false,
              "verbose_name": null,
              "id": 3,
              "created_on": {
                "__datetime__": "2019-04-19T10:29:20"
              }
            }
          },
            ...
          ],
          "database": {
          "__Database__": {
            "allow_csv_upload": true,
            "verbose_name": null,
            "created_by_fk": null,
            "allow_ctas": false,
            "database_name": "main",
            "changed_by_fk": null,
            "allow_dml": false,
            "sqlalchemy_uri": "sqlite:////Users/rjurney/.superset/superset.db",
            "force_ctas_schema": null,
            "password": null,
            "allow_multi_schema_metadata_fetch": false,
            "cache_timeout": null,
            "created_on": {
              "__datetime__": "2019-04-19T10:28:59"
            },
            "extra": "{\n    \"metadata_params\": {},\n    \"engine_params\": {},\n    \"metadata_cache_timeout\": {},\n    \"schemas_allowed_for_csv_upload\": []\n}\n",
            "select_as_create_table_as": false,
            "perm": "[main].(id:1)",
            "changed_on": {
              "__datetime__": "2019-04-19T10:28:59"
            },
            "expose_in_sqllab": true,
            "impersonate_user": false,
            "id": 1,
            "allow_run_async": false
          }
        }
      }
    },

A Slice object has a params_dict which contains the following. Note that this includes references to the datasource_name wb_health_population.

 {'compare_lag': '10',
  'compare_suffix': 'o10Y',
  'country_fieldtype': 'cca3',
  'entity': 'country_code',
  'granularity_sqla': 'year',
  'groupby': ['region', 'country_code'],
  'limit': '25',
  'markup_type': 'markdown',
  'metric': 'sum__SP_POP_TOTL',
  'metrics': ['sum__SP_POP_TOTL'],
  'row_limit': 50000,
  'secondary_metric': 'sum__SP_POP_TOTL',
  'show_bubbles': True,
  'since': '1960-01-01',
  'time_range': '2014-01-01 : 2014-01-02',
  'until': 'now',
  'viz_type': 'treemap',
  'where': '',
  'remote_id': 708,
  'datasource_name': 'wb_health_population',
  'schema': 'wb_health_population',
  'database_name': 'main'}]

Example Components

A Superset example is a SQL oriented dashboard and is composed of the following:

* Physical database with tables loaded with a dataset
* Superset Datasource object
* Table object
    * Column objects
    * Metric objects
* Dashboard object
* One or more chart objects

All of the above with the exception of the Datasource.Database entry will need to be serialized, stored, contributed, approved, listed, deserialized and loaded by the example system. The Datasource.Database entry will need to be removed on exporting and replaced on importing of examples.

Scope of Improvement

This proposal improves the superset example process in three areas: example creation, data storage and discoverability.

In order to improve the range and quality of Superset examples we need to first improve the process for creating and loading examples. While examples can be created programmatically, the more natural process is to use Superset to create them. This requires that we automate the process to persist and restore the combined state of the Superset Dashboard, Database and related objects as well as the contents of the datasource itself.

We also need a directory to which examples can be uploaded and a corresponding user interface and process of governance over that repository. This directory should be independent of the Superset project release process and code repository. Current processes for management of Superset’s code assets would transition directly to the management of its examples: changes would be created by creating Github issues and pull requests, data assets would be versioned and managed in a central repository.

Finally we need a user interface for finding, listing and loading examples from the repository. It should be simple and can exist as an examples command as part of the superset CLI which will have export, list , import and remove sub-commands.

Example Repository Requirements

The requirement for example storage are that it have the following properties:

Git and Github are a desirable mechanism for publishing and approval but an undesirable mechanism for storage. Git LFS (Large File System) offers scalable storage while still using Github for project management. With a 2GB file limit and support on Github for 250 of these files, it scales well and is the proposed storage system. Other options are explored in the addendum.

Example File Format

Examples should be defined and packaged in a standard manner and each example should be self contained in its own file system directory. The existing Dashboard export format adequately describes a dashboard, it’s charts and the associated datasources but is missing human readable fields describing the contents of the dataset and dashboard as well as the physical location of the contents of the tables the datasource metadata describes. These fields will be added to the Superset dashboard export format.

Data location information will be stored in a top level files key next to the existing dashboards, slices and datasources keys. A top level description field will fill out the fields of a description of the dataset in the examples directory. The existing World’s Bank Data example is extended below:

{
    “dashboards”: [ ... ],
    “datasources”: [ ... ],
    “description”: {
        "created_at": "2019-05-20T16:20:24.883125",
        "description": "World Bank Data example about world health populations from 1960-2010.",
        "file_count": 1,
        "license": "Apache 2.0"
        "title": "World Bank Health Information",
        "total_rows": 11770,
        "total_size": 22561353,
        "total_size_mb": 21.52,
    },
    “files”: [ 
        {
            "file_name": "wb_health_population.csv.gz",
            "rows": 11770,
            "size": 22561353,
            "table_name": "wb_health_population"
        }
    ],
    “slices”: [ ... ]
}

The file layout for this example appears as follows, with the dashboard slug used as the directory name in the exported tarball and examples directory:

/
/world_health
/world_health/dashboard.json
/world_health/wb_health_population.csv.gz

Example Data Table Format

In order to manage tables, to create and drop them, it is helpful to assume that an Integer id primary column is present. This is the case for all current example dashboard tables. In the future we may want to support tables with uuid or other types of primary column.

New or Changed Public Interfaces

Changes include the addition of a SQLALCHEMY_EXAMPLES_URI and EXAMPLES_GIT_TAG configuration keys and changes to the model classes as well as the CLI.

The examples-data Repository

Currently the example data is on Github at apache-superset/examples-data. This will continue to be the case, but this repository will now house both Dashboard metadata files as well as data files via Git LFS. Each example will have its own directory with its own dashboard.json and data files.

The README.md for this repository in new new form can be accessed here: GitHub - rjurney/examples-data at lfs.

SQLALCHEMY_EXAMPLES_URI Configuration Key

A SQLALCHEMY_EXAMPLES_URI configuration key in superset/config.py controls the default location to load examples into. This defaults to ~/.superset/examples.db and can be over-ridden on a per-import basis using the --database-uri/-d option.

EXAMPLE_REPOS_TAGS Configuration Key

A EXAMPLE_REPOS_TAGS configuration key in superset/config.py controls the locations of the examples from which to list and load. This can be set manually using the --examples-repo/-r option. The format of the items are a tuple containing the full repository name (ex. apache-superset/examples-data) and the git tag/branch of the repository to use (ex. master).

In config.py the default entry will look like:

# Tuple format: Gitub repo full name, tag/branch
EXAMPLE_REPOS_TAGS = [
    ('rjurney/examples-data', 'v0.0.3')
]

GITHUB_AUTH_TOKEN Configuration Key

Github rate limits the contents API to 50 anonymous requests per hour. While this is unlikely to affect many users, the limit is by IP address which means users behind proxies or developers may sometimes encounter this. I have added the optional configuration key GITHUB_AUTH_TOKEN which provides a way to add a personal access token to requests from the examples sub-commands. This increases the API limit.

UUIDs via sqlalchemy.types.uuid.UUIDType

In order to export or import assets in a way that doesn’t result in integer primary key chaos, we require that each serialized asset have a unique identifier. The superset.models.helpers.ImportMixin class has been used to provide a uuid field to the following classes:

This required patching FlaskAppBuilder to support UUIDType as a String field type. This will be released with FlaskAppBuilder 2.1.4.

ImportMixin —> ImportExportMixin

I was confused by the role of ImportMixin in model class export, so accordingly I have renamed it to ImportExportMixin.

Command Line Interface

Example capabilities will be accessed via the command line (CLI) interface. The CLI will be changed, removing the load_examples command and replacing it with an examples subcommand with export, list, import and remove commands beneath it.

Top Level CLI Menu

superset —help

Usage: superset [OPTIONS] COMMAND [ARGS]...

  This is a management script for the Superset application.

Options:
  --version  Show the flask version
  --help     Show this message and exit.

Commands:
  db                        Perform database migrations.
  examples                  Manages example chart/dashboards/datasets
  ...
Dashboard Exports Menu

The dashboard exports menu will be extended to add the —dashboard-titles/-t, —export-data/-x and —export-data-dir/-d options which facilitate example export.

superset export_dashboards --help

Usage: superset export_dashboards [OPTIONS]

  Export dashboards to JSON

Options:
  -f, --dashboard-file TEXT    Specify the the file to export to
  -p, --print_stdout           Print JSON to stdout
  -i, --dashboard-ids INTEGER  Specify dashboard id to export
  -t, --dashboard-titles TEXT  Specify dashboard title to export
  -x, --export-data            Export the dashboard's data tables as CSV
                               files.
  -d, --export-data-dir TEXT   Specify export directory path. Defaults to
                               '/tmp'.
  --help                       Show this message and exit.
Examples Top Level Menu

superset examples —-help

Usage: superset examples [OPTIONS] COMMAND [ARGS]...

  Manages example dashboards/datasets

Options:
  --help  Show this message and exit.

Commands:
  export  Export example dashboard/datasets
  list    List example dashboards/datasets
  import  Import an example dashboard/dataset
  remove  Remove an example dashboard/dataset
Example Creation Menu

The examples creation command can be used to export a Dashboard JSON file along with its underlying data tables into a gzipped tarball file. These assets can then be uncompressed in the examples-data project, committed, pushed and then submitted by pull request.

superset examples export --help

Usage: superset examples export [OPTIONS]

  Export example dashboard/datasets

Options:
  -i, --dashboard-id INTEGER  Specify dashboard id to export
  -t, --dashboard-title TEXT  Specify dashboard title to export
  -d, --description TEXT      Description of new example  [required]
  -e, --example-title TEXT    Title for new example  [required]
  -f, --file-name TEXT        Specify export file name. Defaults to
                              dashboard.tar.gz
  -l, --license TEXT          License of the example dashboard
  --help                      Show this message and exit.
Examples List Menu

The examples list command will query the examples-data repository and return a list of available examples along with their metadata. Examples can then be loaded from this list.

superset examples list —-help

Usage: superset examples list [OPTIONS]

  List example dashboards/datasets

Options:
  -r, --examples-repo TEXT  Full name of Github repository containing
                            examples, ex: 'apache-superset/examples-data'
  -r, --examples-tag TEXT   Tag or branch of Github repository containing
                            examples. Defaults to 'master'
  --help                    Show this message and exit.

The output looks table uses prettytable and looks like this:

+-------------------------------+-------------------------------------------------------+-----------+-------+-------+------------------+-----------------------+--------+
|             Title             |                      Description                      | Size (MB) |  Rows | Files |   Created Date   |       Repository      |  Tag   |
+-------------------------------+-------------------------------------------------------+-----------+-------+-------+------------------+-----------------------+--------+
| World Bank Health Information | World Bank Data example about world health populat... |   21.52   | 11770 |   1   | 2019-05-20T16:20 | rjurney/examples-data | v0.0.3 |
+-------------------------------+-------------------------------------------------------+-----------+-------+-------+------------------+-----------------------+--------+
Examples Import Menu

The examples import command will download example metadata and data files from the examples-data repository and will load them into the examples database configured via the SQLALCHEMY_EXAMPLES_URI configuration key or using the value supplied by the —database-uri/-d option.

superset examples load --help

Usage: superset examples load [OPTIONS]

  Load an example dashboard/dataset

Options:
  -d, --database-uri TEXT       Database URI to load example to
  -r, --examples-revision TEXT  Revision of examples to list
  -e, --example-title TEXT      Title of example to load  [required]
  --help                        Show this message and exit.
Examples Remove Menu

The examples remove command will remove the installed example specified from the metadata tables as well as the examples database configured via the SQLALCHEMY_EXAMPLES_URI configuration key or using the value supplied by the —database-uri/-d option.

superset examples remove --help

Usage: superset examples remove [OPTIONS]

  Remove an example dashboard/dataset

Options:
  -e, --example-title TEXT  Title of example to remove  [required]
  -d, --database-uri TEXT   Database URI to load example to
  --help                    Show this message and exit.

New Dependencies

Creating, removing, listing and loading examples can be handled without Git LFS but adding examples to the examples-data Github repository will require it. This is a developer only requirement of the examples-data project and not Superset itself.

Git LFS can be installed via:

# OS X
brew install git-lfs
# port install git-lfs
git lfs install
# Ubuntu
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

Migration Plan and Compatibility

Backwards compatibility will be maintained. Existing dashboard export JSON files will continue to work and all existing dashboard examples and their data files will be ported to the new system and stored in the Superset examples repository.

Cross-Repository Management

As both the examples and Superset evolve, some will work with newer versions of Superset than others. We must strive to keep all of them up to date, but should also try to make them backwards compatible. It will thus be inevitable that incubator-superset releases will have to point to a branch/tag of superset-examples.

A given release of superset must reference a certain version of the examples-data repository. This is achieved via the EXAMPLES_GIT_TAG configuration key. Alternatively, this could be a branch rather than a tag to facilitate the ongoing update of examples.

issue-label-bot[bot] commented 5 years ago

Issue-Label Bot is automatically applying the label #enhancement to this issue, with a confidence of 0.94. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

mistercrunch commented 5 years ago

Notes:

rjurney commented 5 years ago

The other parts are addressed in the PR and SIP.

rusackas commented 3 years ago

I'm closing this for a few reasons: • Some of it (import/export, UUIDs) has already been tackled. • It's been open for a long time, without being brought to a DISCUSS thread • It's pretty broad in scope, and may need to be broken down into smaller pieces if we want to carry it through.

@rjurney if you'd like to reopen this, update a little of the context, and perhaps break down the into smaller chunks for discussion/voting/implementation, just say the word, and I/we can re-open it! Thank you for all the hard work and thought that has gone into this, it'll definitely serve as a useful reference for work going forward in this area either way!