apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.94k stars 13.93k forks source link

[SIP-147] A simpler chart API for dashboards #30816

Open betodealmeida opened 3 weeks ago

betodealmeida commented 3 weeks ago

[SIP] A simpler chart API for dashboards

Motivation

Historically, the Apache Superset dashboard has been built on top its interactive, no-code, chart builder (”Explore”). While it makes sense to reuse the chart API endpoints for fetching the data for charts in dashboards, in order to make development faster, at the end of the day these are two different workflows: one is an interactive chart creator, the other is a mostly passive chart consumer.

This distinction has left considerable technical debt behind. In order for a dashboard to load a given chart it needs to perform a POST request to the the chart API, submitting a request payload that includes the query and form_data needed by the endpoint. This data is redundant, since it’s already available to the backend (in fact, it needs to be fetched from the backend before the POST) and attached to the chart ID (which is also passed in the request). This causes the following problems:

  1. Dashboards need to load the full data for all of its charts, instead of just their IDs. This can cause noticeable latency (in our case we’ve observed a latency of ~400ms for larger dashboards, and of course this could potentially be much higher).
  2. For each chart, a POST request is done sending back the data that was just provided by the backend. Not only this is unnecessary, but it opens an attack vector where users can modify the form data in order to fetch unauthorized data — for example, in an embedded dashboard. While this should never be possible by design, there were multiple CVE reports exploiting this channel.
  3. While the POST requests still benefit from the Superset cache (if one is configured), they won’t benefit from other caches along the way (reverse proxies, browser cache). Replacing some of these with GET requests will provide for better dashboard performance and fewer costs, since the caching can be done upstream, as long as response headers are correctly returned by Superset.

Proposed change

Currently, in order to retrieve the data needed to render a chart with ID 74, a dashboard with ID 8 will issue a request like this:

POST https://superset/api/v1/chart/data?form_data={"slice_id":74}&dashboard_id=8

The request payload is not small:

{
  "datasource": {
    "id": 19,
    "type": "table"
  },
  "force": false,
  "queries": [
    {
      "filters": [
        {
          "col": "time_start",
          "op": "TEMPORAL_RANGE",
          "val": "No filter"
        }
      ],
      "extras": {
        "having": "",
        "where": ""
      },
      "applied_time_extras": {},
      "columns": [
        "first_time_developer"
      ],
      "metrics": [
        "count"
      ],
      "orderby": [
        [
          "count",
          false
        ]
      ],
      "annotation_layers": [],
      "row_limit": 100,
      "series_limit": 0,
      "order_desc": true,
      "url_params": {
        "native_filters_key": "UrzGasKJgA6DRj9-qiGtBuAiT0qEW516UqzMPUNLsq9fqY-eS4REkBL_MEcLnYW1"
      },
      "custom_params": {},
      "custom_form_data": {}
    }
  ],
  "form_data": {
    "datasource": "19__table",
    "viz_type": "pie",
    "slice_id": 74,
    "url_params": {
      "native_filters_key": "UrzGasKJgA6DRj9-qiGtBuAiT0qEW516UqzMPUNLsq9fqY-eS4REkBL_MEcLnYW1"
    },
    "groupby": [
      "first_time_developer"
    ],
    "metric": "count",
    "adhoc_filters": [
      {
        "clause": "WHERE",
        "comparator": "No filter",
        "expressionType": "SIMPLE",
        "operator": "TEMPORAL_RANGE",
        "subject": "time_start"
      }
    ],
    "row_limit": 100,
    "sort_by_metric": true,
    "color_scheme": "supersetColors",
    "show_labels_threshold": 5,
    "show_legend": true,
    "legendType": "scroll",
    "legendOrientation": "top",
    "label_type": "key",
    "number_format": "SMART_NUMBER",
    "date_format": "smart_date",
    "show_labels": true,
    "labels_outside": true,
    "outerRadius": 70,
    "innerRadius": 30,
    "dashboards": [
      8
    ],
    "extra_form_data": {},
    "label_colors": {},
    "shared_label_colors": {
      "No Answer": "#1FA8C9",
      "Yes": "#454E7C",
      "No": "#5AC189"
    },
    "extra_filters": [],
    "dashboardId": 8,
    "force": false,
    "result_format": "json",
    "result_type": "full"
  },
  "result_format": "json",
  "result_type": "full"
}

Again: not only the payload is relatively big, but it’s also completely unnecessary, since it has to be requested from the backend before the POST request is made.

If the dashboard has native filters in place, they will be passed as additional filters in the request payload (in bold below):

{
  "datasource": {
    "id": 19,
    "type": "table"
  },
  "queries": [
    {
      "filters": [
        {
          "col": "highest_degree_earned",
          "op": "IN",
          "val": [
            "A. No high school (secondary school)"
          ]
        },
        {
          "col": "time_start",
          "op": "TEMPORAL_RANGE",
          "val": "No filter"
        }
      ],
    }
  ],
  ...
  "form_data": {
    "datasource": "19__table",
    "slice_id": 74,
    ...
    "extra_form_data": {
      "filters": [
        {
          "col": "highest_degree_earned",
          "op": "IN",
          "val": [
            "A. No high school (secondary school)"
          ]
        }
      ]
    }
  }
}

This SIP proposes that chart requests without any applied native filters are to be replaced with GET requests to a new API endpoint:

GET https://superset/api/v1/dashboard/8/chart/74/data

This has multiple benefits:

  1. The chart payload doesn’t need to be downloaded before the request, since with the change only the IDs of the chart in a given dashboard are needed.
  2. The request payload cannot be modified by malicious users.
  3. If the resource is cached by the backend, the API can set the correct HTTP headers so that the browser caches the response. During the lifetime of the cache the browser won’t need to do any network requests at all.

Uses are still able to force-refresh the chart, in which case a simple POST request without a request payload is issued:

POST https://superset/api/v1/dashboard/8/chart/74/data

When the dashboard has native filters applied to the chart, a POST is necessary. In that case, only the additional configuration is passed. For filters:

{
  "queries": [
    {
      "filters": [
        {
          "col": "highest_degree_earned",
          "op": "IN",
          "val": [
            "A. No high school (secondary school)"
          ]
        }
      ]
    }
  ]
}

Or a custom time grain:

{
  "queries": [
    {
      "extras": {
        "time_grain_sqla": "P1M",
        "having": "",
        "where": ""
      }
    }
  ]
}

The backend will then be responsible for merging the request with the saved chart configuration, in order to produce and return the requested data.

New or Changed Public Interfaces

A new API endpoint will be introduced, as described above.

New dependencies

None.

Migration Plan and Compatibility

The original endpoint should remain, both for compatibility with existing clients but mainly because it’s still used by the no-code chart builder (”Explore”). A new feature flag, DASHBOARD_SIMPLE_CHART_API, will be introduced, so that the feature can be tested initially until it’s mature enough to become the default workflow. Eventually, the feature flag should be removed and only the new proposed API should be used.

Rejected Alternatives

This proposal has some overlap with the CLIENT_CACHE feature introduced by the author in Superset 0.34 and removed in 4.0.0, though it’s conceptually simpler since it doesn’t introduce an explicit Javascript caching layer.

Since the proposal is mostly concerned with removing technical debt, as opposed to implementing a new feature, there were no other alternatives considered. Doing GET requests for stored data and POSTing only as much data as needed are aligned with best practices in REST APIs.

We discussed moving the dashboard filters to the URL and use only GET requests, but since most browsers have a limit of ~2000 characters in the URL this would break if too many filters were in place.

rusackas commented 2 weeks ago

Thanks! Please launch a DISCUSS thread on the Dev mailing list if you want to carry this forward ;)

villebro commented 2 weeks ago

I think the simplification makes sense. However, the issue I see with this approach is the fact that the request won't be rendered by the chart plugin, which was the intention of SIP-6. While we do currently store the query context in the chart table as part of the chart metadata, this is in a sense a cached version of what the plugin rendered when the chart was saved. Relying on this cached value makes it impossible to roll out changes to a viz plugin's buildQuery that may have been introduced during a version upgrade, and could lead to dashboards breaking post upgrade. In fact, it can be argued that the introduction of the query_context field to the chart metadata was in fact technical debt that contradicts SIP-6. So as part of this SIP, I feel we should revisit that SIP first, and think carefully of how we can facilitate an architecture, where

CC @michael-s-molina @rusackas as this touches on the plugin architecture discussion we've been having.

michael-s-molina commented 2 weeks ago

I agree with both of you @betodealmeida @villebro. I think the motivation for this SIP is super valid, the way we retrieve dashboard's data is highly inefficient. I also agree that the solution proposed will go against the principles of SIP-6 and would require a revisit of the plugin/server architecture. We are actively discussing a complete redesign of the plugin architecture of Superset. I believe this will be a main focus for us next year so my suggestion would be to wait for the redesign to avoid rework.

betodealmeida commented 2 weeks ago

Argh, good point.

We're going to need that node.js sidecar, aren't we?

villebro commented 2 weeks ago

Yeah, either that, or then we change the architecture to by moving query construction to the backend. However, I suspect that will be a pretty big change, so it's likely not going to happen very soon.