apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.84k stars 13.54k forks source link

Cumsum treats null values as zeros #27245

Open JohnDietrich-Pepper opened 6 months ago

JohnDietrich-Pepper commented 6 months ago

Bug description

Fix #26429 broke our charting because it now treats null values as zeros for the sake of the cumulative sum function. This causes an issue where future time periods that have yet to occur will have data and gaps in data are now being filled instead of showing as gaps.

When the data is null it should be reflected as null NOT zero. If the user wants to fill in or interpolate here that should be a separate option like in other reporting platforms and Excel. It is also easier for the user to insert a IfNull into their query than it is to replicate cumulative sum in a query, particularly if they need to show two different variables with two different rolling functions.

image

How to reproduce the bug

Create a line chart and use the cumulative sum function.

Screenshots/recordings

No response

Superset version

3.1.1

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

villebro commented 6 months ago

@JohnDietrich-Pepper I see the issue.. When implementing this fix in #26429 , we didn't consider the case where values are expected to terminate early, only how to deal with interim values.

I wonder if there's some middle ground here? For instance, automatically interpolating zeros until the last defined value, and then leaving the rest null? This would ensure continuous series, but also make sure they terminate when expected. While I understand the problem described in this issue, the problem described in #21093 is IMO also quite problematic, and likely more common than this one..

JohnDietrich-Pepper commented 6 months ago

@JohnDietrich-Pepper I see the issue.. When implementing this fix in #26429 , we didn't consider the case where values are expected to terminate early, only how to deal with interim values.

I wonder if there's some middle ground here? For instance, automatically interpolating zeros until the last defined value, and then leaving the rest null? This would ensure continuous series, but also make sure they terminate when expected. While I understand the problem described in this issue, the problem described in #21093 is IMO also quite problematic, and likely more common than this one..

@villebro That would probably work but I dispute the notion that #21093 is problematic. The issue in #21093 can be solved in the query with a IFNULL and the solution is disingenuous to the end user because it implies data is in a particular X value when the reality is no data is present. Consider a case of a sensor (for a car, weather, etc...) there is a big difference between the sensor returned no data and the sensor returned a zero value. The solution takes the position that 0 and null are the same.

Real world example. In my day job one of the data sets my data engineering team works with is sensor data that comes off commercial airline flights looking for anomalies in mechanical parts. If we had implemented Superset we would have been in a world of hurt with this change because we need to track the cumulative sum of defects and also periods when the sensor is offline. The original view is correct and what we have implemented in our reporting platform (Tableau)

JohnDietrich-Pepper commented 5 months ago

@villebro Any updates?

villebro commented 5 months ago

@villebro That would probably work but I dispute the notion that #21093 is problematic. The issue in #21093 can be solved in the query with a IFNULL and the solution is disingenuous to the end user because it implies data is in a particular X value when the reality is no data is present. Consider a case of a sensor (for a car, weather, etc...) there is a big difference between the sensor returned no data and the sensor returned a zero value. The solution takes the position that 0 and null are the same.

Actually I'm not so sure using an IFNULL would work here. The whole reason why these gaps appear is because the particular combination of grouping keys doesn't exist. Adding an IFNULL(0) would still leave a NULL value due to the pivot operation. Note, that there has been talk of doing away with the pivot operation for these charts, and that would likely make it easier to offer richer customization options. However, this requires very heavy redesigning of the charting system, and will likely not happen any time soon.

Real world example. In my day job one of the data sets my data engineering team works with is sensor data that comes off commercial airline flights looking for anomalies in mechanical parts. If we had implemented Superset we would have been in a world of hurt with this change because we need to track the cumulative sum of defects and also periods when the sensor is offline. The original view is correct and what we have implemented in our reporting platform (Tableau)

While I understand your need for terminating the cumulative graph, I'm not convinced the common expectation is to see gaps in a cumulative graph where values are missing. I fired a quick "cumulative graph" in Google, and picked the first hit:

image

source: https://blogs.sas.com/content/iml/2020/03/25/read-cumulative-frequency-graph.html

I feel this is a very typical cumulative distribution chart, and leaving gaps in it would IMO not be the expected result. Or maybe I'm misinterpreting what you're saying.

Having said all this, I would still like to see if we can somehow achieve what you're after. Would you be able to share some redacted screenshots of the Tableau chart you mentioned so I can understand how the expected chart should look?