Consider new format for passing data in

benmccann commented 4 years ago

Splitting off this discussion from https://github.com/chartjs/Chart.js/pull/6576 where it started.

A few goals:

Be able to skip parsing if the user passes in data in the desired format. E.g. if the user passes in data as numbers (not string or moment) then we should be able to use that data directly. This currently takes about 7% of the time on the uPlot benchmark, but should be able to be skipped
Make determineDataLimits (and getAllTimestamps for the time scale) faster on the x axis. We should just be able to look at the first and last data point. Right now we have to spend substantial time going through each data point of each data series looking for the min and max. This currently takes about 20% of the time on the uPlot benchmark, but should be instantaneous

We could use an array of objects. This option is a bit more self-describing and better handles sparse inputs.

data: [
 {x: 1572981786, y1: 12, y2: 33, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572591743, y1: 33, y2: 22, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572161732, y1: 11, y2: 11, o: 23.04, h: 24.01, l: 22.08, c: 23.01}
],
series: [
  {
    type: 'line',
    label: 'foo',
    data: { y: 'y1' },
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: { y: 'y2' },
    order: 5
  },
  {
    type: 'financial', // not specifying data here because using the controller default names
    label: 'voo',
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

Or an array of arrays. This option would be smaller data transfer over the wire. It also would avoid the user having to care about names (e.g. keeping track of an incrementer to create 'y1', 'y2', etc. and possibly parsing them back out)

data: [
 [1572981786, 12, 33, 23.04, 24.01, 22.08, 23.01],
 [1572591743, 33, 22, 23.04, 24.01, 22.08, 23.01],
 [1572161732, 11, 11, 23.04, 24.01, 22.08, 23.01]
],
series: [
  {
    type: 'line', // not specifying data here because defaulting to x: 0 and y: 1
    label: 'foo',
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: { y: 2 },
    order: 5
  }
  financial: {
    type: 'financial',
    label: 'voo',
    data: { o: 3, h: 4, l: 5, c:6 },
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

I was checking out a couple other libraries to see what they do. HighCharts, GoogleCharts, and uPlot (the names that were at the top of my head) all appear from my cursory glance to want data as an array of arrays and have parsing completely separated from controllers and scales as a step that happens before chart creation. HighCharts and GoogleCharts both give tools to manage parsing and other data manipulations.

leeoniya commented 4 years ago

all appear from my cursory glance to want data as an array of arrays

it's worth noting that uPlot's data format is columnar rather than record/row-based so that it does not require allocating 1e6 tiny arrays for 1e6 datapoints, and avoids duplicating x values for each series by requiring them all to be x-coalesced. it can support a dense format like this because it's strictly a line plotter. if it had to support something like scatter plots, i'm not sure the format could retain its current density/efficiency.

EDIT: i'd probably use a similarly dense array for scatter, with a max of 5 arrays per series:

[
  [
    [x1,x2,x3,x4],
    [y1,y2,y3,y4],
    [v1,v2,v3,v4],
    [label1,label2,label3,label4],
  ],
  [
    [x5,x6,x7],
    [y5,y6,y7],
    [v5,v6,v7],
    [label5,label6,label7],
  ]
]

kurkle commented 4 years ago

It might be more intuitive to use map (or something else) instead if data in series:


data: [
 {x: 1572981786, y1: 12, y2: 33, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572591743, y1: 33, y2: 22, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572161732, y1: 11, y2: 11, o: 23.04, h: 24.01, l: 22.08, c: 23.01}
],
series: [
  {
    type: 'line',
    label: 'foo',
    map: { y: 'y1' },
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    map: { y: 'y2' },
    order: 5
  }
  financial: {
    type: 'financial', // not specifying data here because using the controller default mapping
    label: 'voo',
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

benmccann commented 4 years ago

map is a bit generic. Maybe we could find a slightly more descriptive term?

Passing data in column-wise is a pretty interesting idea. I imagine that would also make better use of memory locality.

Passing in data at the controller / series level probably better supports the sparse scatter chart case. It would also better support the case where there are multiple x-axes (it's unclear to me if that's an actual usecase that would occur, but I suppose it's slightly more flexible)

const data = [
 [1572981786, 1572591743, 1572161732],
 [12, 33, 11],
 [33, 22, 11],
 [23.04, 23.04, 23.04],
 [22.08, 22.08, 22.08],
 [23.01, 23.01, 23.01],
];

new Chart({series: [
  {
    type: 'line',
    label: 'foo',
    data: [
      [1572947586, 1572594965],
      [42, 31],
    ],
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: data,
    input: { y: 2 },
    order: 5
  },
  {
    type: 'financial',
    label: 'voo',
    data: data,
    input: { o: 3, h: 4, l: 5, c:6 },
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}});

benmccann commented 4 years ago

Another benefit of passing data in column-wise is that scales are much better able to utilize it. E.g. the time scale needs an array of the index values. This would be trivial if data were passed in column-wise

benmccann commented 4 years ago

Another thought that I had on this is that, now that parsing is confined solely to the controller, each controller could specify their own format. I think for the core library we could probably use one format exclusively, but if someone wanted to build a custom controller I think it would probably be possible for them to specify their own data format

kurkle commented 4 years ago

In many use cases, data comes from db, where is is usually in rows. Filling missing values can be annoying. Plotting two series, one with 1 ms precision and other 1 month precision to same chart with columnar data would be a quite silly (assuming both are configured in same array of arrays).

So I don't quite like the columnar data spec for this kind of configuration diversity Chart.js already offers.

leeoniya commented 4 years ago

you don't technically need to go whole-hog-columnar like uPlot does. there's still a good amount of savings by doing columnar-per-series (as the scatter variant i pondered above: https://github.com/chartjs/Chart.js/issues/6696#issuecomment-550072171).

In many use cases, data comes from db, where is is usually in rows.

this is definitely true. but even if you have to spend a bit of time and mem to convert the data up-front, you could potentially release the original structure and have a significantly smaller retained mem footprint. however, this mainly has an out-sized effect on data-heavy line & time-series charts, with scatter a far second, and the rest a very distant third.

benmccann commented 4 years ago

I tend to think that keeping datasets at the series level would be easier for the user, independent of whether we choose to go columnar or object/rows. E.g. if you have a dataset with month precision and a dataset with year precision and want to plot them both, they're not necessarily in the same dataset. So to have a single dataset at the chart level you'd have to do some type of join operation (Google Charts actually provides a utility for doing joins).

If we do decide to go with per-controller datasets then you don't have to worry about filling missing values, etc. so there's not much downside to columnar datasets

kurkle commented 4 years ago

Actually we are kind of moving away from columnar already, but still support it:

labels: ['a','b','c'],
datasets: [{ data: [1,2,3]}, {data: [4,5,6]}]

is equivalent to:

data: [
['a','b','c']
[1,2,3]
[4,5,6]
]

@leeoniya consider a regression line on a typical uPlot. its not efficiently stored.

benmccann commented 4 years ago

The second example you just gave I think is much nicer from a user perspective because you can send a single data structure back from the server and don't need to deconstruct it and pass the input to the right places

I think the reason you're suggesting uPlot doesn't store regression lines efficiently is that it must specify data for every index, but really you need only two points to store a line. However, if the dataset is passed in at the series-level, so that you can still have multiple datasets then I think columnar formats do not have this drawback. For the regression you could just pass a single 2x2 array: [[x1,x2],[y1,y2]]

leeoniya commented 4 years ago

in my view, regression lines are neither data nor a series; they're more akin to annotations, so i would not personally store them alongside the actual data anyways - they'd be drawn separately if e.g. r2: "linear" was specified in the series' config.

uPlot actually bounced back and forth between storing the data inside vs outside the series config before settling on keeping it separate basically for the reason that @benmccann mentions - you can stream data from the server and pass it directly to some .setData() in the same format as provided initially without having to wrap it in a series config struct.

kurkle commented 4 years ago

you can stream data from the server and pass it directly to some .setData() in the same format as provided initially without having to wrap it in a series config struct.

I kind of agree.

I think @benmccann is getting in good direction in https://github.com/chartjs/Chart.js/issues/6696#issuecomment-553138807.

I'd extend a bit more (you can still use array and indices instead of object and properties):

const data = {
 time: [1572981786, 1572591743, 1572161732],
 time2: [1572981786, 1572161732],
 bar: [12, 33, 11],
 line: [1, 2],
 o: [33, 22, 11],
 h: [23.04, 23.04, 23.04],
 l: [22.08, 22.08, 22.08],
 c: [23.01, 23.01, 23.01],
 scatter: [{x: 10, y: 20}, {x: 20, y: -2}]
};

new Chart({
  data: data,
  series: [
  {
    type: 'line',
    data: [
      [1572947586, 1572594965],
      [42, 31],
    ],
    input: { x: 0, y: 1}
  }, 
  {
    type: 'line',
    data: [
      x: [1572947586, 1572594965],
      y: [42, 31],
    ],
  }, 
  {
    type: 'line',
    data: [{x: 1572947586, y: 42}, { x: 1572594965, y: 31}],
  }, 
  {
    type: 'bar',
    input: { x: 'time', y: 'bar' },
  },
  {
    type: 'bar',
    input: { x: 'time2', y: 'line' },
  },
  {
    type: 'bar',
    data: [1, 2, 3],
    axes: { x: 'cat' }
  },
  {
    type: 'bar',
    data: { x: ['before', 'in between', after'], y: [ 5, 4, 3]},
    axes: { x: 'cat' }
  },
  {
    type: 'scatter',
    input: 'scatter'
  }
  {
    type: 'financial',
    data: data, 
    input: { x: 'time' }, // o, h, c and l would be defaults
  }
],
scales: {
  x: {
    type: 'time',
  },
  y: {
    type: 'linear'
  },
  cat: {
    type: 'category',
    labels: ['a', 'b', 'c'],
    position: 'top'
  },
  ord: {
    type: 'ordinal',
    position: 'bottom'
  }
}});

benmccann commented 4 years ago

@kurkle I like your suggestion. I'm not sure I've thought through all the edge cases yet, but my initial reaction is that it should work well.

If we accept input in multiple formats, we'll still have to store it internally in a single format after parsing. What do you think about using columnar storage internally?

kurkle commented 4 years ago

It really does not make sense to me, but I might be biased. Would have to make a draft to work out the pros and cons.

etimberg commented 4 years ago

One thing I like about the current format is that it is very easy to understand. Perhaps we can do an internal transformation at the start of create / update. We could parse out min, max, etc and keep it as meta data. That would make the internal uses easier.

As an aside, I don't entirely follow what the o, l, h, and c properties are in the examples above

kurkle commented 4 years ago

open, high, low and close (ohlc)

benmccann commented 4 years ago

After playing around a bit and thinking about this a little more, I tend to think that storing the data column-wise internally wouldn't be much of an advantage.

benmccann commented 4 years ago

I do wonder though whether we should store each row as an array or an object

Arrays would be smaller to transfer over the wire. I expect they'd also take less memory

Surprisingly when I did a small benchmark for access only it seemed like an objects won out: https://jsperf.com/chartjs-internal-storage I wouldn't have expected that to be the case

benmccann commented 4 years ago

I'm closing this since we seem to be in a pretty good place now in terms of performance

chartjs / Chart.js

Consider new format for passing data in #6696