CartoDB / torque

Temporal mapping for CARTO
http://cartodb.github.com/torque/
BSD 4-Clause "Original" or "Old" License
397 stars 129 forks source link

Tile format for histograms #229

Open stuartlynn opened 9 years ago

stuartlynn commented 9 years ago

Currently I am using the SQL to do requests for Histogram tiles. The code can be seen here: It builds up a series of arrays of the histogram data and bins and then concatenates these together in the response. The SQL it generates looks like

WITH cte as( 
  select * from stuartlynn.mapsense 
  WHERE stuartlynn.mapsense.the_geom_webmercator && CDB_XYZ_Extent(6,4, 3) ),

cat_id_hist as ( 
  select array_agg(b.bins) as bins, array_agg(b.vals) as vals 
  from (
     select cte.cat_id as bins, count(*) vals 
     from cte group by cat_id
  ) as b), 

requests_hist as( 
    select array_agg(b.bins) as bins, array_agg(b.vals) as vals 
    from(
      select width_bucket(requests,0,20,20) bins ,count(*) vals 
      from cte  group by 1 order by 1
    ) as b)

select cat_id_hist.bins as cat_id_bins, cat_id_hist.vals as cat_id_vals,requests_hist.bins as requests_bins, requests_hist.vals as requests_vals
 from cat_id_hist, requests_hist

Where cat_id is for a category request and requests is for a histogram of a numerical variable. This is an example of the response from the query:

https://gist.github.com/stuartlynn/76e41e9c07ab01d92ee2

It returns a row with two arrays for each variable, one for the bins and the other for the values. In a json response though we probably want to have these formatted as something like:

{
  "mappings":{
      "cat_id":{
         "ids": [1, 33, 4],
         "labels": ["car", "bus", "van"]
      }
  },
  "requests":{
     "bins": [1,2,4],
     "values": [22, 33.3, 44.2] 
  },
  "cat_id":{
    "bins":[1,33,4],
    "values": [4,8, 23]
  }
}

Where the mappings relate the category values to their labels.

@javisantana

javisantana commented 9 years ago

I see some problems:

Imagine we have a histogram like this for a tile:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3, 4, 4, 4, 1]
range: [0.0, 10.0]
}

(so 5 buckets, range 0 -> 10)

and this one for another tile:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3, 4, 4, 4, 1]
range: [0.0, 2.0]
}

As you see the range is 0.0 -> 2.0 (everything else is the same, it does not mind for this example)

What would be the resulting histogram:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3 + 16, 4, 4, 4, 1]
range: [0.0, 10.0]
}

We can calculate it for the first bucket because the bucket size in the first histogram is 2.0 (10.0/5) and the second histogram range is inside one bucket. It would work if the second histogram range would be a a multiple of the bin size of the first histogram.

In general everything will work if the bin size for each histogram is:

This is harder server side than client side, in client side you just need to do:

stuartlynn commented 9 years ago

I 100% agree. I didn't get round to implementing it in the code but the plan I had was to pass through the max and min of the column to width_bucket(requests,0,min,max). That would keep the bin sizes consistent. I like your idea of having each bin be an integer multiple of a given amount for the aggregation.

The bins / range format looks fine as well. Lets go with that rather than the float boundaries for the bins.

javisantana commented 9 years ago

I worked on this, this is my proposal:

{
"bins": { "0": 10, "9": 1, ....},
"bounds": [1.1, 3.2],
"zoom": 10,
"x": 4
}

It sounds weird but let me explain a little bit each field:

The histograms are built in the same way tiles are built, we have the concept of zoom. The zoom 0 has 64 bins, zoom 1 -> 128, zoom 2 -> 256

zoom says what is the zoom level for the bins.

The interesting part, how to merge two histograms

function transform_histogram_to(hist, zoom) {
    var new_hist = {
       bins: {}
       bounds: hist.bounds
       zoom: zoom, 
       x: 0
    }
    var zoom_diff = zoom - hist.zoom;
    new_hist.x = hist.x >> zoom_diff;
    // aggregate
    for(var k in h.bins) {
       new_hist[(h.x + k) >> zoom_diff - new_hist.x] += hist.bins[k];
    }
    return new_hist;
}
function merge_histogram(h0, h1) {
    var new_hist = {
       bins: {}
       bounds: [min(h0.bounds[0], h1.bounds[0]), max(h0.bounds[0], h1.bounds[1])],
       zoom: h0.zoom, // zoom should be the same
       x: min(h0.x, h1.x)
    }

    var zoom_levels = 
    // check the range isn't too big
    var bucket_range = max(h0.x, h1.x) - new_hist.x
    if ( bucket_range > 64) {
      // zoom out  
      zoom_levels = ceil(log(float(bucket_range)/64)/log(2));
    }

    new_hist.x = new_hist.x >> zoom_levels;

    for(var k in h0.bins) {
      var idx = (h0.x + k) >> zoom_levels
       if (new_hist[idx]) {  
         new_hist[idx]  = 0   
      }
       new_hist[idx - new_hist.x] += h0.bins[k];
    }

    for(var k in h1.bins) {
      var idx = (h1.x + k) >> zoom_levels
       if (new_hist[idx]) {  
         new_hist[idx]  = 0   
      }
       new_hist[idx - new_hist.x] += h1.bins[k];
    }

    return new_hist;

}

Probably the code does not work and can be improved, it's only to illustrate the thing.

Another possibility, instead of sending the x just have it sum in the bins, like {1020: 3, 1021: 3... }

stuartlynn commented 9 years ago

This looks good. I think it will work in the front end just fine.