Tile format for histograms

stuartlynn commented 9 years ago

Currently I am using the SQL to do requests for Histogram tiles. The code can be seen here: It builds up a series of arrays of the histogram data and bins and then concatenates these together in the response. The SQL it generates looks like

WITH cte as( 
  select * from stuartlynn.mapsense 
  WHERE stuartlynn.mapsense.the_geom_webmercator && CDB_XYZ_Extent(6,4, 3) ),

cat_id_hist as ( 
  select array_agg(b.bins) as bins, array_agg(b.vals) as vals 
  from (
     select cte.cat_id as bins, count(*) vals 
     from cte group by cat_id
  ) as b), 

requests_hist as( 
    select array_agg(b.bins) as bins, array_agg(b.vals) as vals 
    from(
      select width_bucket(requests,0,20,20) bins ,count(*) vals 
      from cte  group by 1 order by 1
    ) as b)

select cat_id_hist.bins as cat_id_bins, cat_id_hist.vals as cat_id_vals,requests_hist.bins as requests_bins, requests_hist.vals as requests_vals
 from cat_id_hist, requests_hist

Where cat_id is for a category request and requests is for a histogram of a numerical variable. This is an example of the response from the query:

https://gist.github.com/stuartlynn/76e41e9c07ab01d92ee2

It returns a row with two arrays for each variable, one for the bins and the other for the values. In a json response though we probably want to have these formatted as something like:

{
  "mappings":{
      "cat_id":{
         "ids": [1, 33, 4],
         "labels": ["car", "bus", "van"]
      }
  },
  "requests":{
     "bins": [1,2,4],
     "values": [22, 33.3, 44.2] 
  },
  "cat_id":{
    "bins":[1,33,4],
    "values": [4,8, 23]
  }
}

Where the mappings relate the category values to their labels.

@javisantana

javisantana commented 9 years ago

I see some problems:

I miss the boundaries of the histogram for float values, you have the buckets but you need to map de buckets to real values
it's not possible to merge histograms for different tiles, we should have histograms where the buckets have coincident bounds limits. Let me clarify it with an example.

Imagine we have a histogram like this for a tile:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3, 4, 4, 4, 1]
range: [0.0, 10.0]
}

(so 5 buckets, range 0 -> 10)

and this one for another tile:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3, 4, 4, 4, 1]
range: [0.0, 2.0]
}

As you see the range is 0.0 -> 2.0 (everything else is the same, it does not mind for this example)

What would be the resulting histogram:

{
bins:     [0, 1, 2, 3, 4, 5],
values: [3 + 16, 4, 4, 4, 1]
range: [0.0, 10.0]
}

We can calculate it for the first bucket because the bucket size in the first histogram is 2.0 (10.0/5) and the second histogram range is inside one bucket. It would work if the second histogram range would be a a multiple of the bin size of the first histogram.

In general everything will work if the bin size for each histogram is:

a multiple of a given bucket size (defined for example by the full range / 64)
the range of the histogram match with the start of a bin bucket (it does not matter what is the size of the bin since the first rule limits the possible amount of bin sizes)

This is harder server side than client side, in client side you just need to do:

look for the tile with histogram with wider bins size
aggregate the other ones summing per bucket

stuartlynn commented 9 years ago

I 100% agree. I didn't get round to implementing it in the code but the plan I had was to pass through the max and min of the column to width_bucket(requests,0,min,max). That would keep the bin sizes consistent. I like your idea of having each bin be an integer multiple of a given amount for the aggregation.

The bins / range format looks fine as well. Lets go with that rather than the float boundaries for the bins.

javisantana commented 9 years ago

I worked on this, this is my proposal:

{
"bins": { "0": 10, "9": 1, ....},
"bounds": [1.1, 3.2],
"zoom": 10,
"x": 4
}

It sounds weird but let me explain a little bit each field:

bins: the count for each bin, max bins is 64, nothing new here
bounds: variable bounds, nothing new
zoom: here is the important part.

The histograms are built in the same way tiles are built, we have the concept of zoom. The zoom 0 has 64 bins, zoom 1 -> 128, zoom 2 -> 256

zoom says what is the zoom level for the bins.

x: x position in the zoom level, the same than tilex, tiley in a tile

The interesting part, how to merge two histograms

we get the min zoom
we transform the histogram with the highest zoom level to the other zoom, level:

function transform_histogram_to(hist, zoom) {
    var new_hist = {
       bins: {}
       bounds: hist.bounds
       zoom: zoom, 
       x: 0
    }
    var zoom_diff = zoom - hist.zoom;
    new_hist.x = hist.x >> zoom_diff;
    // aggregate
    for(var k in h.bins) {
       new_hist[(h.x + k) >> zoom_diff - new_hist.x] += hist.bins[k];
    }
    return new_hist;
}

merge them (untested code)

function merge_histogram(h0, h1) {
    var new_hist = {
       bins: {}
       bounds: [min(h0.bounds[0], h1.bounds[0]), max(h0.bounds[0], h1.bounds[1])],
       zoom: h0.zoom, // zoom should be the same
       x: min(h0.x, h1.x)
    }

    var zoom_levels = 
    // check the range isn't too big
    var bucket_range = max(h0.x, h1.x) - new_hist.x
    if ( bucket_range > 64) {
      // zoom out  
      zoom_levels = ceil(log(float(bucket_range)/64)/log(2));
    }

    new_hist.x = new_hist.x >> zoom_levels;

    for(var k in h0.bins) {
      var idx = (h0.x + k) >> zoom_levels
       if (new_hist[idx]) {  
         new_hist[idx]  = 0   
      }
       new_hist[idx - new_hist.x] += h0.bins[k];
    }

    for(var k in h1.bins) {
      var idx = (h1.x + k) >> zoom_levels
       if (new_hist[idx]) {  
         new_hist[idx]  = 0   
      }
       new_hist[idx - new_hist.x] += h1.bins[k];
    }

    return new_hist;

}

Probably the code does not work and can be improved, it's only to illustrate the thing.

Another possibility, instead of sending the x just have it sum in the bins, like {1020: 3, 1021: 3... }

stuartlynn commented 9 years ago

This looks good. I think it will work in the front end just fine.

CartoDB / torque

Tile format for histograms #229

The interesting part, how to merge two histograms