Improve efficiency of query samples (esp. for FDW's)

dgaubert commented 5 years ago

Fixes #1118

Note: the new sample implementation may be inefficient when the numeric ID column (cartodb_id) has many (or moderately many) gaps.

Algunenano commented 5 years ago

Improve efficiency of query samples (esp. for FDW's)

Do you have any benchmarks? What's the expected impact in different scenarios (small vs big tables, compact vs sparse ids)?

dgaubert commented 5 years ago

While performing benchmarks, I found this error:

Branch: master, instantiating a map using a foreign table.

{
    "errors": [
        "TABLESAMPLE clause can only be applied to tables and materialized views"
    ],
    "errors_with_context": [
        {
            "type": "unknown",
            "message": "TABLESAMPLE clause can only be applied to tables and materialized views"
        }
    ]
}

So TABLESAMPLE BERNOULLI doesn't work with foreign tables. Going to hijack the code to avoid using it just for testing purposes.

dgaubert commented 5 years ago

Benchmark

Map Config

{
    "version": "1.8.0",
    "layers": [
        {
            "type": "mapnik",
            "options": {
                "sql": "select * from {table|foreign_table}",
                "cartocss_version": "2.3.0",
                "cartocss": "#layer{marker-placement:point;marker-allow-overlap:true;marker-line-opacity:0.2;marker-line-width:0.5;marker-opacity:1;marker-width:5;marker-fill:red;}",
                "metadata": {
                    "sample": {
                        "num_rows": 1000
                    }
                }
            }
        }
    ]
}

Branch: master (using random() method to get sample)

Foreign table (800K points), map instantiation:

$ wrk -s ow.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30 -t 1 -c 1
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
1 threads and 1 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
Latency     0.00us    0.00us   0.00us    -nan%
Req/Sec     0.00      0.00     0.00    100.00%
6 requests in 30.08s, 1.90MB read
Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:      0.20
Transfer/sec:     64.76KB

Note: I used 1 thread / 1 connection, otherwise the benchmark doesn't get any response.

Local table (800K points), map instantiation:

$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30 -t 1 -c 1
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
1 threads and 1 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
Latency   100.12ms   13.09ms 194.29ms   85.00%
Req/Sec    10.09      2.33    20.00     91.96%
300 requests in 30.04s, 95.15MB read
Requests/sec:      9.99
Transfer/sec:      3.17MB

$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   668.43ms  181.12ms   1.51s    81.03%
    Req/Sec    10.56      8.42    40.00     68.50%
  445 requests in 30.10s, 141.13MB read
Requests/sec:     14.78
Transfer/sec:      4.69MB

Branch: improve-metadata-sample

Foreign table (800K points), map instantiation:

$ wrk -s ow.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30 -t 1 -c 1
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
1 threads and 1 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
Latency    70.90ms   18.08ms 284.70ms   91.86%
Req/Sec    14.28      5.10    20.00     52.38%
426 requests in 30.03s, 135.15MB read
Requests/sec:     14.18
Transfer/sec:      4.50MB

$ wrk -s ow.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   509.24ms  200.36ms   1.98s    87.22%
    Req/Sec    13.35      9.12    40.00     70.61%
  602 requests in 30.09s, 190.99MB read
Requests/sec:     20.00
Transfer/sec:      6.35MB

Local table (800K points), map instantiation:

$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30 -t 1 -c 1
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
1 threads and 1 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
Latency    56.06ms   31.41ms 271.20ms   89.84%
Req/Sec    19.37      7.09    30.00     53.17%
567 requests in 30.05s, 179.93MB read
Requests/sec:     18.87
Transfer/sec:      5.99MB

$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   371.73ms  103.41ms   1.02s    80.42%
    Req/Sec    14.39      7.03    40.00     83.97%
  806 requests in 30.10s, 255.79MB read
Requests/sec:     26.77
Transfer/sec:      8.50MB

cc/ @Algunenano

dgaubert commented 5 years ago

Corner cases

I was wondering how this branch behaves with datasets with gaps in the primary key sequence (cartodb_id). I performed some tests:

Map Config

{
    "version": "1.8.0",
    "layers": [
        {
            "type": "mapnik",
            "options": {
                "sql": "select * from {ow_odd|ow_gap}",
                "cartocss_version": "2.3.0",
                "cartocss": "#layer{marker-placement:point;marker-allow-overlap:true;marker-line-opacity:0.2;marker-line-width:0.5;marker-opacity:1;marker-width:5;marker-fill:red;}",
                "metadata": {
                    "sample": {
                        "num_rows": 1000
                    }
                }
            }
        }
    ]
}

Datasets with many small gaps in `cartodb_id` sequence (50% sequence is used)

$ curl -X POST -H 'Host: cdb.localhost.lan' -H 'Content-Type: application/json' -d @ow_local_odd.json "http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d" > ow_local_odd_layergroup.json
$ node
> const l = require('./ow_local_gap_layergroup.json')
undefined
> l.metadata.layers[0].meta.stats.sample.length
496

Datasets with one big gap in `cartodb_id` sequence (50% sequence is used)

$ curl -X POST -H 'Host: cdb.localhost.lan' -H 'Content-Type: application/json' -d @ow_local_gap.json "http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d" > ow_local_gap_layergroup.json
ubuntu@cdb-dev:/vagrant/utils/live-connectors/sample$ node
> const l = require('./ow_local_gap_layergroup.json')
undefined
> l.metadata.layers[0].meta.stats.sample.length
539

As expected, it's able to get the ~50% of the requested number of rows. Such kind of scenarios are unusual in our infrastructure but we must check if returning fewer data to Carto-VL is enough to work with.

dgaubert commented 5 years ago

Took a look to CartoVL and it doesn't seem to check anything with the sample, it just uses it to create GlobalPercentile, GlobalQuantiles, GlobalStandardDev, and GlobalHistogram.

CartoDB / Windshaft-cartodb