CartoDB / Windshaft-cartodb

Windshaft tailored for CARTO
BSD 3-Clause "New" or "Revised" License
72 stars 58 forks source link

Improve efficiency of query samples (esp. for FDW's) #1120

Closed dgaubert closed 5 years ago

dgaubert commented 5 years ago

Fixes #1118

Note: the new sample implementation may be inefficient when the numeric ID column (cartodb_id) has many (or moderately many) gaps.

Algunenano commented 5 years ago

Improve efficiency of query samples (esp. for FDW's)

Do you have any benchmarks? What's the expected impact in different scenarios (small vs big tables, compact vs sparse ids)?

dgaubert commented 5 years ago

While performing benchmarks, I found this error:

{
    "errors": [
        "TABLESAMPLE clause can only be applied to tables and materialized views"
    ],
    "errors_with_context": [
        {
            "type": "unknown",
            "message": "TABLESAMPLE clause can only be applied to tables and materialized views"
        }
    ]
}

So TABLESAMPLE BERNOULLI doesn't work with foreign tables. Going to hijack the code to avoid using it just for testing purposes.

dgaubert commented 5 years ago

Benchmark

Map Config

{
    "version": "1.8.0",
    "layers": [
        {
            "type": "mapnik",
            "options": {
                "sql": "select * from {table|foreign_table}",
                "cartocss_version": "2.3.0",
                "cartocss": "#layer{marker-placement:point;marker-allow-overlap:true;marker-line-opacity:0.2;marker-line-width:0.5;marker-opacity:1;marker-width:5;marker-fill:red;}",
                "metadata": {
                    "sample": {
                        "num_rows": 1000
                    }
                }
            }
        }
    ]
}

Branch: master (using random() method to get sample)

$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   668.43ms  181.12ms   1.51s    81.03%
    Req/Sec    10.56      8.42    40.00     68.50%
  445 requests in 30.10s, 141.13MB read
Requests/sec:     14.78
Transfer/sec:      4.69MB

Branch: improve-metadata-sample

$ wrk -s ow.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   509.24ms  200.36ms   1.98s    87.22%
    Req/Sec    13.35      9.12    40.00     70.61%
  602 requests in 30.09s, 190.99MB read
Requests/sec:     20.00
Transfer/sec:      6.35MB
$ wrk -s ow_local.lua http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d -d 30
Running 30s test @ http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   371.73ms  103.41ms   1.02s    80.42%
    Req/Sec    14.39      7.03    40.00     83.97%
  806 requests in 30.10s, 255.79MB read
Requests/sec:     26.77
Transfer/sec:      8.50MB

cc/ @Algunenano

dgaubert commented 5 years ago

Corner cases

I was wondering how this branch behaves with datasets with gaps in the primary key sequence (cartodb_id). I performed some tests:

Map Config

{
    "version": "1.8.0",
    "layers": [
        {
            "type": "mapnik",
            "options": {
                "sql": "select * from {ow_odd|ow_gap}",
                "cartocss_version": "2.3.0",
                "cartocss": "#layer{marker-placement:point;marker-allow-overlap:true;marker-line-opacity:0.2;marker-line-width:0.5;marker-opacity:1;marker-width:5;marker-fill:red;}",
                "metadata": {
                    "sample": {
                        "num_rows": 1000
                    }
                }
            }
        }
    ]
}

Datasets with many small gaps in cartodb_id sequence (50% sequence is used)

$ curl -X POST -H 'Host: cdb.localhost.lan' -H 'Content-Type: application/json' -d @ow_local_odd.json "http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d" > ow_local_odd_layergroup.json
$ node
> const l = require('./ow_local_gap_layergroup.json')
undefined
> l.metadata.layers[0].meta.stats.sample.length
496

Datasets with one big gap in cartodb_id sequence (50% sequence is used)

$ curl -X POST -H 'Host: cdb.localhost.lan' -H 'Content-Type: application/json' -d @ow_local_gap.json "http://0.0.0.0:8181/api/v1/map/?api_key=5d56****473d" > ow_local_gap_layergroup.json
ubuntu@cdb-dev:/vagrant/utils/live-connectors/sample$ node
> const l = require('./ow_local_gap_layergroup.json')
undefined
> l.metadata.layers[0].meta.stats.sample.length
539

As expected, it's able to get the ~50% of the requested number of rows. Such kind of scenarios are unusual in our infrastructure but we must check if returning fewer data to Carto-VL is enough to work with.

dgaubert commented 5 years ago

Took a look to CartoVL and it doesn't seem to check anything with the sample, it just uses it to create GlobalPercentile, GlobalQuantiles, GlobalStandardDev, and GlobalHistogram.