MVT slowness and fixes - Githubissues

juanignaciosl commented 6 years ago

Extract from an email thread

We hit a performance wall at Tilesets API. While queries that fetch the geometries and data are fast, the generation of the MVT is slow (simpler example). It looks like json generation and string concatenation inside PG is costly. We asked for help from Ramsey, and he just came back (thanks!) with some insights (direct copy from a Slack group):

This is the worst possible way to report out on this, but started doing the profiling and as expected we burn a lot of time in attribute processing… some of it is I think recoverable: the call to get_key_index might be something we can work around. That still leaves the calls to numeric_out, which are turning some numbers into strings. There’s an interesting question: do we maybe have some data types in this query that the MVT system is not handling directly as numbers, but passing out via a string representation. That could be another place to fix things up. Will continue.

screenshot_200

Avoiding the JSONB code path is still inefficient (4x slower than the pre-made JSONB, but 2x faster than a code path that first constructs the JSONB) and also spends most of its time in value handling, though in a different code path than the JSONB one.

screenshot_201

well, just removing one redundant strlen() from the key lookup routine made the jsonb path 20% faster since there’s one in the value handler as well, there might be a big more to squeeze all that numeric data is expensive to handle … the hash table gets very large and each incoming value has to be checked to see if it’s in the dictionary this kind of thing really shows up what MVT was specified for, which was string-tagged OSM data not numeric data once again, goes back to Jaak’s suggestion MVT with ids, efficient attribute API and join on the client side I can squeeze a bit more out of value handling, but it’s a more invasive change than the key handling, unfortunately

I take it back, since most of your time is in numeric handling my change in the string value handling won’t make much difference

As you can see, there's a redesign suggestion: MVT with ids, efficient attribute API and join on the client side.

We'll take a look at this this week (I'm at the Response Team, but Mario is back from vacations). We can also try to move MVT generation from PostgreSQL to Node.

Discussion summary: although we all agree that the current approach is a perverse usage of an MVT, which are not designed to cope with hundred of attributes, there are time constraints that push us towards fixing this issue and being able to serve this kind of MVTs.

Second, into the numbers. I found the results quite surprising. This led me to do a bit of research about encoding properties in MVT tiles. I created a gist with my developments and some numbers, the results are even more surprising: it's faster to do two separate queries, one for the tile with no-properties and another for the properties, and do the merging and encoding of the final MVT tile with all the properties in JavaScript than doing all the operation in PostgreSQL/PostGIS. In my machine, the JavaScript approach takes 1/4 of the time of the PostGIS one.

So, we need to generate MVT faster than we do now. Alternatives:

a) Find a way through current implementation at PG and PGIS that is performant enough. b) It looks like there's some issue at PostGIS, so fixing it there might be the best way. c) Generate the MVT at Node.

Algunenano commented 6 years ago

One of the things I was attributing this slowness before any investigation was the use of json to store data instead of columns, but that can be discarded based on Rochoa's test. Parsing the data from columns in his node implementation is 5x faster than St_AsMVT, which is shaming since it's not even parallel :smile:

rochoa commented 6 years ago

Hahaha. My implementation is the crappiest one but still faster. 🙃

hannahblue commented 6 years ago

As summarized here:

Discussion summary: although we all agree that the current approach is a perverse usage of an MVT, which are not designed to cope with hundred of attributes, there are time constraints that push us towards fixing this issue and being able to serve this kind of MVTs.

We know this approach is not scalable. If Mastercard added a few more categories the number of attributes would explode further. We'll also have even more problems as we add other datasets. Plus, as @pramsey mentioned, separating the geometry and attributes into MVT/attributeAPI also opens up new possibilities on the client like changing attribute styling w/o reloading geometry (which seems like it would have a huge performance benefit). The API we build will need to serve more than just the MRLI application.

We need to figure out how to meet the deliverable in the near term, but also do the proposed redesign.

Would it be an option to keep using the temporary MBtiles on our basemaps infrastructure while we work on the redesign?
Alternatively, what if we go ahead with "fixing this issue and being able to serve this kind of MVTs.", and then implement the redesign?
Other ideas?

@rochoa you said:

with that [redesign] approach we will need to build libraries for different vector map libraries, e.g. Mapbox GL, and, still, that will generate friction when consuming this kind of data in those libraries for some basic exploratory analysis.

Can you explain in more detail what friction you expect?

Algunenano commented 6 years ago

Some notes about performance: Today @antoniocarlon asked me about numerics being autocasted into strings for ST_AsMVT. This is a current pain since it's not obvious that its doing it under the hood.

To avoid this, in Windshaft we first check the column types and manually discard anything that is not a known Mapnik type.

On the other hand, the Bigmetadata project circumvented this issue by casting to JSON. Since the JSON types in Postgresql only store numbers as numeric (no int, double precision, etc), ST_AsMVT had to add an special case for numeric when it is inside a JSON, but this isn't added for the numeric columns and it defaults to string.

Now some numbers:

Outputting numeric as strings in the provided dataset:

EXPLAIN ANALYZE
SELECT St_AsMVT(q) FROM ( SELECT * FROM tilertest.tract9full ) q;
                                                 QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=941.80..941.81 rows=1 width=32) (actual time=2290.162..2290.162 rows=1 loops=1)
->  Seq Scan on tract9full  (cost=0.00..934.64 rows=2864 width=3549) (actual time=0.011..1.315 rows=2864 loops=1)
Planning Time: 6.163 ms
Execution Time: 2290.550 ms

JSON transformations on the fly:

EXPLAIN ANALYZE
SELECT ST_AsMVT(q) FROM (
SELECT  mvt_geometry,
    (SELECT row_to_json(_)::JSONB FROM (
        SELECT 
            geoid,
--- a lot of other columns
) FROM (
    SELECT  *   
    FROM tilertest.tract9full
  ) p
) q;
                                                 QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=991.92..991.93 rows=1 width=32) (actual time=3580.884..3580.885 rows=1 loops=1)
->  Seq Scan on tract9full  (cost=0.00..934.64 rows=2864 width=3549) (actual time=0.012..2.394 rows=2864 loops=1)
SubPlan 1
 ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=1.078..1.079 rows=1 loops=2864)
Planning Time: 8.079 ms
Execution Time: 3582.535 ms

JSON transformations with a table:

CREATE TABLE tilertest.tract9json AS
(
SELECT  mvt_geometry,
    (SELECT row_to_json(_)::JSONB FROM (
        SELECT 
-- Bunch of columns
    ) AS _) AS mvtdata
FROM (
    SELECT  *   
    FROM tilertest.tract9full
  ) p
)

# EXPLAIN ANALYZE SELECT St_AsMVT(q) FROM ( SELECT * FROM tilertest.tract9json ) q;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=122.80..122.81 rows=1 width=32) (actual time=494.006..494.006 rows=1 loops=1)
   ->  Seq Scan on tract9json  (cost=0.00..115.64 rows=2864 width=392) (actual time=0.009..0.694 rows=2864 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 494.877 ms
(4 rows)

Casting the numeric columns on the fly to double precision:

EXPLAIN ANALYZE
SELECT St_AsMVT(q) FROM(
SELECT 
            mvt_geometry,
            geoid,
            XXXXXXX::double precision,
            XXXXXXY::double precision,
            --DO
            name,
--- A lot of other columns
FROM tilertest.tract9full
) q;
                                                 QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=5524.20..5524.21 rows=1 width=32) (actual time=1848.212..1848.212 rows=1 loops=1)
->  Seq Scan on tract9full  (cost=0.00..934.64 rows=2864 width=3549) (actual time=0.010..1.320 rows=2864 loops=1)
Planning Time: 4.385 ms
Execution Time: 1848.595 ms

Create a table with the values as double precision instead of numeric:

CREATE TABLE tilertest.tract9double AS
(
SELECT
            mvt_geometry,
            geoid,
            area_ratio::double precision,
            area::double precision,
            --DO
            sales_state_score_nep_20180102::double precision,
            sales_state_score_sb_20180102
FROM tilertest.tract9full
);

And use this table directly:

EXPLAIN ANALYZE
SELECT St_AsMVT(q) FROM (
SELECT * FROM tilertest.tract9double
) q;
                                                   QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1755.80..1755.81 rows=1 width=32) (actual time=1720.640..1720.640 rows=1 loops=1)
->  Seq Scan on tract9double  (cost=0.00..1748.64 rows=2864 width=5515) (actual time=0.015..1.667 rows=2864 loops=1)
Planning Time: 6.149 ms
Execution Time: 1721.097 ms
(4 rows)

I was expecting to see the same performance here as the json. To discard that this wasn't a double vs numeric issue, since most of the values are integers anyway, I decided to test this also using ints:

CREATE TABLE tilertest.tract9ints AS
(
    SELECT
                mvt_geometry,
                geoid,
                area_ratio::bigint,
                area::bigint,
                --DO
                sales_state_score_nep_20180102::bigint,
                sales_state_score_sb_20180102::bigint,
FROM tilertest.tract9full

EXPLAIN ANALYZE
SELECT St_AsMVT(q) FROM (
SELECT * FROM tilertest.tract9ints
) q;
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1755.80..1755.81 rows=1 width=32) (actual time=1724.212..1724.212 rows=1 loops=1)
   ->  Seq Scan on tract9ints  (cost=0.00..1748.64 rows=2864 width=5518) (actual time=0.013..1.828 rows=2864 loops=1)
 Planning Time: 6.163 ms
 Execution Time: 1724.632 ms
(4 rows)

So it's pretty obvious that either the final MVT aren't the same (I haven't verified them yet) or the JSON path is much faster. I'll keep digging.

BTW, this is still without applying some of the improvements Paul suggested to avoid recalculating the length of the strings.

rochoa commented 6 years ago

I don't know if you dug into my gist, but I used double precision for the measurement* columns.

rochoa commented 6 years ago

Just in case: remember, this is a public repository.

The summary is more complex than that.

We know this approach is not scalable. If Mastercard added a few more categories the number of attributes would explode further. We'll also have even more problems as we add other datasets.

The API is good enough from the visualization standpoint. Abusing the Tilesets API to resolve all the problems the MRLI application has to address is what does not scale well, but that's something we had as a constraint from the very beginning because we landed later in this project. By "abusing" I mean requesting 500+ properties for each geometry.

Plus, as @pramsey mentioned, separating the geometry and attributes into MVT/attribute API also opens up new possibilities on the client like changing attribute styling w/o reloading geometry (which seems like it would have a huge performance benefit).

Opens new possibilities? Yes. Adds more complexity? Yes. Let me elaborate.

The client/developer now needs to handle the join of the geometries and the properties. As a developer building this kind of applications/solutions, for a basic "render some geometries based on 5-10 metrics" use-case, I don't want to be responsible for doing anything more complex than throwing some URLs to my mapping library, being either CARTO VL or Mapbox GL.

Not having to load the geometries more than once might improve the performance but you are delegating other problems to the client that might degrade the performance. To start with, the developer is now responsible for doing the merging, which might lead to other performance problems that we no longer control. The alternative is to provide some libraries/abstractions that handle that complexity, in the case of CARTO VL that could make a lot of sense, but, what do we do for clients building applications with, for instance, Mapbox GL? Furthermore, if the developer has now the responsibility of merging the geometries with the properties, they have to focus on one more thing while building their application, a thing that's not the value of their application.

The API we build will need to serve more than just the MRLI application.

The MRLI application is a complex one, with no restrictions in the amount of data it can consume from the Tilesets API. If we had some limits, and we need to set those, for the number of properties you can request, we weren't talking here about the slowness of requesting 500+ properties. Also, we're assuming that separating geometries from properties will work flawlessly but I bet generating responses for 500+ properties with no geometries and merging them in the client with the existing geometries is not as fast as we are expecting.

Don't take me wrong, I'm not saying the Tilesets API is the answer to every problem. However, we cannot go in the opposite direction because it's not working out as we were desiring. With the best pieces of the Tilesets API and a few other APIs, we can have a very competitive offer in terms of data access for building applications.

Would it be an option to keep using the temporary MBtiles on our basemaps infrastructure while we work on the redesign?

Yes, that's why we did that extra work for buying time in case something arose.

Alternatively, what if we go ahead with "fixing this issue and being able to serve this kind of MVTs.", and then implement the redesign? Other ideas?

I would love to know a bit more about what is really happening with ST_AsMVT before taking any decision about the best approach.

I was writing all of the above before we had our meeting this afternoon. So consider this just a summary of our conversation. We are gonna start working on other applications soon, with that we will learn more about different use-cases, which will make us improve the Tilesets API and, most likely, have other alternatives for accessing the data. We might need a complete redesign or just some tweaks, but instead of taking radical decisions right now, we will do it when we have more information and we feel more prepared.

hannahblue commented 6 years ago

Thanks for summarizing, @rochoa. It's a sensible approach.

One remaining concern I don't think I've managed to articulate well is my concern with the 500+ properties. It could be a big justification for the split approach.

Here's my thinking: There are so many properties because there's a different score for every combination of attributes. This would generally be more scalable as rows rather than columns. Even in the UI there's a drop down for a single selection of each attribute which would work nicely if it were selecting from row values.

Presumably the reason it's columns instead of rows is because the MVT needs one "row" per feature. If we had the split API the attributes could be json, i.e. more richly structured.

To illustrate for anyone not familiar with the data: Here's a simple example (made up data). As columns: screen shot 2018-08-06 at 6 28 02 pm

As rows: screen shot 2018-08-06 at 6 08 14 pm

Right now, there are 5 score types, 3 regions, and 5 categories. What if we got sub categories? "Apparel" is a broad category. What if it was broken down into sub categories like Sporting Goods, Children's, and so on. What if we added other region scores too? From a data perspective these are reasonable changes, but it would have dire affects on the structure as it stands today.

Maybe there's something important I'm missing here. If so let me know!

juanignaciosl commented 6 years ago

Transposing columns to rows adds other problems, such as increasing the number of joins. Anyway, don't forget that data fetching works like a charm with the current approach. The only performance issue that we're facing right now is the generation of the MVT, and it looks like an implementation problem, not an essential one.

Algunenano commented 6 years ago

I'm having a look and reproducing the same behaviour as Paul when using normal columns:

From Postgresql code:

/*
 *      GetAttributeByName
 *      GetAttributeByNum
 *
 *      These functions return the value of the requested attribute
 *      out of the given tuple Datum.
 *      C functions which take a tuple as an argument are expected
 *      to use these.  Ex: overpaid(EMP) might call GetAttributeByNum().
 *      Note: these are actually rather slow because they do a typcache
 *      lookup on each call.
 */

I'm investigating if its possible (exposed by Postgresql headers) and what the performance benefit there is from caching some of the intermediate values avoid recalculation. That or look for an alternative way to do it.

Algunenano commented 6 years ago

Initial work in https://github.com/postgis/postgis/compare/svn-trunk...Algunenano:mvt_fast_parse

Testing with pgbench:

$ cat mc.pgb 
-- File to launch MVT requests with different zoom levels
-- Run with something like:
-- pgbench -c 4 -T 30 -f mc.pgb -U postgres mc_perf

SELECT St_AsMVT(q) FROM (
SELECT * FROM tilertest.tract9double
) q;

Before:

$ pgbench -c 4 -T 30 -f mc.pgb -U postgres mc_perf
starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: mc.pgb
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 30 s
number of transactions actually processed: 68
latency average = 1837.633 ms
tps = 2.176713 (including connections establishing)
tps = 2.176942 (excluding connections establishing)

After:

$ pgbench -c 4 -T 30 -f mc.pgb -U postgres mc_perf
starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: mc.pgb
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 30 s
number of transactions actually processed: 355
latency average = 341.166 ms
tps = 11.724487 (including connections establishing)
tps = 11.725683 (excluding connections establishing)

I still need to check that everything is under control and add more tests to Postgis, specially for NULL, pointer types (varlena) and user created types (geometry, eventhough it's odd to add it as a property value, but it is possible), but it looks like a 5x improvement just by that :fireworks:

pramsey commented 6 years ago

Nicely done

Algunenano commented 6 years ago

Working under the assumption that the columns from all the rows have the same order, I've added a small cache in the key lookup. For this use case it improves performance even more (almost 10x when combined with the other tweak):

$ pgbench -c 4 -T 30 -f mc.pgb -U postgres mc_perf
starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: mc.pgb
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 30 s
number of transactions actually processed: 603
latency average = 199.275 ms
tps = 20.072727 (including connections establishing)
tps = 20.074566 (excluding connections establishing)

I'll need to confirm that the assumption is correct; the tests pass but I'd rather confirm it with something more solid. JSON keys are excluded since those are variable and unordered.

Algunenano commented 6 years ago

And after caching the typeoid's too, avoiding Postgresql lookups (it's slow even with its cache):

$ pgbench -c 4 -T 30 -f mc.pgb -U postgres mc_perf
starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: mc.pgb
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 30 s
number of transactions actually processed: 803
latency average = 149.883 ms
tps = 26.687444 (including connections establishing)
tps = 26.690377 (excluding connections establishing)

Probably that's the limit for now it terms of improvement as most of the time can be attributed now to either Postgresql code, necessary conversions, hashing and protobuf.

I'll now try to confirm that the assumption is true, add extra tests around the encoder and check performance of different tiles (with less columns) to make sure there isn't any regression.

Algunenano commented 6 years ago

I've pushed the changes into Postgis trunk. I'll backport them to our 2.4 branch in the next release.

juanignaciosl commented 6 years ago

@Algunenano let us know when this can be upgraded, please :-)

Algunenano commented 6 years ago

While retesting everything with benchmarks, I've found a bug with my column optimization and parallel plans (the resources are only freed in one worker, so the other ones leak them). I'm working on a patch to land before release all the changes in our fork.

Algunenano commented 6 years ago

Postgis 2.4.4.3+carto-1 is released and should be installed in all PG10 production servers. Please let me know if you find any issue.

juanignaciosl commented 6 years ago

@Algunenano great! It's not installed at Tilesets API production, right?

How can I check the exact version? A \dx just says 2.4.4 (I logged into a production DB and it shows the same).

cc @owayss

Algunenano commented 6 years ago

The minor release is the same, so you can either check the package itself:

$ apt list postgis
Listing... Done
postgis/xenial,now 2.4.4.3+carto-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it

Or the build time (not great, but it works):

# Select postgis_lib_build_date();
postgis_lib_build_date 
------------------------
 2018-08-29 13:57:37
(1 row)

@Algunenano great! It's not installed at Tilesets API production, right?

I updated it as always (https://github.com/CartoDB/cartodb-platform/pull/4872) so I understand it should be updated

juanignaciosl commented 6 years ago

Oh, great, I didn't know about postgis_lib_build_date()! It is deployed at Tilesets API, then. No time for a comparison benchmark :sweat_smile:

Thanks!!!

CartoDB / bigmetadata

MVT slowness and fixes #541