cockroachdb / movr

A fictional ride sharing company.
Apache License 2.0

Schema of vehicle_location_histories is not suitable for archival partitioning #75

Open jseldess opened 5 years ago

jseldess commented 5 years ago

I've been trying out archival partitioning, and when I try to partition the vehicle_location_histories table based on timestamp, I get this error:

root@10.142.0.78:26257/movr> ALTER TABLE vehicle_location_histories PARTITION BY RANGE (timestamp) (PARTITION archived_locations VALUES FROM (MINVALUE) TO ('2019-03-16 02:12:50.759879+00:00'), PARTITION archived_locations VALUES FROM ('2019-03-16 02:12:50.759879+00:00') TO (MAXVALUE));
pq: declared partition columns (timestamp) do not match first 1 columns in index being partitioned (city)

Looking at the table schema, the primary key includes city, then ride_id, then timestamp:

root@10.142.0.78:26257/movr> show create table vehicle_location_histories;
          table_name         |                                     create_statement
+----------------------------+-------------------------------------------------------------------------------------------+
  vehicle_location_histories | CREATE TABLE vehicle_location_histories (
                             |     city VARCHAR NOT NULL,
                             |     ride_id UUID NOT NULL,
                             |     "timestamp" TIMESTAMP NOT NULL,
                             |     lat FLOAT8 NULL,
                             |     long FLOAT8 NULL,
                             |     CONSTRAINT "primary" PRIMARY KEY (city ASC, ride_id ASC, "timestamp" ASC),
                             |     CONSTRAINT fk_city_ref_rides FOREIGN KEY (city, ride_id) REFERENCES rides (city, id),
                             |     FAMILY "primary" (city, ride_id, "timestamp", lat, long)
                             | )
(1 row)

I don't understand why city and ride_id need to be in the primary key. If it were just timestamp, I think this would work fine. @nstewart, let me know what you were thinking.

jseldess commented 5 years ago

Or the primary key could contain all 3 columns but start with timestamp. Whatever column you're partitioning on needs to be first, I believe.
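A hedged sketch of what that could look like, reusing the existing columns and the boundary timestamp from the failed command above (foreign key and family omitted for brevity; note the two partitions would also need distinct names, unlike in that command):

```sql
-- Sketch (not the shipped schema): declare the PK with "timestamp"
-- first so RANGE partitioning on it is legal.
CREATE TABLE vehicle_location_histories (
    city VARCHAR NOT NULL,
    ride_id UUID NOT NULL,
    "timestamp" TIMESTAMP NOT NULL,
    lat FLOAT8 NULL,
    long FLOAT8 NULL,
    CONSTRAINT "primary" PRIMARY KEY ("timestamp" ASC, city ASC, ride_id ASC)
) PARTITION BY RANGE ("timestamp") (
    PARTITION archived_locations VALUES FROM (MINVALUE) TO ('2019-03-16 02:12:50.759879+00:00'),
    PARTITION current_locations VALUES FROM ('2019-03-16 02:12:50.759879+00:00') TO (MAXVALUE)
);
```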

jseldess commented 5 years ago

There are actually a few open questions for me, and I'd like your perspectives, @bdarnell and @awoods187:

  1. If we were to use MovR for an archival-partitioning-only demonstration, for example, in a single-datacenter deployment, where partitioning by city doesn't make sense, then I think the schema of this table only needs timestamp in the primary key. Thoughts?
  2. However, if we were to use MovR for geo-partitioning, for example, a multi-region deployment, would it also be possible to do archival partitioning just on this table? I don't really see how, as we would want the table to be geo-partitioned by city, but we would want it also to be archival-partitioned by timestamp.

I'm also wondering, for 1, whether it might be better to use a different column in vehicle_location_histories as the archival-partitioning key, something like a present_location boolean. With timestamp, you don't really have a way to know whether a location is current or historical: a vehicle might go unused for a while, so picking some timestamp (or interval before now()) as the archival cutoff will be inefficient in some cases.

bdarnell commented 5 years ago

  1. Right.

  2. Combining multiple kinds of partitioning is complicated. You have to nest one inside the other, probably making the PK (city, timestamp) and then performing archival partitioning separately in each geo-partition.
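If nesting were the route taken, it might look roughly like this, using CockroachDB's subpartition syntax. This is a sketch, not a tested schema: the city lists and cutoff date are illustrative, and it assumes the PK has been reordered so city and "timestamp" lead it:

```sql
-- Hypothetical sketch: geo-partition by city, then archival-partition
-- each geo-partition by timestamp. Assumes PK (city, "timestamp", ...).
ALTER TABLE vehicle_location_histories
    PARTITION BY LIST (city) (
        PARTITION us_east VALUES IN ('new york', 'boston')
            PARTITION BY RANGE ("timestamp") (
                PARTITION us_east_archived VALUES FROM (MINVALUE) TO ('2019-03-16'),
                PARTITION us_east_current VALUES FROM ('2019-03-16') TO (MAXVALUE)
            ),
        PARTITION us_west VALUES IN ('seattle', 'san francisco')
            PARTITION BY RANGE ("timestamp") (
                PARTITION us_west_archived VALUES FROM (MINVALUE) TO ('2019-03-16'),
                PARTITION us_west_current VALUES FROM ('2019-03-16') TO (MAXVALUE)
            )
    );
```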

I think what you're describing with the present_location boolean is something different from archival partitioning. I'd do what I think you're describing with two tables: a present-location table that doesn't attempt to retain any history and a history table with archival partitioning that doesn't attempt to worry about the present.
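A minimal sketch of that two-table split; all names here are hypothetical, not part of the MovR schema:

```sql
-- One row per vehicle: the present, overwritten in place.
CREATE TABLE vehicle_present_locations (
    vehicle_id UUID PRIMARY KEY,
    lat FLOAT8,
    long FLOAT8,
    updated_at TIMESTAMP
);

-- Append-only history, archival-partitioned by timestamp.
CREATE TABLE vehicle_location_history (
    "timestamp" TIMESTAMP NOT NULL,
    vehicle_id UUID NOT NULL,
    lat FLOAT8,
    long FLOAT8,
    PRIMARY KEY ("timestamp", vehicle_id)
) PARTITION BY RANGE ("timestamp") (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2019-03-16'),
    PARTITION current VALUES FROM ('2019-03-16') TO (MAXVALUE)
);
```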

nstewart commented 5 years ago

For #1, you still wouldn't want just timestamp because that would result in a hotspot. Either way, I would at least put vehicle id first. cc @bdarnell for additional thoughts

@jseldess re: present location - that is currently modeled on the vehicle itself as current_location https://github.com/cockroachdb/movr/blob/master/models.py#L71 -- it's a string address at the moment, but we could change that to a lat long to tie out with the new locations table

bdarnell commented 5 years ago

Archival partitioning kind of inherently creates hotspots - if you put vehicle ID first then you no longer have the time-ordering that allows you to create partitions based on age. The general solution to this kind of hotspot is to add a shard_id to the key, but in archival partitioning that would require partitioning by a combination of list and range which we don't appear to currently support (and even if we did, this would be a lot of manual work).
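For illustration, the application-level shard_id workaround might be sketched like this, assuming the app computes the shard on insert (e.g. hash of ride_id mod 8). It spreads writes across eight smaller hotspots instead of one, at the cost of fanning out time-range reads across all shards:

```sql
-- Hypothetical sketch: app-maintained shard column leading the PK.
-- The app sets shard = hash(ride_id) % 8 on every insert.
CREATE TABLE vehicle_location_histories (
    shard INT NOT NULL CHECK (shard BETWEEN 0 AND 7),
    "timestamp" TIMESTAMP NOT NULL,
    ride_id UUID NOT NULL,
    lat FLOAT8,
    long FLOAT8,
    PRIMARY KEY (shard, "timestamp", ride_id)
);
```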

I'm not sure archival partitioning makes a lot of sense without some sort of support for sharded timestamp indexes (either directly or through a generalization of partitioning to combine range and list types at once). As things stand now, you can only use archival partitioning when your write volume is low enough that you can handle the hotspot, but then are you going to have enough data that partitioning will make much of a difference?

jseldess commented 5 years ago

I'm confused about the right scenario for demonstration purposes, but in the simplest scenario, with archival partitioning by range on timestamp, wouldn't load-based splitting help with the hotspot issue?

jseldess commented 5 years ago

I just read this range partitioning caveat as well:

Partition values can be any SQL expression, but it’s only evaluated once. If you create a partition with value < (now() - '1d') on 2017-01-30, it would contain all values less than 2017-01-29. It would not update the next day; it would continue to contain values less than 2017-01-29.

Doesn't this limitation make it pretty hard to use archival partitioning in a meaningful way? Wouldn't you want to archive data based on its age, and not based on some pre-defined date? I suppose you could update the archival partition with a new archival date every so often, but that seems burdensome.

nstewart commented 5 years ago

Load-based splitting wouldn't help with the hotspot issue with a timestamp-ordered index. The QPS per range would never be distributed with auto splits because we would always write to the top (or bottom) range in the timestamp-sorted map.

It would help with a shard key partitioned schema though.

It sounds like this feature would be experimental or beta by our current standards. I'll add it to airtable as a placeholder. cc @awoods187

bdarnell commented 5 years ago

Doesn't this limitation make it pretty hard to use archival partitioning in a meaningful way? Wouldn't you want to archive data based on its age, and not based on some pre-defined date? I suppose you could update the archival partition with a new archival date every so often, but that seems burdensome.

Yes, archival partitioning currently requires you to periodically update your partition boundaries.
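For illustration, that periodic update would amount to re-running the partitioning statement with a fresh cutoff, e.g. from a scheduled job. The date and partition names are illustrative, and this assumes "timestamp" leads the primary key:

```sql
-- Hypothetical sketch: re-issued periodically (e.g. nightly) to move
-- the archival boundary forward. Repartitioning replaces the old
-- boundaries wholesale.
ALTER TABLE vehicle_location_histories
    PARTITION BY RANGE ("timestamp") (
        PARTITION archived VALUES FROM (MINVALUE) TO ('2019-04-01'),
        PARTITION current VALUES FROM ('2019-04-01') TO (MAXVALUE)
    );
```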

It [load-based splitting] would help with a shard key partitioned schema though.

Not really - the partitioning would ensure that there are splits at the shard key boundaries, and then within each shard you're back to the single hotspot that can't usefully be split due to load. Load-based splitting would help for sharded timestamp keys without partitioning, though.

It sounds like this feature would be experimental or beta by our current standards.

Archival is a use case, the feature is partitioning by range. I agree that addressing archival needs with range-based partitioning is an experimental scenario. I'm not sure that this implies that range-based partitioning is also experimental, although I don't think we have any ideas for a non-time-based use of the feature.

Actually, archival partitioning may be two separate use cases, each experimental in their own ways: a need for efficiently dropping older data (which is probably better served by implementing row-based TTLs) and distinguishing hot and cold storage (the canonical example of this is to put hot data on SSD and cold on HDD, but with our current limits on ranges per store, I don't think we can make effective use of HDDs).

nstewart commented 5 years ago

Not really - the partitioning would ensure that there are splits at the shard key boundaries, and then within each shard you're back to the single hotspot that can't usefully be split due to load. Load-based splitting would help for sharded timestamp keys without partitioning, though.

Right - I didn't mean partitioning as in the feature. I just meant "splitting up" the timeseries data by virtue of having a shard key. Wrong word choice!

Thanks for the additional color @bdarnell -- we'll track this on the product side.

bdarnell commented 5 years ago

Sharding timestamp keys is (currently) an application-level workaround that our SEs and I have been recommending. It's not even an experimental feature yet, although I think we should think about building something that makes this easier (and is compatible with range-based partitioning, although this is much less of a concern for me than just making the basic non-partitioned functionality easier to discover and use).

jseldess commented 5 years ago

@bdarnell, @nstewart, given the above, our current documentation is misleading: We highlight archival partitioning as a main use case, and the partition by range example focuses on this use case without calling out any of the limitations mentioned above.

What should we do? Update the docs to de-emphasize archival partitioning? Change the partition by range example to have another focus?

I also think it's probably important to inform SEs about this. I'm pretty sure they're selling various use cases for partitioning, including archival, even though nobody seems to have implemented it.

bdarnell commented 5 years ago

Yes, we should de-emphasize archival partitioning in the docs.

As an alternative use case for partitioning by range, maybe we could give an example with keys like "us-chicago" and "us-newyork" with partitions on the range "us" to "ut"? Other than that, I think archival partitioning is still the best use case for partitioning by range, even though it's only suitable for workloads where sharding isn't required to achieve the necessary write throughput.
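For illustration, that alphabetical-prefix idea might be sketched as follows (table and partition names are hypothetical). Because strings sort lexicographically, every key beginning with "us-" falls between 'us' and 'ut':

```sql
-- Hypothetical sketch: range-partitioning a string key by prefix.
CREATE TABLE cities (
    id STRING PRIMARY KEY  -- e.g. 'us-chicago', 'us-newyork', 'ca-toronto'
) PARTITION BY RANGE (id) (
    PARTITION before_us VALUES FROM (MINVALUE) TO ('us'),
    PARTITION us VALUES FROM ('us') TO ('ut'),
    PARTITION after_us VALUES FROM ('ut') TO (MAXVALUE)
);
```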