cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.18k stars 3.82k forks source link

sql: sampling selects #7030

Open tbg opened 8 years ago

tbg commented 8 years ago

Postgres in 9.5 introduced the TABLESAMPLE clause:

A TABLESAMPLE clause after a table_name indicates that the specified sampling_method should be used to retrieve a subset of the rows in that table. This sampling precedes the application of any other filters such as WHERE clauses. The standard PostgreSQL distribution includes two sampling methods, BERNOULLI and SYSTEM, and other sampling methods can be installed in the database via extensions.

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression. (Other sampling methods might accept more or different arguments.) These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows. The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The optional REPEATABLE clause specifies a seed number or expression to use for generating random numbers within the sampling method. The seed value can be any non-null floating-point value. Two queries that specify the same seed and argument values will select the same sample of the table, if the table has not been changed meanwhile. But different seed values will usually produce different samples. If REPEATABLE is not given then a new random sample is selected for each query. Note that some add-on sampling methods do not accept REPEATABLE, and will always produce new samples on each use.

Prior to 9.5, similar things could be done manually: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html https://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql

Implementing something like TABLESAMPLE is likely relatively difficult, but we could check that a manual query which performs something similar is available and gets a somewhat decent query plan.

Opened this issue because I was asked about it in a recent tech talk.

Jira issue: CRDB-6181

knz commented 6 years ago

@petermattis @RaduBerinde what do you suggest we do about this?

AlexandreK38 commented 5 years ago

Will this be added to cockroach, and if so when? It's been 3 years since the issue was reported, so just wondering

knz commented 5 years ago

deferring to @RaduBerinde @rytaft for comments

rytaft commented 5 years ago

I am not aware of this feature being on the roadmap (cc @awoods187), but it wouldn't be very hard to implement the BERNOULLI method described above given that we're already doing something very similar for table statistics collection. Adding a REPEATABLE option should be relatively easy, but I'm not sure it would be that useful since any changes in data distribution could change the result, even without changes to the data itself.

Implementing the BERNOULLI method would actually be simpler than what we're already doing for CREATE STATISTICS because we wouldn't need to maintain a sample reservoir (that is only necessary to collect a pre-defined number of samples). We'd need to write a new DistSQL processor, but it would be VERY simple, consisting of a single random number generator to decide whether to keep or discard each row.

Adding a SYSTEM sampling method would require a different approach that is aware of how data is stored in RocksDB.

awoods187 commented 5 years ago

@Kumamon38 could you tell me a little bit more about how and why you'd like to use this potential feature?

AlexandreK38 commented 5 years ago

Hi and thanks everyone who answered and explained! I am currently using Nakama solution for my game and I was looking for the best way to pick random rows from the database. I saw on different posts on stack overflow several ways to do that and for Postgres the best one so far after years seemed to be the tablesample system since postgres 9.5, so I had a look if you did something similar and found this post. For now I managed to simplify how I pick my players without a random so it’s fine. I was just wondering if you will add it some day, not asking to add it :) Thanks anyway!

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

bladefist commented 3 years ago

Enterprise customer here, would like to throw our hat in this ring, we would be interested in this feature. Thank you.

rytaft commented 3 years ago

cc @vy-ton for visibility

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

rytaft commented 11 months ago

still relevant

israellot commented 4 months ago

Still relevant, I would benefit from this feature.