GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

`offset` keyword not working as expected on partitioned data #2545

Closed · universalmind303 closed this 8 months ago

universalmind303 commented 8 months ago

Description

`offset` is not being applied properly when reading from files with multiple record batches.

Steps to reproduce

Create a large CSV. Let's use the TPC-H customer table.

NOTE: This is not limited to CSV; I tried other file formats and got the same results.

Use DuckDB to generate the TPC-H table:

D LOAD tpch;
D CALL dbgen(sf=1);
D COPY customer TO './customer.csv' (FORMAT 'csv', HEADER true);

Then query it in GlareDB:

> select c_custkey from './customer.csv' limit 1 offset 1;
┌───────────┐
│ c_custkey │
│        ── │
│     Int64 │
╞═══════════╡
│     50111 │
└───────────┘
> select c_custkey from './customer.csv' limit 1 offset 1;
┌───────────┐
│ c_custkey │
│        ── │
│     Int64 │
╞═══════════╡
│     75166 │
└───────────┘
> select c_custkey from './customer.csv' limit 1 offset 1;
┌───────────┐
│ c_custkey │
│        ── │
│     Int64 │
╞═══════════╡
│         2 │
└───────────┘
> select c_custkey from './customer.csv' limit 1 offset 1;
┌───────────┐
│ c_custkey │
│        ── │
│     Int64 │
╞═══════════╡
│     87728 │
└───────────┘

Other databases (ClickHouse and DuckDB) apply the offset correctly. GlareDB seems to apply the offset inside each individual record batch/partition and then return the first result, instead of applying it at the scan (file) level.
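For illustration, here's a minimal standalone sketch (plain Rust, not GlareDB code) of the difference between the two behaviors: applying the offset inside each partition and returning whichever partition's rows arrive first, versus applying it once over the whole scan. The partition data and the "first to finish" index are made up.

```rust
// Hypothetical illustration of per-partition vs. global OFFSET handling.

/// Each partition skips `offset` rows independently; the visible result is
/// whichever partition happens to produce output first (here simulated by
/// an explicit index), so the answer depends on scheduling.
fn per_partition(parts: &[Vec<i64>], limit: usize, offset: usize, first_done: usize) -> Vec<i64> {
    parts[first_done].iter().skip(offset).take(limit).copied().collect()
}

/// The offset is applied once, across all partitions in file order.
fn global(parts: &[Vec<i64>], limit: usize, offset: usize) -> Vec<i64> {
    parts.iter().flatten().skip(offset).take(limit).copied().collect()
}

fn main() {
    // Three record batches read from one file, in file order.
    let parts = vec![vec![1, 2, 3], vec![4, 5, 6], vec![7, 8, 9]];
    // LIMIT 1 OFFSET 1: the global answer is always [2] ...
    assert_eq!(global(&parts, 1, 1), vec![2]);
    // ... but per-partition it's [2], [5], or [8] depending on which batch "wins".
    for first in 0..parts.len() {
        println!("{:?}", per_partition(&parts, 1, 1, first));
    }
}
```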

tychoish commented 8 months ago

To be honest, I think offset without a sort isn't a particularly meaningful operation and it would be reasonable if we said "the order of results in any query without a sort is undefined."

A lot of databases will return results in a consistent order because it's faster and appears consistent, but I know MongoDB (particularly in sharded systems) has behavior like this (without a sort).

I don't think we should implicitly impose ordering on queries that have offsets that don't already have them (or implicitly order anything, really).

universalmind303 commented 8 months ago

> To be honest, I think offset without a sort isn't a particularly meaningful operation and it would be reasonable if we said "the order of results in any query without a sort is undefined."
>
> A lot of databases will return results in a consistent order because it's faster and appears consistent, but I know MongoDB (particularly in sharded systems) has behavior like this (without a sort).
>
> I don't think we should implicitly impose ordering on queries that have offsets that don't already have them (or implicitly order anything, really).

That's beside the point. We are still doing extra work by attempting this on all partitions. We should be able to push that down to the scan and avoid reading every partition except the one (or few) needed to fulfill the query.

If there is a `limit 10 offset 1000` over partitions of 1000 rows each, then we can prove that we can skip reading anything from the first partition. It'll give us correct ordering and do much less work.
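A rough sketch of that arithmetic (std Rust only, not GlareDB's planner): given the per-partition row counts, compute which partitions intersect the [offset, offset + limit) window and how many rows to take from each.

```rust
/// For each partition, return Some((rows_to_skip, rows_to_take)) if any of its
/// rows fall inside the [offset, offset + limit) window, or None if the whole
/// partition can be skipped without being read.
fn rows_to_read(partition_sizes: &[usize], limit: usize, offset: usize) -> Vec<Option<(usize, usize)>> {
    let mut start = 0usize;
    partition_sizes
        .iter()
        .map(|&len| {
            let end = start + len;
            let lo = offset.max(start);
            let hi = (offset + limit).min(end);
            let out = if lo < hi { Some((lo - start, hi - lo)) } else { None };
            start = end;
            out
        })
        .collect()
}

fn main() {
    // LIMIT 10 OFFSET 1000 over three 1000-row partitions: skip the first and
    // last partitions entirely, read only 10 rows from the start of the second.
    assert_eq!(
        rows_to_read(&[1000, 1000, 1000], 10, 1000),
        vec![None, Some((0, 10)), None]
    );
}
```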

tychoish commented 8 months ago

> It'll give us correct ordering and do much less work.

It's undefined behavior, so there is no correct behavior.

It'd probably be fine to just error for unordered offsets: writing this query represents a programmer error, and committing to preserving the order of rows in result sets without ordering constraints means you can't do parallel processing.

If we don't want to error, I think it'd be fine to ignore the offset when the results aren't ordered. It'd be more efficient, and since order isn't defined unless specified, it's just as correct.

universalmind303 commented 8 months ago

I think you are still missing the point I'm trying to make.

For reading a SINGLE FILE, you need to either parse the metadata (Parquet, Arrow, ...) or infer it (NDJSON, CSV, ...) BEFORE you start any parallel operations.

Since we already know the actual layout (Parquet, Arrow) or approximate layout (NDJSON, CSV) of the file BEFORE any parallel operations can happen, we can immediately eliminate a lot of work by skipping the chunks that don't fall inside that limit + offset slice. This is how other tools such as Polars work: they push the entire slice down to the file reader instead of to the individual chunks. The query has everything needed to prove that the limit + offset can be pushed down as a slice to the file reader. So not only do we optimize away a bunch of work processing unnecessary chunks, we also get results that are consistent with other tools (DuckDB, ClickHouse, and Polars).
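As a hypothetical sketch of that idea (none of these types are GlareDB's; they're purely illustrative), the rewrite would move the whole slice into the single-file scan node before any chunking or parallelism happens:

```rust
#[derive(Debug)]
enum Plan {
    /// A single-file scan; `slice` is an optional (offset, len) pushed into the reader.
    Scan { path: String, slice: Option<(usize, usize)> },
    /// LIMIT `limit` OFFSET `offset` over some input plan.
    Limit { input: Box<Plan>, limit: usize, offset: usize },
}

/// If a LIMIT/OFFSET sits directly on a single-file scan, push it down as a
/// slice so the reader only materializes rows [offset, offset + limit).
fn push_down_slice(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { input, limit, offset } => match *input {
            Plan::Scan { path, slice: None } => Plan::Scan { path, slice: Some((offset, limit)) },
            other => Plan::Limit { input: Box::new(other), limit, offset },
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Limit {
        input: Box::new(Plan::Scan { path: "./customer.csv".into(), slice: None }),
        limit: 1,
        offset: 1,
    };
    // The reader now knows up front that only rows [1, 2) are needed.
    println!("{:?}", push_down_slice(plan));
}
```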

Obviously this doesn't apply to multiple files, but a `limit x offset y` over a SINGLE FILE provides consistent results in most OLAP tools, and there's no reason it shouldn't for us.

tychoish commented 8 months ago

I don't think we should special-case the ordering guarantees for single-file cases. Users' reliance on implicit/undefined ordering semantics will only frustrate them if it changes unexpectedly.

If there are multiple files, can we return the results from a SINGLE file in arbitrary order? Do we have to return the results of multiple files in a specific order if no ORDER BY is specified?

What happens for globs that only match one file? If we return the result in a specific order in this case, and then the glob later matches more than one file (or the filesystem changes between queries) and users expect the ordering, it'll be confusing.

If we have multiple files, should we return each file in a consistent order? If you have to process the partitions of a single file serially, why do we partition them at all?

This is undefined behavior. There is nothing here that is more or less correct. If users need an order they should specify it.

I think it's fine (and expected) that Polars would do this differently, and given that it's a dataframe library and not a SQL system, I wouldn't be confused if it had different semantics and ordering. Because this is undefined behavior, the fact that other systems have a particular behavior is really just an artifact of their implementation details.

The optimization we're talking about here is pretty small if it's limited to single files, especially given that partitions/streams are processed relatively lazily.

By limiting the optimization to single files, I think the system gets more confusing and less predictable when people take a query that (incidentally, because of a glob or for another reason) goes from a single-file operation to a multiple-file operation. This just increases cognitive load and reduces the ergonomics of the system.


It seems reasonable, despite all of this, to push limits down into the file reader, particularly for Lance/Parquet (a limit+offset is still a limit and can be optimized as one), rather than streaming multiple partitions from a single file. But really only in the simple case; and I imagine it doesn't make a lot of sense to partition the data from a single file in the other formats, so it's a bit weird that we're doing that at all.
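For the simple case, treating `limit + offset` as a plain limit at the reader and applying the offset above it might look roughly like this (a sketch under that assumption, not GlareDB's actual reader interface):

```rust
/// Number of rows the reader actually needs to produce for LIMIT `limit` OFFSET `offset`.
fn pushed_down_limit(limit: usize, offset: usize) -> usize {
    limit + offset
}

fn main() {
    // LIMIT 1 OFFSET 1: the reader only has to produce 2 rows in total ...
    let rows_from_reader: Vec<i64> = (1..=9_i64).take(pushed_down_limit(1, 1)).collect();
    assert_eq!(rows_from_reader, vec![1, 2]);
    // ... and the offset is applied on top of that small, already-limited stream.
    let result: Vec<i64> = rows_from_reader.into_iter().skip(1).take(1).collect();
    assert_eq!(result, vec![2]);
}
```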

scsmithr commented 8 months ago

I agree with Sam here. LIMIT/OFFSET without an ORDER BY shouldn't be expected to return consistent results. DuckDB/ClickHouse return consistent results in some circumstances, but neither makes any guarantee when using LIMIT/OFFSET without sorting.

From the DuckDB docs:

> Note that while LIMIT can be used without an ORDER BY clause, the results might not be deterministic without the ORDER BY clause.

An example with DuckDB that returns inconsistent results:

sean@Seans-MacBook-Air duckdb % ./build/release/duckdb
v0.8.2-dev161 563662b093
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D PRAGMA temp_directory='/tmp/duck';
D LOAD tpch;
D SET preserve_insertion_order = false;
D CALL dbgen(sf=10);
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
D COPY customer TO './customer.csv' (FORMAT 'csv', HEADER true);
D select c_custkey from './customer.csv' limit 1 offset 100000;
┌───────────┐
│ c_custkey │
│   int64   │
├───────────┤
│    493281 │
└───────────┘
D select c_custkey from './customer.csv' limit 1 offset 100000;
┌───────────┐
│ c_custkey │
│   int64   │
├───────────┤
│    618195 │
└───────────┘
D select c_custkey from './customer.csv' limit 1 offset 100000;
┌───────────┐
│ c_custkey │
│   int64   │
├───────────┤
│    501443 │
└───────────┘