edgedb / rfcs

RFCs for major changes to EdgeDB
Apache License 2.0
35 stars 5 forks source link

Pre-RFC: Range types #55

Closed vpetrovykh closed 1 year ago

vpetrovykh commented 2 years ago

Range types

Range types can exist as a sort of a scalar value, akin to arrays. Although they represent a set of values, but the way we manipulate these values is different from a pure EdgeQL set. It's a special case that is a little more structured.

Other programming languages seem to mostly use range types to loop over indexes. This biases the way ranges are defined. In EdgeDB range types are expected to be used in constraints or even as values unto themselves.

Base types

It seems like we need the following abstract types:

Boundaries

A range value should have an upper and lower bound and an indicator for each of them whether they are included in the values the range represents.

It should be possible to omit one or both of the boundaries to represent an open-ended interval. The {} seems like a reasonable representation of such an omitted boundary. It should be an error to omit the boundary while indicating that it is inclusive.

Constructors

One option is to have a special literal to construct a range. A particularly neat format can draw inspiration from the way ranges are indicated in math by using round and square brackets to indicate boundary inclusivity. E.g. [5..10) or (..-10]. The upside is that this format is familiar from math. The downside is that this messes with any parentheses matching processing (auto-complete, highlighting, etc.)

Another option is to have matching parentheses and use an extra symbol to indicate inclusivity. E.g. (=5..10) or (..=-10). It seems that exclusive boundary is the more natural default for this approach, largely because it works naturally with omitted boundaries. The upside of this format is that it plays nice with parentheses matching. The downside is that it's hard to search for and it's not necessarily intuitive when seeing it.

Finally, we can have a function constructor. To make things succinct the signature should be:

function range(
    lower: optional anypoint = {}, upper: optional anypoint = {},
    inclow: bool = true, inchigh: bool = false
) -> range<anypoint>

Basically, we want to be able to use positional arguments for brevity. We also want the common case of using ranges for some kind of iteration to only need the boundaries and have the inclusivity match how we use indexes in slicing.

The constructor function is compatible with also having a dedicated literal, so it can be implemented first.

Accessing values

We can have functions to access the 4 different components of a range value. E.g. get_range_component(range_val, 'lower') or 4 dedicated functions get_lower_bound(range_val).

Alternatively, we can have . accessors directly on the range values. E.g. range_val.upper or range_val.includes_lower.

We can have [] accessors directly on the range values. E.g. range_val['upper'] or range_val['includes_lower'].

Perhaps instead we can have a converter function that takes a range and produces a named tuple (need to deal with {} boundaries) or it can produce json.

Multi-ranges

Performing operations with ranges should include ways of combining ranges into ranges with gaps. Basically, it's like having a set of ranges, but a bit more specialized than basic sets for the same reasons that ranges need to be specialized. We call these multi-ranges.

Base type

It seems that the type for multi-range would be something like: multirange<sometype> where sometype is a concrete type extending anypoint.

Constructor

A multi-range could be constructed using a function that takes a set of basic range values:

function multirange(ranges: set of range<anypoint>) -> multirange<anypoint>

It is probably important to normalize the resulting multi-ranges to always be composed of non-overlapping ranges regardless of whether the set of input ranges overlaps.

Interoperability

Many of the range operations should also work on combination of range and multi-range as well as multi-ranges. It is desirable to have an implicit cast from range to multirange and probably an assignment cast for multirange to range.

Accessing sub-ranges

It may be useful to organize individual sub-ranges in a multi-range like an array. This way we can access them by indexing and even use len to determine how many sub-range pieces there are in a multi-range. If we break down every multi-range into non-overlapping sub-ranges then we can always order them in a consistent way by their boundaries and use that ordering for index values.

Alternatively we can have a function get_subranges that produces an array of range components of a multi-range ordered by a choice of several properties like "boundary" and "range size". This would be useful if more than one way of ordering sub-ranges is useful in practice.

Usages

Both ranges and multi-ranges can be used to determine constraints.

Both ranges and multi-ranges can be used as properties directly and potentially those properties can be indexed to speed up filters, etc.

Both ranges and multi-ranges can be expanded as a set of values to be potentially iterated over:

function range_unpack(val: range<anydiscrete>) -> set of anydiscrete ...
function range_unpack(val: multirange<anydiscrete>) -> set of anydiscrete ...
function range_unpack(
    val: range<anycontiguous>, step: anycontiguous
) -> set of anydiscrete ...
function range_unpack(
    val: multirange<anycontiguous>, step: anycontiguous
) -> set of anycontiguous ...
tailhook commented 2 years ago

A bunch of questions in no particular order:

  1. Are multi-ranges needed? I think set of range<x> is fine. We can have range_canonicalize function to sort and unionize sub-ranges. What are very common use cases for multirange?
  2. Rust has .. and ..= operators. No parenthesis for ranges. Seems to work fine, why we need parenthesis?
  3. How range_unpack will behave on unbounded ranges? Does it crash? Does it crash only when lower bound is unbounded (since it should be possible to produce unbounded sequence in SQL, right?)
  4. I don't remember any language which has exclusive lower bound. So we can probably have it in the type system, but I don't think we need a special syntax for that. I mean 0..10 is much better than =0..10 or =0..=9.
  5. Rust has ranges for any orderable type (which are all of them in EdgeDB, right?). So in our terms it means that .name between "aaa".."aac" works nicely as well as .name between <range<str>>$0 (although between isn't great operator for that)
  6. Actually character ranges and arbitrary object ranges (and Btree search using ranges) is, I believe, the most important use case for the "inclusive ranges" in Rust. I.e. for values that don't have obvious "subsequent value".
  7. Does constructor check that lower_bound <= upper_bound or the violation of the rule equivalent to an empty range?
elprans commented 2 years ago

A bunch of questions in no particular order:

  1. Are multi-ranges needed? I think set of range<x> is fine. We can have range_canonicalize function to sort and unionize sub-ranges. What are very common use cases for multirange?

Multirange is a non-contiguous interval. A very simple use case: hours of operation of a shop. We want a scalar type and not a set of range<x> because we want to be able to use Postgres' multirange type. Important for performance.

  1. Rust has .. and ..= operators. No parenthesis for ranges. Seems to work fine, why we need parenthesis?

I think we decided to postpone the decision on range literals and only add function constructors initially.

  1. How range_unpack will behave on unbounded ranges? Does it crash? Does it crash only when lower bound is unbounded (since it should be possible to produce unbounded sequence in SQL, right?)

Unbounded range_unpack will raise an error.

  1. Rust has ranges for any orderable type (which are all of them in EdgeDB, right?). So in our terms it means that .name between "aaa".."aac" works nicely as well as .name between <range<str>>$0 (although between isn't great operator for that)

No, in EdgeDB we'll have an anypoint type which represents data types that support vectors, which is numerics and dates.

tailhook commented 2 years ago
  1. Rust has ranges for any orderable type (which are all of them in EdgeDB, right?). So in our terms it means that .name between "aaa".."aac" works nicely as well as .name between <range<str>>$0 (although between isn't great operator for that)

No, in EdgeDB we'll have an anypoint type which represents data types that support vectors, which is numerics and dates.

Sorry, I don't understand this answer. Will EdgeDB support range<anypoint>? I'm not sure range<vector> makes sense, but range<str> or range<tuple<str, int32>> can easily be used for various queries, like pagination.