Open haikyuu opened 4 years ago
We can also access the current index by using
_index
and the original set using_source
Generally, I'd avoid introducing magic variables. If you need the original set, alias it, and index can be obtained with enumerate
, e.g:
WITH words := { "hello", "world" }
FOR sentence = "", word IN enumerate(words)
REDUCE (
SELECT (
(<str>(SELECT count(words)) ++ " ") IF word.0 = 1 ELSE "")
++ sentence ++ " " ++ word.1 ++ " "
)
With that, I think there are two orthogonal problems here: recurrence relations and aggregation.
Recurrence relations are set-returning functions where each subsequent element is defined as a function of preceding elements. The Fibonacci sequence is a classic example of this. These are represented in SQL as recursive CTEs, but we can do better in EdgeQL and actually define them as recursive functions (possibly allowing ad-hoc definitions of lambdas in a query).
Custom aggregation. The standard set of aggregate functions actually covers most common use cases. For example your sentence generation example can already be accomplished by a combination of array_agg
and array_join
. For cases where custom aggregation is truly needed, I think PostgreSQL's CREATE AGGREGATE would be a good starting point.
Recurrence relations are set-returning functions where each subsequent element is defined as a function of preceding elements. The Fibonacci sequence is a classic example of this. These are represented in SQL as recursive CTEs, but we can do better in EdgeQL and actually define them as recursive functions (possibly allowing ad-hoc definitions of lambdas in a query).
And when we have lambda functions https://github.com/edgedb/edgedb/issues/301 we might use them as a mechanism/syntax to write recursive queries.
For cases where custom aggregation is truly needed Here is my use case: Intersection
type Library{ multi link books -> Book; }
I want to query books of multiple libraries:
getBooks($libraryIds)
.
I want to query books of multiple libraries
Here's one way of doing it:
SELECT Book
FILTER count(.<books[IS Library) > 1
To select all books from a set of libraries given a set of library ids:
SELECT Book
FILTER .<books[IS Library].id IN array_unpack($libraryIds)
The idea here is that when you select a set of Book
objects like that, it is a proper set already, there would be no duplicates in it.
Thanks for the answer. I will describe my use case more clearly, i think it's not clear yet.
I run a complex query to calculate a set of available TimeSpot
s for a Class
. And i created a class for that function SELECT getClassAvailability(<uuid>$classId, <int64>$maxIn)
. This function return a SET of TimeSpot
s.
Now i want to get the availability of multiple Class
es. It's actually the TimeSpot
s that are shared between Class
es. That is their Intersection.
Using the REDUCE
operator, this would look like this:
FOR availabilities = {}, classId in array_unpack($classIds)
REDUCE (
WITH classAvailabilities = getClassAvailability(classId, $maxIn)
SELECT (
(classAvailabilities FILTER classAvailabilities IN availabilities)
IF count(availabilities) > 0
ELSE classAvailabilities
)
As far as i understand, this is not a recurrence relation since each element depends on the whole set and not only the previous element.
WITH classAvailabilities = getClassAvailability(classId, $maxIn)
SELECT (
(classAvailabilities FILTER classAvailabilities IN availabilities)
IF count(availabilities) > 0
ELSE classAvailabilities
)
REDUCE
. Or should we specify it?It's also possible to implement REDUCE
before implementing custom aggregate functions: similarly to FOR
statement. That would give us a starting point, feedback on the syntax (i personally think Postgres custom aggregate functions syntax is rather verbose, maybe lambda functions would help?) and performance.
Interesting. We do have a plan to implement a more general partition-and-reduce statement: https://github.com/edgedb/edgedb/issues/104#issuecomment-344307260. It should be possible to express what you're looking for using that syntax:
WITH classId := array_unpack($classIds)
GROUP
slotTable := (classId, getClassAvailability(classId, $maxIn))
USING
slotId := slotTable.1
BY
slotId
INTO
partitionedSlotTable
UNION
slotSet := (
slotId := slotId,
classAvailabilityCount := count(partitionedSlotTable.0)
)
FILTER
# make sure the slot appears in the set of slots for every class,
# i.e. an intersection of availability sets.
slotSet.classAvailabilityCount = len($classIds)
This syntax is admittedly a tad verbose for this particular purpose due to it being very generic (we will likely iterate on it before implementing and/or add some syntax sugar around it).
Basically. because EdgeQL queries have to map 1:1 to SQL queries we are constrained by the type of semantics we can implement efficiently. Specifically, there's no real notion of iteration state in SQL, aside from the aforementioned recursive CTEs and aggregate functions. Another quirk is that those SQL mechanisms work on scalar expressions, not sets, which means your accumulator state must be scalar also.
EdgeQL queries have to map 1:1 to SQL queries
Interesting, i didn't know that. Why is this? How do you make sure it's performant?
This syntax is admittedly a tad verbose for this particular purpose
Yes, it is. I spent a dozen of minutes trying to understand it ... but i couldn't. Also couldn't find it in the docs.
USING slotId := slotTable.1
Please note that getClassAvailability
returns a set of spots. Does that make my use case not feasible using Group and reduce
?
Another quirk is that those SQL mechanisms work on scalar expressions, not sets
Oh! So even if there were a way to define custom aggregate functions, it wouldn't work for my use case? 🤔
That leaves me with EdgePL
:
EdgePL
when it's out.FOR ... REDUCE
operator using EdgePL
or a C defined function
in postgres (for performance) under the hoodInteresting, i didn't know that. Why is this?
Preserving a 1:1 query correspondence has a number of important advantages:
Performance. Running multiple SQL queries and reconstructing the result on the client is almost always worse than running an equivalent single query in Postgres. A good example of this are the benchmarks we published for Alpha 1.
Atomicity. If EdgeDB generated multiple SQL queries for an EdgeQL query, those queries would have to be put in an explicit transaction block to be self-consistent (especially important for mutation statements). This transaction would have to span multiple server roundtrips and can potentially be left open for a prolonged amount of time if an EdgeDB thread or task is preempted for a while, and when you have lots of transactions like this you increase the risk of lock contention, serializability conflicts etc.
Simplicity. 1:1 mapping is far easier to manage and reason about on the protocol level as well as observability.
How do you make sure it's performant?
Postgres is actually really good at handling complex queries. It's not the size of the query text that matters, it's the inherent runtime complexity of the operations in it.
Yes, it is. I spent a dozen of minutes trying to understand it ... but i couldn't. Also couldn't find it in the docs.
That's because it isn't implemented yet (aside from being able to parse the syntax).
Please note that
getClassAvailability
returns a set of spots. Does that make my use case not feasible using Group and reduce?
The hypothetical query I posted accounts for that. Remember that in EdgeQL everything is a function over a Cartesian product of the arguments, including the tuple constructor, so
slotTable := (classId, getClassAvailability(classId, $maxIn))
is a set of (classId, timeSlot)
tuples. We then partition the set by the slot and count the number of classIds (that would be SELECT count(classId) GROUP BY slot
in SQL). We then check that the count of classId
equals the number of elements in $classIds
to only return the slots available for all classes requested (i.e. the intersection of slot sets).
Thank you for your answers @elprans
This partition and reduce statement is really powerful.
we will likely iterate on it before implementing and/or add some syntax sugar around it
I think it's the closest thing to my original idea. Very interesting
Taken from Spectrum.
This is how i first imagined the syntax:
Note that this doesn't allow defining arbitrary aggregate functions, but rather executing them, just like
FOR
.Example 1
This would give a result of
hello world
.We can also access the current index by using
_index
and the original set using_source
Result:
2 hello world