dbis-uibk / relax

RelaX - a relational algebra calculator
http://dbis-uibk.github.io/relax/
MIT License
280 stars 96 forks source link

Support relational algebra on bags/multisets #207

Closed WolfgangHG closed 3 months ago

WolfgangHG commented 4 months ago

See e.g. https://turingmachine.org/courses/2007/saved.csc370S07/lectures/04_rel-algebra2.pdf This alternative results in duplicates not being eliminated e.g. by projection. The union, intersection, and difference operators behave differently. A new delta operator is introduced that eliminates duplicates.

Would be great if you could add support for this.

rlaiola commented 4 months ago

@WolfgangHG could you kindly provide a dataset that works in RelaX to showcase the reported issue?

rlaiola commented 4 months ago

Note taken for the projection operator from Silberschatz et al., "Database System Concepts 7th Edition", Slide 2.13 (page 13): "Duplicate rows removed from result, since relations are sets"

It seems to be working accordingly.

Cheers!

Reference: https://db-book.com/slides-dir/PDF-dir/ch2.pdf

WolfgangHG commented 4 months ago

Yes, the current implementation works as expected.

But there is an alternate mode for relational algebra. In german it is called "Multimengenalgebra", and the english name seems to be "Relational Algebra on Bags" or "Relational Algebra on Multisets", which behaves differently with duplicates. E.g. if you perform a union "R union R" with the current implementation, you will receive "R" again. With the multiset concept, "R union R" will not eliminate duplicates, so the result will contain all rows from R twice.

The projection "pi" removes duplicates in current mode, but in multiset algebra, this does not happen.

To remove duplicates, a new operator "delta" is defined.

This multiset mode is the same behaviour that happens in SQL.

So this ticket is about adding this multiset mode to the tool as an alternative option.

rlaiola commented 4 months ago

@WolfgangHG I've just finished implementing it! See https://github.com/dbis-uibk/relax/pull/208

Cheers!

WolfgangHG commented 4 months ago

Amazing! If you have a working server, I could give it a try and test whether my use cases work. Actually I don't want to build it myself - I am not a NodeJS guy ;-)

rlaiola commented 4 months ago

Amazing! If you have a working server, I could give it a try and test whether my use cases work. Actually I don't want to build it myself - I am not a NodeJS guy ;-)

You can try it out at https://rlaiola.github.io/relax/calc/local/uibk/local/0

Example: project column of a relation

pi b (R)
Screen Shot 2024-05-17 at 21 24 01

results in

R.b
a
c
d
e
Screen Shot 2024-05-17 at 21 27 23

results in

R.b
a
c
d
d
e

If needed, the duplicate elimination operation (∂) can be used to remove duplicate tuples.

Screen Shot 2024-05-17 at 21 30 41

Even the definition of relations (better saying, bags/multisets) support duplicate tuples now. The implementation would benefit from more extensive testing but I believe that the core behavior for all operations is already there. Let me know if it works as it should.

For more information refer to:

Cheers!

WolfgangHG commented 4 months ago

I can confirm that all my test cases return the expected results.

It seems your pull requests don't get accepted by the repository owners, so there is probably no chance to get this feature released?

There is one problem with a right semi join where the result appears wrong - but the calculator returns exactly the same result that the alternate definition for a right semi join ("pi all columns of S (R natural join S)" ) would return.

My data:

group: My test

R = {
a, b, c
1, 2, 3
3, 5, 7
3, 5, 7
}

S = {
b, c, d
8, 8, 9
5, 7, 9
1, 2, 3
}

And the query:

R ⋊ S
Result: S.b S.c S.d
5 7 9
5 7 9

I would expect the semi join to not return more rows than S contained, as the plain text definition says "The result is the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names" - so no additional tuples should appear:

S.b S.c S.d
5 7 9

This is a problem that not even my database professor has a solution for ;-). And I did not find any alternate definition in the web for multiset/bag algebra. So, we can consider this behavior as "correct".

rlaiola commented 4 months ago

I can confirm that all my test cases return the expected results.

Nice!

It seems your pull requests don't get accepted by the repository owners, so there is probably no chance to get this feature released?

The project seems stale (last release from 3.5 years ago)... anyhow, I am sharing my contributions and making use of them in my own database courses with ~70 students/year. Who knows, they can be useful to others as well.

There is one problem with a right semi join where the result appears wrong - but the calculator returns exactly the same result that the alternate definition for a right semi join ("pi all columns of S (R natural join S)" ) would return.

My data:

group: My test

R = {
a, b, c
1, 2, 3
3, 5, 7
3, 5, 7
}

S = {
b, c, d
8, 8, 9
5, 7, 9
1, 2, 3
}

And the query:

R ⋊ S

Result:

S.b S.c S.d 5 7 9 5 7 9 I would expect the semi join to not return more rows than S contained, as the plain text definition says "The result is the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names" - so no additional tuples should appear:

S.b S.c S.d 5 7 9 This is a problem that not even my database professor has a solution for ;-). And I did not find any alternate definition in the web for multiset/bag algebra. So, we can consider this behavior as "correct".

Your argument is aligned with this paper with over 650 citations:

"The semi-join operator takes the join of two relations, R and S, and then projects back out on the domains of relation R. ~ That is, it retrieves those tuples in R that join with some tuple in S. Alternatively, one can think of semi-join as a generalization of restriction; it restricts R by values that appear in S's join domain."

Philip A. Bernstein and Dah-Ming W. Chiu. 1981. Using Semi-Joins to Solve Relational Queries. J. ACM 28, 1 (Jan. 1981), 25–40. https://doi.org/10.1145/322234.322238

Finally, I fixed the implementation and you should get the expected result now: https://rlaiola.github.io/relax/calc/local/uibk/local/0

Thanks for pointing the way, I also learned with it :-)

Cheers!

WolfgangHG commented 4 months ago

Great work!

Hopefully, we can use this tool or your copy again in next years semester here - I just have to convince my professor. I just do some practical courses, so I can't decide about it.

rlaiola commented 4 months ago

Good luck! Could you share the name of your university? Cheers

WolfgangHG commented 4 months ago

"Hochschule Rhein-Main" / "RheinMain University of Applied Sciences" in Wiesbaden, Germany

evazangerle commented 3 months ago

Thank you so much for your contributions. I have just merged it to the development branch.