agentm / project-m36

Project: M36 Relational Algebra Engine
The Unlicense
876 stars 47 forks source link

use Arbitrary instances to generate random relations #173

Closed agentm closed 6 years ago

agentm commented 6 years ago

It would be handy for users to be able to create their own dud relations and save them to relvars. For example:

suppliers:=relation{sname Text, s# Text}{arbitrary}

could generate an arbitrary number of arbitrary tuples.

This has to be done carefully so as not to introduce side effects. The random seed may need to be saved to the database context.

This is separate from the ticket to test using quickcheck and Arbitrary.

YuMingLiao commented 6 years ago

hi @agentm ,

I really love this TutorialD and Date's teaching about Relational Database. It feels like it is the right way to build a database. The expression is clear and expressive. Thanks for this great work!

When I am trying TutorialD, I have an idea about adding a feature of sortBy and compare on attributes. I guess when people want to use projectM36 as an excel-like spreadsheet, this feature will be needed. (I don't know if it violates some "It's a Set" principle of relational algebra or the original purpose of this project. Or maybe it should be separated from TutorialD.)

Thus, I am interested in this arbitrary feature, too. I have managed to add some arbitrary function. It can do this now in stack ghci:

rel <- generate (arbitrary :: Relation)
DisplayOpResult . DisplayRelationResult $ rel 

then it will show up a relation with predefined attributes and an arbitrary tupleSet based on those attributes. (ConstructedAtomType not yet, haven't figure that out) (not considered side effect yet)

And it can parse suppliers:=relation{sname Text, s# Text}{arbitrary} now. But I don't know how to integrate it into existing types. Should it be MakeRelationFromArbitraryTuples attris? It seems then all the RelationExpr type pattern matching in StaticOptimizers.hs need to be changed. Should it be data TupleExpr = TupleExprBase () | Arbitrary? Then Binary instance seems need to be added. I haven't figure out all the associated changes from parsed expression to eval when a new Constructor is introduced.

Any advice on the type choice?

agentm commented 6 years ago

Hi @YuMingLiao, thanks for the detailed writeup.

If you take a look at Relation.hs, note that I intentionally added a random order when converting Relations to Lists, for example, for display. This should help to emphasize to library users that no specific order should be expected from Project:M36 Relations. This feature will become more pronounced after I add a new parallelized tuple processor. For example, restriction can be executed in an embarassingly parallel fashion. I am implementing this using the streamly package. Furthermore, there is no reason that Relations could be functionally generated and thus be able to "infinite" tuples, which would obviously be unsortable.

I mention these intentions because, as you alluded, Relations should not have any logical tuple order. If they did, then a lot of interesting set-based optimizations could fly out the window. Regardless, it's obvious that users will eventually expect that a DBMS sorts their output (plus LIMIT and OFFSET). To that end, I have been considering a new type and processing engine for converting Relations to DataFrames akin to pandas or R DataFrames which do have an ordering. I have identified some interesting optimizations for converting between these types, though nothing earth-shattering.

Converting to a DataFrame (other name suggestions are welcome) would emphasize the finality of that processing to the user- further relation algebra operators cannot be applied, though the DataFrame could potentially be converted back to a Relation. Normally, this conversion step would be final step in a data retrieval pipeline. Relying on a sort order within a relational expression will never be supported.

So, back to Arbitrary: I think there are two use cases:

It sounds like your feature targets the external tests, which is fine since that case is more interesting and useful anyway.

The syntax rv: = relation{...}{arbitrary} is interesting, but I have mixed feelings regarding whether it makes sense for a relational expression to have any sort of non-pure functionality. Thus far, I have intentionally avoided any user-visible random events in relational expressions, though I see the value of allowing users to test relational expressions with random tuple data.

Let's discuss an alternative. :createarbitraryrelation rvname {a int, b text} {a 3, b "test text"} 30-50

The above means "assign an arbitrary relation to rvname with fixed attributes a,b guaranteed to contain {3, "test text"} (perhaps for testing purposes) and a random tuple count between 30 and 50. This way, the randomness is extricated from the relational expression but the relvar would still be usable within relational expression (after creation). That way, we can keep any weird syntax out of relational expressions and perhaps add it to DatabaseContextIOExpr (for a fresh random seed).

The reason this distinction is important is because it has implications in logical replication. Ideally, the previous state + DatabaseExpr should be sufficient for two servers to remain in sync. This feature dies if a relational expression can generate random results.

I also thought about whether it would be useful to have arbitrary attributes, but I haven't thought of a use-case. Do you have something in mind for random attributes?

Do you have a branch I could examine to try out your new feature?

YuMingLiao commented 6 years ago

Hi, @agentm, thanks for the detailed explanation.

Now I know more clearly why sorting and relations shouldn't be mixed together.

And a DataFrame conversion is a great idea for separating relational algebra operation and DBMS operation. (I don't know about panda, but i guess keeping the same name is a great start.)

And No, I haven't thought of a use case of arbitrary attributes either. I am just looking for small issue/feature that I can make a contribution so I can practice Haskell.

(I was just led by a database question in my mind and found Chris Date's relational algebra video and this project. I found it's amazing and want to explore what it can do. )

createarbitraryrelation is great idea too, it doesn't mess up syntax. Let me try to add this first and then make it a branch.

YuMingLiao commented 6 years ago

During coding, I found this :createarbitraryrelation rvname {a int, b text} {a 3, b "test text"} 30-50 is simply an ASSIGN operation plus adding random tuples. Maybe I can just implement an operation that adds random tuples to a relvar. Or Assign plus AssignRandom plus Union to achieve this. (Does ProjectM36 has macro or script utility for this?)

agentm commented 6 years ago

Well, Assign is a relational operator which we want to retain as idempotent/pure, so I view createarbitraryrelation as a completely separate DatabaseContextIOOperator (which would not be used in replication). If the user wishes to add the random tuples to an existing relation, he could easily add a subsequent Insert. The goal here would be to separate the pure and impure operators.

YuMingLiao commented 6 years ago

hi @agentm I am glad that I have something to present now.

The syntax is createarbitraryrelation rvname {a int, b text} 30-50 I didn't add tupleExpr. I guess user can use insert to add what tuples they need later.

The result feature is like the following:

TutorialD (master/main): createarbitraryrelation a {number Integer, name Text, bool Bool} 1-2 TutorialD (master/main): :showexpr a ┌──────────┬──────────┬───────────────┐ │bool::Bool │name::Text │number::Integer │ ├──────────┼──────────┼───────────────┤ │False │"Sunny" │26 │ └──────────┴──────────┴───────────────┘ TutorialD (master/main): insert a relation{tuple{number 3, name "text", bool t}} TutorialD (master/main): :showexpr a ┌──────────┬──────────┬───────────────┐ │bool::Bool │name::Text │number::Integer │ ├──────────┼──────────┼───────────────┤ │False │"Sunny" │26 │ │True │"text" │3 │ └──────────┴──────────┴───────────────┘

Right now, arbitrary :: Gen TextAtom just generate some predefined names. and arbitrary :: Gen AtomType only support some simple AtomTypes now.

my branch is: https://github.com/YuMingLiao/project-m36/tree/feature/arbitrary_tupleset

agentm commented 6 years ago

Merged in 4f8653d54f0816f8d1bfb26ed9180afb5d4605cb.