AmpersandTarski / Ampersand

Build database applications faster than anyone else, and keep your data pollution free as a bonus.
http://ampersandtarski.github.io/
GNU General Public License v3.0
40 stars 8 forks source link

First step in adding calculation in Ampersand #1256

Open Michiel-s opened 2 years ago

Michiel-s commented 2 years ago

In relation to the discussion in #207, I propose to do a small experiment to see how we can add calculations to our expressions and what is needed for that.

Inspired by the SHACL constraint language for RDF (https://www.w3.org/TR/shacl/#core-components), we can define a list of calculations to support, like lessThan and lessThanOrEqual.

E.g. for lessThan we could add the operator <. We need to define for which types this is supported and the precedence of the operator. The types INTEGER, FLOAT, DATE and DATETIME.

RELATION r[A*Date]
RELATION s[A*Date]
REPRESENT Date TYPE DATE

RULE I[A] |- r < s

The meaning of the operator < would be the set of A*A for which the value of tgt of r is less than the value of tgt of s, with the restriction that r and s are univalent. So r < s is basically a calculated property for A.

The generated SQL can be a join of r and s (like with r;s) but instead of the WHERE clause stating r.tgt = s.src we need r.tgt < s.src.

What do you think @hanjoosten @sjcjoosten? Can we add this one to see how it goes?

hanjoosten commented 2 years ago

I like the idea. I know @sjcjoosten has done an experiment long time ago like this. I whould like to have the semantics of the operators involved really clear. My feeling is that this should fit with database functions. This might make us more reliant on specific databases, but I do not see that as a problem. See here for some inspiration. I guess that if we start with comparison functions, that should already help a lot. However, we must do this in a formally correct way, so I really would like to hear the opinion of @sjcjoosten on this.

sjcjoosten commented 2 years ago

I love the idea, but I don't believe we should introduce a separate operator, as < is a relation, much like I and V, though only defined on concepts that are represented as DATE, INT, and perhaps a few others. The expression r < s, if I understand what you mean correctly, would be equivalent to r;<;s~ where < is a relation. Translating your example:

RELATION r[A*Date]  [UNI] -- The 'uni' is here because this is a translation of Michiel's example
RELATION s[A*Date]  [UNI] -- The 'uni' is here because this is a translation of Michiel's example
RELATION <[Date*Date] -- note: if we build this into Ampersand, I propose that you wouldn't declare it, just like we don't declare I and V.
REPRESENT Date TYPE DATE

RULE I[A] |- r;<;s~

It would be great if the generated SQL were as succinct as you describe, as well as having some reasoning like knowing what - < is (it is the greater-or-equal relation). All quite feasible and all of this can wait until after we have a < relation. There are also some details to be worked out regarding the type checker, because I think we could decide to prevent expressions like r;<;<;s from occurring via the type system, requiring the user to write something like r;<;I[Date];<;s instead, which I believe has a different meaning (the difference is that I would like to interpret r;<;<;s as r;<;I[DATE];<;s, and I[Date]≠I[DATE]).

Michiel-s commented 2 years ago

I fully agree with @sjcjoosten by adding these kind of calculations as built-in relations.

Indeed, the functionality I meant to describe is then formulated as r;<;s~, well actually I had I /\ (r;<;s~) in mind, but the way you define < gives more flexibility.

What could be the first step? @sjcjoosten you starting with the type checker work?

If I can get some voice over on the SQL generator part I can help with coming up with the right SQL code. @hanjoosten, something we can pick up?

Based on what I can oversee right now, no work is needed on the prototype framework implementation. It's all expressions and generated SQL work.

stefjoosten commented 2 years ago

I agree with @sjcjoosten that r;<;<;s should be interpreted as r;<;I[DATE];<;s, where I[DATE] is the lub of the source and target of <. Simply because that is how the type rule of ; is (in the current system). I see no reason to change that nor to forbid r;<;<;s.

stefjoosten commented 2 years ago

I suggest to use =< for less than or equal, >= for greater than or equal, so <= and => remain free for future use.

stefjoosten commented 2 years ago

Here is a plan:

RieksJ commented 2 years ago

why use =< for less than or equal while the rest of the world uses <=? Where <= is easily seen (and remembered) as <-less than or =equal, =< would then read as 'equal or less than', which is awkward.

sjcjoosten commented 2 years ago

I agree with @sjcjoosten that r;<;<;s should be interpreted as r;<;I[DATE];<;s. I see no reason to change that nor to forbid r;<;<;s.

I see no easy way to generate SQL code for r;<;I[DATE];<;s (or what that expression would even look like in SQL).

stefjoosten commented 2 years ago

why use =< for less than or equal while the rest of the world uses <=? Where <= is easily seen (and remembered) as <-less than or =equal, =< would then read as 'equal or less than', which is awkward.

That is settled then....

stefjoosten commented 2 years ago

We have a new branch, feature/greaterThan-issue-1256.

stefjoosten commented 2 years ago

When working on this issue, @hanjoosten and I stumbled on a problem. After a good discussion, we decided to ask for help because this requires some good thinking wrt consequences for the type checker. @sjcjoosten, please help to analyze this.

What happened

We are building <= as a built-in relation. This leads to the following changes in the Haskell code of the Ampersand compiler:

  1. As user defined relations all start with a letter, built-in relations start with a non-alphanumeric character. That is sufficient to prevent an Ampersand programmer from (re-)defining built-in relations. In this issue we will only implement the relation <= on numbers.
  2. We add a constructor EBui :: Relation -> Expression, which is used in the SQL generator (a.o.) to distinguish built-in relations from user-defined relations.
  3. We add a constructor PBuiltInR :: P_BuiltInRel -> TermPrim, so the parser can allow the Ampersand programmer to write <= in every place where a relation identifier is allowed.
  4. We add a parse function pBuiltIn to parse the built-in relation <=.
  5. In P2A_converters, where the transformation from parse tree to typed A-structure is made. The compiler collects all relevant relation declarations in a declaration table called declMap :: Map Text (Map SignOrd Expression). The type checker uses that declaration table to disambiguate terms.
  6. The idea is to extend declMap with the information of the built-in relations, so we can leave the type checker alone.
  7. If <= is to be a built-in relation, we get a table of built-in relations. This table would look something like this:
      builtins :: [Relation] 
      builtins = [Relation
                    { decnm   = T.pack "<="
                    , decsgn  = sgn
                    , decprps = Set.empty
                    , decDefaults = Set.empty
                    , decprL  = ""
                    , decprM  = ""
                    , decprR  = ""
                    , decMean = []
                    , decfpos = Origin (“built-in relation ( <= "<>T.pack (show sgn)<>")")
                    , decusr  = False
                    , decpat  = Nothing
                    , dechash = hash sgn
                    }]
                    where sgn = Sign c c
                          c = PlainConcept ("Num" NE.:| [])
  8. Now we have a problem: In the current implementation, we cannot define the signature, sgn, of the built-in relation properly. (I wrote “Num”, but that type doesn’t exist.). This phenomenon requires some discussion.

Some ideas

Han and I saw many different ways to tackle this problem. Here are a few, but perhaps there are more:

  1. Define builtins :: [Protorelation] instead of builtins :: [Relation], where the Protorelation contains a variable instead of the fully defined signature. The type checker can fill the variable along the way and generate an error message if it fails to do so. We were wondering how much impact this will have on the type checker. @sjcjoosten please comment.
  2. Introduce three built-in primitive concepts, NUM, STRING, and DATE, that correspond to our technical types (and to the ISO standard for MySQL and Excel). Then define <=[NUM,NUM], <=[STRING,STRING], and <=[DATE,DATE] as predefined relations in the declaration table declMap and let the type checker sort things out. Of course, we want to allow the user to add a type annotation to <= to disambiguate where needed. This should then work for all concepts that are a specialization of the three primitive types, NUM, STRING, and DATE. This approach seems to agree with earlier ideas to incorporate technical types into the type system. However, this also affects the type checker, so @sjcjoosten please comment.
  3. Implement <= as an operator in the compiler, instead of a primitive term. In that case, I want to call it ";<=;" because I don't want to explain to students that they should read r<=s as r;<=;s. @hanjoosten is now working on this thread in a separate branch, if only to see if we stumble on other difficulties. This seems to be at odds with the original idea of <= as a relation. An obvious impact on the type checker is that we introduce new code that bears similarity with the type checker code for the composition operator, ;. Would that be doable without any programming effort of @sjcjoosten?

Please feel free to offer other ideas...

sjcjoosten commented 2 years ago

We should go with option 2 above, provided that the user cannot write NUM, STRING, etc directly, but only via 'representations'. The built-in primitive concepts are already there, and even already present in the type-checker. See: https://github.com/AmpersandTarski/Ampersand/blob/main/src/Ampersand/ADL1/P2A_Converters.hs#L485

Of course, after type-checking, the built-in primitive types like NUM and STRING are removed: they need to be specialized to whatever the user defined, and that's why A_Concept does not have a representation for them. If they cannot be removed, a fatal is thrown, see: https://github.com/AmpersandTarski/Ampersand/blob/main/src/Ampersand/ADL1/P2A_Converters.hs#L50

Some reworking of the Ampersand datatypes is unavoidable [edit: I have some ideas on how to minimize the need for this, which I lay out below. I removed the thoughts that no longer apply.]

stefjoosten commented 2 years ago

@sjcjoosten , What is the problem with alternative 3? Which obstacles will we encounter if we walk that path?

stefjoosten commented 2 years ago

@sjcjoosten said:

Of course, after type-checking, the built-in primitive types like NUM and STRING are removed

If <=[NUM,NUM], <=[STRING,STRING], and <=[DATE,DATE] are built-in relations, then we might argue that NUM, STRING, and DATE are built-in concepts as well. So why remove them? Being built-in means that the user can use them as any other concept. The difference is just that they are always there (as some other things are). In the generated documentation, we might hide built-in relations and concepts if the script does not refer to them. That would keep such a generated document more readable.

sjcjoosten commented 2 years ago

In accordance with 2., I suggest that <= can work very similarly to I, in that one can alway read/interpret it to be a <=[NUM \/ STRING \/ DATE, NUM \/ STRING \/ DATE] relation, but the type-checker will ensure that it can fill in a user type (some A_Concept). Note that this user-type might not be Endo: <=[Start*End] might occur in a rule that expresses that end-times should follow start-type.

As a consequence, a rule like r;<= = s;<= would not be allowed by the type-checker: the target of r;<= is NUM \/ STRING \/ DATE, which is not a user type. However, writing r;<=;I[Birthday] = s;<=;I[Birthday] would be fine.

Another consequence of this would be that the type-checker can fill in the sign of the <= relation as a normal user type.

stefjoosten commented 2 years ago

Idea nr. 3 has a problem. If we let the Ampersand programmer believe that <= is a relation, then she will sooner or later use expressions such as <= - I[A]. And we would surely want to allow such things because the semantics are perfectly clear. Under idea nr. 3, in which <= is an operator, such expressions are syntactically incorrect.

On the other hand, if we let the Ampersand programmer believe that <= is an operator, it is hard to explain why r<=s must be read as r;<=;s. After all, that notation suggests strongly that <= is a relation, yet it isn't. Confusion lures...

(It is a pity that idea nr. 3 has this problem because it seems quite implementable. But ease of implementation should not sacrifice a correct implementation of the idea. Fortunately, this idea lives in a separate branch.)

Idea's nr. 1 and 2 do not suffer from this problem, because <= is a (built-in) relation.

stefjoosten commented 2 years ago

@sjcjoosten has convinced me (in an oral discussion) that idea 2 is the way to go.

hanjoosten commented 2 years ago

I don't see the issue for idea 3 yet. <= should be regarded as the relation denoted by I[cpt] ; <= ; I[cpt], where cpt is a type variable for a concept. That concept is restricted by concepts having an ordering. This means that when a user would write the expression <= - I[A], it would be regarded as the expression (I[cpt] ; <= ; I[cpt]) - I[A], where cpt would become A. A perfectly sane way to specify a relation with infinate tuples in it. Of course, that may lead to insane behaviour. I will call @sjcjoosten soon to discuss this topic with him, to avoid too much noise in this thread.

stefjoosten commented 2 years ago

I'm now adding predefined concepts and classifications to the language. My test script is:

CONTEXT LessThanOrEqual

RELATION r[INTEGER*INTEGER]

RULE r;<=;r

ENDCONTEXT

I have added the following representations and specializations to the language (for readability in quasi-Ampersand syntax):

CLASSIFY DATE ISA NUM
CLASSIFY INTEGER ISA NUM

RELATION "<=" [NUM*NUM]

REPRESENT "ALPHANUMERIC" TYPE "ALPHANUMERIC"
REPRESENT "BIGALPHANUMERIC" TYPE "BIGALPHANUMERIC"
REPRESENT "HUGEALPHANUMERIC" TYPE "HUGEALPHANUMERIC"
REPRESENT "PASSWORD" TYPE "PASSWORD"
REPRESENT "BINARY" TYPE "BINARY"
REPRESENT "BIGBINARY" TYPE "BIGBINARY"
REPRESENT "HUGEBINARY" TYPE "HUGEBINARY"
REPRESENT "DATE" TYPE "DATE"
REPRESENT "DATETIME" TYPE "DATETIME"
REPRESENT "BOOLEAN" TYPE "BOOLEAN"
REPRESENT "INTEGER" TYPE "INTEGER"
REPRESENT "FLOAT" TYPE "FLOAT"
REPRESENT "OBJECT" TYPE "OBJECT"
REPRESENT "TYPEOFONE" TYPE "TYPEOFONE"

This yields:

sjo00577@BA92-VYF9TXMD9G Ampersand % ampersand check testing/issue1256.adl
/Users/sjo00577/git/Ampersand/testing/issue1256.adl:6:10 error:
  Ambiguous type when matching: Tgt of <= [NUM*NUM]
   and Src of r [A*A].
    The type can be A or DATE/INTEGER/NUM
    None of these concepts is known to be the smallest, you may want to add an order between them.
ExitFailure 10
stefjoosten commented 2 years ago

@sjcjoosten, can you take a look at these code snippets? I don't trust pClassify2aClassify. Can you remember the intended semantics of PClassify? I recall that the meaning of PClassify _ s gs is $atoms(s)\subseteq\bigcup(map atoms gs)$, but am I wrong? It does not seem to correspond with the following code. That code seems to have worked throughout the years, but it hardly sheds any light on the intended semantics...

      data PClassify = PClassify
        { pos :: Origin,
          -- | Left hand side concept expression
          specific :: P_Concept,
          -- | Right hand side concept expression
          generics :: NE.NonEmpty P_Concept
        }

      pClassify2aClassify :: ConceptMap -> PClassify -> Maybe AClassify
      pClassify2aClassify fun pg =
        case NE.tail (generics pg) of
          [] -> case filter (/= specCpt) [pCpt2aCpt fun . NE.head $ generics pg] of
            [] -> Nothing
            h : _ ->
              Just
                Isa
                  { genpos = origin pg,
                    gengen = h,
                    genspc = specCpt
                  }
          _ -> case NE.filter (/= specCpt) . fmap (pCpt2aCpt fun) $ generics pg of
            [] -> Nothing
            h : tl ->
              Just
                IsE
                  { genpos = origin pg,
                    genrhs = h NE.:| tl,
                    genspc = specCpt
                  }
        where
          specCpt = pCpt2aCpt fun $ specific pg

Specifically, is it true that CLASSIFY S ISA G translates (roughly, ignoring the NE bit) to PClassify o S {G, S} and CLASSIFY S IS G/\K/\L translates (roughly) to PClassify o S {G, K, L}?

sjcjoosten commented 2 years ago

I recall that the meaning of PClassify _ s gs is $atoms(s)\subseteq\bigcup(map atoms gs)$, but am I wrong?

I am confused, I would say that PClassify _ s gs means $atoms(s) = \bigcap(map atoms gs)$ and since CLASSIFY S ISA G translates to S = S /\ G, so that would be PClassify o S [G,S]. The names generic and specific seem to suggest that there is a subset involved, but I believe in the equality.

I am also a bit suspicious about pClassify2aClassify, as I read this code it seems like Isa is never produced, only IsE is, but you claim this code works?

stefjoosten commented 2 years ago

I have slept over it. You are right. We defined the meaning of PClassify _ s gs to be $atoms(s)=\bigcap(map atoms gs)$. I will redo the code because it is bound to be wrong. I think we just never saw the effects of it being wrong.

So, CLASSIFY Employee ISA Person translates to PClassify orig (PCpt "Person") {PCpt "Person") (PCpt "Employee"} (in the P-structure) and that translates to Isa orig (C "Person") (C "Employee") (in the A-structure). Similarly, CLASSIFY Eagle IS Bird/\Predator translates to PClassify orig (PCpt "Eagle") {PCpt "Bird",PCpt "Predator"} and that translates to IsE orig (C "Eagle") [C "Bird",C "Predator"].