Allow implicit definition of relations by .xlsx files

hanjoosten commented 3 years ago

There is a case to be made to define a relation implicitly, by means of the population statement: Either in an .adl file, or in the .xlsx file.

This issue is intended to discuss about this, and to see if we can agree on whatever outcome.

For concepts, we already have implicit definition. If a concept name occurs in whatever statement, Ampersand acts as if it was defined. For relations, if a relation is used anywhere, and it isn't explicitly defined, an error might be thrown. This depends if there is another relation defined with the same name and the source and target of that relation could match because of CLASSIFY statements.
There are several cases what could happen:

There is no relation with that name: A plain error message stating that the relation isn't defined.
There are relation(s) with the same name, but not exactly one of them match based on source and target. Then you get an error message listing the relations with that name, suggesting you should be more specific by using one of those relations. This was the case in issue #1127.

In RULE statements, this behaviour is fine, but in a population statement, especially if it comes from within an .xlsx file, this is hard to the user. In such a case, it could be very convenient to implicitly define that relation.

Opinions please, @stefjoosten , @sjcjoosten

sjcjoosten commented 3 years ago

This touches the type-checker, which means there are some technical limitations that separate the amount of work that needs to be done if we want this feature. Relations always have a source and target type, but typically we are given a bit more freedom about that type in places where a relation is used. If r[A*B] is declared, but r is used where the source type is C and the target type is D, then we often allow this use if C ISA A and D ISA B (the exact criterion varies). However, we determine what relation is used before we know whether C ISA A and D ISA B. That is: disambiguation is largely without knowledge on the way the concepts relate. This is an often debated but in my opinion desired feature: if the ISA-structure determines which relation is used, then there is the risk that changing an ISA statement changes the semantics of a rule without an error message.

As a consequence of the above, if we wish to allow implicit definitions, we need to require an exact match of types in those places. As an example: suppose users can be students or employees, all users email addresses, and there are two spreadsheets, one for students and one for employees. In the student excel, each row corresponds to an atom of type Student, and there is a email relation of concept 'Email', which would get the type information [Student*Email]. The other sheet also has an email field, but now of type information [Employee*Email]. If things were modeled this way, with relations implicit, two relations 'email' would be created.

A way to perhaps fix this in a workable way, is to allow full type specifications and only implicitly define a relation if there is such a full type specification. We can then write 'email' in both excel files and declare it in some ADL file, or we can write email[User*Email] in both excel files and have this relation be declared implicitly (if it does not yet exist).

Note: I've never used the excel-import feature myself and I'm not too familiar with what is currently allowed or not allowed, nor with what is a reasonable scenario vs a fringe use-case. I tried to come up with a reasonable scenario.

RieksJ commented 3 years ago

This reminds me of another issue that I searched for but could not find, in which we discussed whether or not in cases such as I /\ sessionStatus;"user is authorized";sessionStatus~ an atom with value user is authorized should automatically be added to the (initial) population. It was decided not to do that. Since I cannot find the actual issue, I also cannot say which arguments were used for that, but is seems to me that they may be relevant for this discussion as well.

Having said that, I think that automatically creating a new relation in an Excel sheet only defers the problem, because you will still have rules or interfaces in which you use another relation name for what you believe to have populated in the excel sheet, or vice versa. I am quite content with the Excel importer telling me I made a typo rather than attempting to 'help' me by creating a relation that I did not want.

However, prototypes have a second way of importing excel files (see e.g. #397), which uses interfaces: if you provide the name of the interface as the name of a tab in excel, and then use the labels of the interface as column headers, then you end up having an excel file that does not depend on how you name the relations, because all that is in the interface specification.

sjcjoosten commented 3 years ago

This reminds me of another issue that I searched for but could not find, in which we discussed whether or not in cases such as I /\ sessionStatus;"user is authorized";sessionStatus~ an atom with value user is authorized should automatically be added to the (initial) population.

One argument used for that is that it automatically adding atoms made it impossible to express that Sinterklaas does not exist (say, is not a person). Searching Sinterklaas helped me find the issue: https://github.com/AmpersandTarski/Ampersand/issues/166

stefjoosten commented 3 years ago

@hanjoosten ~~Consider it done!~~ You may use the RELATION-statement-with-population as a substitute for the POPULATION-statement that implies a relation. The following statement does exactly what you describe:

RELATION access [Trough*Mammal]
 = [("1", "aap")]

That works in an ADL-file. For the xlsx-file, I recall having built that relations are generated alongside the population. Precisely this purpose, I might add.

RieksJ commented 3 years ago

@stefjoosten: are you referring to the Haskell excel import, or the prototype excel import? Note that these are not the same! In the prototype that wouldn't work (and I'm glad it doesn't).

stefjoosten commented 3 years ago

Short answer: To the Haskell excel import.

Longer answer: Currently, you can use the command ampersand data-analysis to generate relation definitions from a .xlsx-file. You can then paste or import those definitions in the script that includes this .xlsx-file. The INCLUDE statement in Ampersand does NOT generate those relation definitions by default, nor is there an option to do so, because we don't like options that change the semantics of statements very much.

hanjoosten commented 3 years ago

@stefjoosten: are you referring to the Haskell excel import, or the prototype excel import? Note that these are not the same! In the prototype that wouldn't work (and I'm glad it doesn't).

TLDR: Haskell import only!

Long answer: During runtime (i.e. in the prototype) there is no means at all to add relations to the model. In general, that would imply to change the current database schema. So of course, this change can only affect the stage before the initial population is defined.

hanjoosten commented 3 years ago

@hanjoosten Consider it done! You may use the RELATION-statement-with-population as a substitute for the POPULATION-statement that implies a relation. The following statement does exactly what you describe:
RELATION access [Trough*Mammal]
 = [("1", "aap")]
That works in an ADL-file. For the xlsx-file, I recall having built that relations are generated alongside the population. Precisely this purpose, I might add.

This isn't done at all. Yes, the RELATION statement has an optional part to define population in it. This isn't relevant in this discussion. I am discussing the POPULATION statement, because that is handled in the same way by the typechecker as the handling of .xlsx file.

A testcase could be the following script, that currently fails, because of the use of an undefined relation:

CONTEXT Issue1137

POPULATION pred[State*State] CONTAINS
   [ ("2", "1")
   ; ("3", "2")
   ; ("4", "3")
   ]

ENDCONTEXT

stefjoosten commented 3 years ago

So apparently you propose that the statement (P)

POPULATION pred[State*State] CONTAINS
   [ ("2", "1")
   ; ("3", "2")
   ; ("4", "3")
   ]

gets the same semantics as (R)

RELATION pred[State*State] =
   [ ("2", "1")
   ; ("3", "2")
   ; ("4", "3")
   ]

Let me revisit the question you raised in the first comment of this issue, I propose to leave the syntax of (P) as is, so the compiler can latch this population onto an existing relation and conserve the full precision of error messages in the type system. For a population with an implicit definition of the relation, I propose the syntax of (R). This gives the user a syntax that has the minimal obligation to specify the relation (as @sjcjoosten point out above: you have to state name and type anyway if you want to define a relation).

As for your proposal: I don't see a purpose for having the same semantics for these two statements.

Now back to your claim "This isn't done at all". I obviously must be missing something. Please explain.

hanjoosten commented 1 year ago

In PR #1252 A change was made to the code that broke this issue.

The intent is that ONLY when using ampersand data-analysis with an .xlsx file, the relations are added implicitly. However, since this change, ANY command that uses an .xlsx file will add the relations implicitly. This is wrong.

The P_Context that is created inside xlsx.hs should take into account whether or not the data-analysis command is used. Only when this is the case, the relations should be incorporated into the P_Context.

AmpersandTarski / Ampersand

Allow implicit definition of relations by .xlsx files #1137