stefjoosten commented 3 years ago

Problem

The following screenshot shows that some relations, e.g. datetime, have excessive amounts of instances: From a practical point of view, there is no reason to keep such data stored in the database. It is unsure whether this leads to performance problems, which we won't analyze in the current issue. This issue is weakly related to issue #560.

Analysis

At the moment all atoms are stored in a concept list. For atoms like DATETIME that are stored in the concept table DateTime, this is clearly excessive. So not every concept needs a concept list in the database. But which? Of which concepts can we prove at compile-time that they don't need a concept table? (sidestep: The Haskell function sqlConceptPlug :: FSpec -> A_Concept -> PlugSQL gives us the concept table in the compiler.)

So when is a concept table used? For this, we have to analyze terms, which are the only source of using concept tables. If a term contains a subtermI[C], the compiler generates a query that uses the concept table of C.

# Proposed Solution # Impact

RieksJ commented 3 years ago

I don't think there is a real issue here considering all this is meant to run prototypes. It may be different if one were to take the step to production.

Here is my argumentation: First, I think that in prototypes, only the concept DateTime and the relation lastAccess[SESSION*DateTime suffer from such large populations. The reason is that in the log files, I see no other tables than DateTime and lastAccess being populated all the time. I would have noticed that, and I would also have tried to color these parts of the log files grey so it would not distract developers from the 'real' populations they want to see:

I guess it would become a problem if memory consumption became too big. That might also be a signal to a developer to get a real computer, as databases should be able to cope with lots of data. Another problem might be that operations such as ; or <> might need too much time. That could be(come) very real, but afaik, the relation lastAccess isn't used by anyone but me, because developers do not know it exists. I use it to generate error messages that show on the screen for a limited time; with a time-out (saving the value of lastAccess when showing the error message on the screen, and comparing it with the current value of lastAccess in subsequent exec-engine runs allows me to show the message for at least (say) 5 seconds, after which the next interaction will erase it.

Michiel-s commented 3 years ago

Hi @hanjoosten, I would like to reopen this issue and discuss a few options here first. I do recognize the problem addressed here. Having all atoms with scalar types stored in a concept table also has performance issues regarding amount of conjuncts checked.

I need some more time to formulate a proposal and think it's valuable to discuss that in person with you and/or @stefjoosten first.

Michiel-s commented 3 years ago

After discussion with @hanjoosten we decided to see if a solution is feasible. We think it is, with limited effort.

A proposal will follow soon.

hanjoosten commented 3 years ago

So when is a concept table used? For this, we have to analyze terms, which are the only source of using concept tables. If a term contains a subtermI[C], the compiler generates a query that uses the concept table of C.

This sounds good. However, is this true in general? If so, it would be ease to split all concepts into two categories. Ones that do and ones that don't need a concepttable. The TTYPE of such concept is irrelevant to this splitting. @sjcjoosten , do you agree with @stefjoosten on this?

sjcjoosten commented 3 years ago

V and I both use concept tables, and are as far as I know its only source, although the concept tables don't always have to be used whenever a V or I is used: in expressions like r;V;s or r /\ I, we don't even need a concept table. As a first step, we can choose to disregard this and treat such expressions as using concept tables. However, that means that the performance issues regarding amount of conjuncts checked might not change (as the code that checks the conjunct in itself may introduce the concept tables). Note that the Epsilon relations might stand in for I as well.

I'm not sure at what point in the compiler we can be aware of 'all expressions', and there are several caveats here:

I'm afraid that rules that call the exec-engine might refer to expressions that we might miss. I don't know the input language for the exec engine, but Michiel does, so he might be able to tell us whether this change might potentially make the writing of exec-engine rules any harder than it already is.
The compiler performs certain optimisations on the relation algebra as well as the SQL. As a first step, I would suggest only looking at terms after said optimisations. However, that might be annoying in practice: something seemingly trivial like adding an ISA could cause a concept table to be generated. Especially so as the prototype grows in complexity and these optimisations become more necessary, it will become more and more difficult to avoid generating these tables, and I don't think that this is something that can be solved in the compiler alone: I believe there might exist scripts that force the compiler to choose between generating one of two concept tables without a mathematical reason as to why it should prefer one over the other. Such information would therefore need to come from the user. If we want to eventually go that route, and I think this should not keep us from taking the first step first, we can, and then this issue becomes related to #722.

Michiel-s commented 3 years ago

I'm afraid that rules that call the exec-engine might refer to expressions that we might miss. I don't know the input language for the exec engine, but Michiel does, so he might be able to tell us whether this change might potentially make the writing of exec-engine rules any harder than it already is.

Input language for the exec engine are Ampersand expressions in the violation statement of these rules. The exec-engine, nor the prototype framework itself works with expressions. Only generated output, including SQL, from the compiler. Therefore, I don't see any issues here.

We do however need to adapt the prototype framework to such a change, because the code needs to be aware if there is a concept table that needs to be (de)populated when the user is interacting with the application. That info needs to come from the compiler. In the generated output file concepts.json we could do that as follows:

Current output for e.g. a DateTime concept:

[{
    "id": "DateTime",
    "label": "DateTime",
    "type": "DATETIME",
    "defaultViewId": null,
    "conceptTable": {
        "name": "DateTime",
        "cols": [ "DateTime" ]
    },
    "directGens": [],
    "directSpecs": [],
    "generalizations": [],
    "specializations": [],
    "largestConcept": "DateTime",
    "interfaces": [],
    "affectedConjuncts": [ "conj_123", "etc" ]
}]

will become: notice the null for the conceptTable

[{
    "id": "DateTime",
    "label": "DateTime",
    "type": "DATETIME",
    "defaultViewId": null,
    "conceptTable": null,
    "directGens": [],
    "directSpecs": [],
    "generalizations": [],
    "specializations": [],
    "largestConcept": "DateTime",
    "interfaces": [],
    "affectedConjuncts": [ "conj_123", "etc" ]
}]

I think that will do.

AmpersandTarski / Ampersand

Don't store scalar atoms in own concept table if not needed #1156

Problem

Analysis