Define input, output, intermediate data nodes

FlorianJacta commented 3 months ago

Description

The goal of this issue is to discuss what input, output, and intermediate data nodes mean.

Solution Proposed

To my mind, the concepts of input, output, and intermediate data nodes are relative to the DAG.

In my opinion, <data node>.is_input, for example, doesn't have a meaning by itself.

This concept should be attached to the objects representing a DAG:

Config (even if the config does not represent a DAG directly, it can represent multiple scenario configs that create a DAG)
Scenario / Scenario Config
Sequence

In other terms:

Config.inputs should return the list of inputs corresponding to the DAG created by all the Scenario Configs
Config.outputs should return the list of outputs corresponding to the DAG created by all the Scenario Configs
<Scenario>.inputs should return the list of inputs corresponding to the DAG created by THE Scenario (Config)
<Scenario>.outputs should return the list of outputs corresponding to the DAG created by THE Scenario (Config)
<Sequence>.inputs should return the list of inputs corresponding to the DAG created by THE Sequence
<Sequence>.outputs should return the list of outputs corresponding to the DAG created by THE Sequence

I think the inputs/outputs of interest for the Data Node Selector are the ones relative to the whole Config.

Impact of Solution

No response

Additional Context

No response

Acceptance Criteria

[ ] Ensure new code is unit tested, and check code coverage is at least 90%.
[ ] Create related issue in taipy-doc for documentation and Release Notes.
[ ] Check if a new demo could be provided based on this, or if legacy demos could benefit from it.
[ ] Ensure any change is well documented.

Code of Conduct

[X] I have checked the existing issues.
[ ] I am willing to work on this issue (optional)

trgiangdo commented 2 months ago

For the global Config:

Config.inputs includes the data nodes that are an input of at least one task but are not an output of any task
Config.outputs includes the data nodes that are an output of at least one task but are not an input of any task
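As a rough illustration of the two definitions above, here is a minimal sketch in plain Python. It does not use the taipy API: the `classify` function, the `(inputs, outputs)` task tuples, and the `raw`/`clean`/`report` ids are all hypothetical names chosen for the example.

```python
# Illustrative sketch only: plain Python, not the taipy API.
# Each task is modeled as an (input_ids, output_ids) pair of data node config ids.

def classify(tasks):
    """Split the data nodes of a DAG into inputs, outputs, and intermediates."""
    consumed = set()  # data nodes read by at least one task
    produced = set()  # data nodes written by at least one task
    for task_inputs, task_outputs in tasks:
        consumed.update(task_inputs)
        produced.update(task_outputs)
    inputs = consumed - produced          # read but never written
    outputs = produced - consumed         # written but never read
    intermediates = consumed & produced   # both written and read
    return sorted(inputs), sorted(outputs), sorted(intermediates)


# Hypothetical two-task chain: raw -> clean -> report
tasks = [(["raw"], ["clean"]), (["clean"], ["report"])]
inputs, outputs, intermediates = classify(tasks)
print(inputs, outputs, intermediates)  # ['raw'] ['report'] ['clean']
```

Under these definitions, a data node that is both produced and consumed within the same set of tasks falls into neither list, which is what the thread calls an intermediate data node.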

For <Scenario>.inputs, <Scenario>.outputs, <Sequence>.inputs, and <Sequence>.outputs, we already have similar APIs in the Submittable class. We can expose those if needed.

FlorianJacta commented 2 months ago

This seems right to me!

jrobinAV commented 2 months ago

The APIs you mentioned @trgiangdo are contextual, meaning they are Config or Submittable APIs. So, we can interpret the API as follows:

Config.inputs: From the Config standpoint, here are all the input data node configs.
Config.outputs: From the Config standpoint, here are all the output data node configs.
my_scenario.inputs: From the my_scenario standpoint, here are all the input data nodes.
my_scenario.my_sequence.outputs: From the my_sequence standpoint, here are all the output data nodes.
etc.

The question is slightly different, though. It concerns the default context when there is no explicit one. How can we answer the question, "Is this data node an input?" independently from any context? @FlorianJacta proposes using Config as the default context, but I am not sure it is intuitive enough. Moreover, the question has been raised in the data node selector filter, which is exposed to the end user. There is a high probability the end user does not know anything about the config DAGs.

Let's take a complex example.

from datetime import datetime
from taipy import Config, Core, Frequency, Scope, create_scenario

def identity(value):
    return value

d1 = Config.configure_data_node("d1", scope=Scope.GLOBAL)
d2 = Config.configure_data_node("d2", scope=Scope.CYCLE)
d3 = Config.configure_data_node("d3", scope=Scope.SCENARIO)
d4 = Config.configure_data_node("d4", scope=Scope.SCENARIO)

t1 = Config.configure_task("t1", function=identity, input=[d1], output=[d2])
t2 = Config.configure_task("t2", function=identity, input=[d2], output=[d3])

t3 = Config.configure_task("t3", function=identity, input=[d1, d2, d3], output=[d4])

s1 = Config.configure_scenario("s1", task_configs=[t1, t2],
                               sequences={"seq1": [t1], "seq2": [t2]},
                               frequency=Frequency.DAILY)
s2 = Config.configure_scenario("s2", task_configs=[t3], frequency=Frequency.DAILY)

Core().run()
scenario_1 = create_scenario(s1, datetime(2021, 1, 1))
scenario_2 = create_scenario(s1, datetime(2021, 1, 2))
scenario_3 = create_scenario(s2, datetime(2021, 1, 1))
scenario_4 = create_scenario(s2, datetime(2021, 1, 2))

The piece of code instantiates the following data nodes:

One global-scoped data node: d1
Two cycle-scoped data nodes: scenario_1.d2, scenario_2.d2
Six scenario-scoped data nodes: scenario_1.d3, scenario_2.d3, scenario_3.d3, scenario_4.d3, scenario_3.d4, scenario_4.d4

What are the inputs, the outputs, and the intermediate data nodes? As an end-user, I really don't know what I am expecting as an answer.

FlorianJacta commented 2 months ago

I need clarification on what is confusing about this. Why is the definition above not the expected definition?

jrobinAV commented 2 months ago

As an end user, listing all input data nodes is not self-explanatory. I need a good understanding of the whole config, with all the scenario configs, all the sequences, etc., to understand what I am going to get.

Let's imagine I have a role that only allows me to view scenarios from the second scenario config s2. So, I am expecting to get [d1, scenario_1.d2, scenario_2.d2, scenario_3.d3, scenario_4.d3] as a result when asking for inputs. Your proposal will only return [d1].

trgiangdo commented 2 months ago

Do we have a role system that can explicitly set the access role of a user to some specific scenarios? I did not know that.

Anyway, from the example that you declare:

Config.inputs = [d1]
Config.outputs = [d4]

When we call Config..., the list will be a list of data node configurations.

For the scenario entities:

scenario_1.inputs = [scenario_1.d1]
scenario_1.outputs = [scenario_1.d3]
scenario_1.seq_1.inputs = [scenario_1.d1]
scenario_1.seq_1.outputs = [scenario_1.d2]
scenario_1.seq_2.inputs = [scenario_1.d2]
scenario_1.seq_2.outputs = [scenario_1.d3]
scenario_2 is the same as scenario_1

scenario_3.inputs = [scenario_3.d1, scenario_3.d2, scenario_3.d3]
scenario_3.outputs = [scenario_3.d4]
scenario_4 is the same as scenario_3

The scope of the data node doesn't affect the outcome of these APIs I think
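To make the context dependence concrete, the following sketch recomputes the role of d2 in each context of the example. It is plain Python over sets, not the taipy API: the `TASKS` and `CONTEXTS` tables and the `role` helper are hypothetical, and scopes are deliberately ignored, in line with the remark above.

```python
# Illustrative sketch only: plain Python sets, not the taipy API.
# Tasks mirror the example configuration: name -> (inputs, outputs).
TASKS = {
    "t1": ({"d1"}, {"d2"}),
    "t2": ({"d2"}, {"d3"}),
    "t3": ({"d1", "d2", "d3"}, {"d4"}),
}

# Each context is described by the tasks that make up its DAG.
CONTEXTS = {
    "Config": ["t1", "t2", "t3"],
    "s1": ["t1", "t2"],
    "s1.seq1": ["t1"],
    "s1.seq2": ["t2"],
    "s2": ["t3"],
}


def role(node, task_names):
    """Return the role of `node` in the DAG formed by `task_names`."""
    consumed = set().union(*(TASKS[t][0] for t in task_names))
    produced = set().union(*(TASKS[t][1] for t in task_names))
    if node in consumed and node in produced:
        return "intermediate"
    if node in consumed:
        return "input"
    if node in produced:
        return "output"
    return "absent"


roles_of_d2 = {name: role("d2", names) for name, names in CONTEXTS.items()}
for name, d2_role in roles_of_d2.items():
    print(f"{name}: d2 is {d2_role}")
```

The same data node config comes out as intermediate for the whole Config and for s1, as an output for seq1, and as an input for seq2 and s2, which is why a context-free answer is hard to define.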

jrobinAV commented 2 months ago

@trgiangdo I was not specifically talking about Taipy enterprise roles. My example was confusing. Let me rephrase the sentence. 'Let's imagine I have a user interface on which I only view scenarios from the second scenario config s2.'

What would be the result of tp.get_inputs(), without any explicit context? Or in other words, what would be the result of scenario_1.d2.is_input()? In such a use case, I am expecting as an answer:

tp.get_inputs() == [d1, scenario_1.d2, scenario_2.d2, scenario_3.d3, scenario_4.d3]
scenario_1.d2.is_input() == True

Both will be false with Florian's proposal.

FlorianJacta commented 2 months ago

In my opinion, <data node>.is_input, for example, doesn't have a meaning by itself.

This is what I wrote in the issue.

tp.get_inputs() doesn't mean anything to me

A Data Node is input/output depending on the context.

trgiangdo commented 2 months ago

I don't think tp.get_inputs() or <DataNode>.is_input() are possible at all.

Florian and I agree on the six APIs: Config.inputs, Config.outputs, <Scenario>.inputs, <Scenario>.outputs, <Sequence>.inputs, and <Sequence>.outputs, I think.

As for scenario_1.d2.is_input() == True, it is correct, right? Since we are looking at the data node in the scenario context. But I don't see how we can implement it, because it also needs to know which scenario is calling it, so .is_input() is not possible and makes no sense.

jrobinAV commented 2 months ago

Are you saying I should read your description more carefully? 🤣 If so, I believe you are right...

I misunderstood your proposal. Sorry.

trgiangdo commented 2 months ago

So do we agree on the requirements now?

jrobinAV commented 2 months ago

After a better reading, I now understand the proposal. I am okay with the concepts exposed in the Taipy core package. But I believe it does not answer the issue, in particular on the sentence from the description that is, in the end, the root motivation of the issue:

"I think the inputs/outputs of interest for the Data Node Selector are relative to the whole Config."

I strongly believe we don't want to expose the config inputs and outputs in the data node selector. The config is a developer concept, not an end-user concept. The end-user will not easily understand the input and output data nodes. What is needed in the data node selector is another concept that sometimes (mostly in demos) overlaps with the developer input-output data node concept. My understanding is that the end-user wants to access two kinds of data nodes quickly:

The ones to eventually edit so he/she can recompute the scenario and propagate the changes to other data nodes. These data nodes don't match the config inputs, even if they have an overlap with the developer's inputs.
The ones to visualize and analyze to understand or validate a solution. These data nodes don't match the config outputs, even if they have an overlap with the developer's outputs.

jrobinAV commented 1 week ago

A tradeoff has been proposed. The idea is to display in the data node selector the data nodes with a scenario scope in topological order. With this proposal, the scenario is the context used to set the data node rank.

A more formal proposal should come soon.
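Pending that formal proposal, here is one possible reading of "topological order with the scenario as context", sketched in plain Python. It is an assumption, not the agreed design: the `rank_data_nodes` helper and its definition of rank (0 for a data node with no producer, otherwise one more than the highest rank among the inputs of its producing tasks) are hypothetical, and the sketch ranks every data node of the scenario without filtering by scope.

```python
# Illustrative sketch only: plain Python, not the taipy API.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Tasks of scenario config s1 from the example: (inputs, outputs).
s1_tasks = [(["d1"], ["d2"]), (["d2"], ["d3"])]


def rank_data_nodes(tasks):
    """Map each data node to its rank within the DAG formed by `tasks`."""
    predecessors = {}
    for task_inputs, task_outputs in tasks:
        for node in task_inputs:
            predecessors.setdefault(node, set())
        for node in task_outputs:
            predecessors.setdefault(node, set()).update(task_inputs)
    ranks = {}
    # static_order() yields every node after all of its predecessors,
    # so the ranks of the predecessors are always known at this point.
    for node in TopologicalSorter(predecessors).static_order():
        ranks[node] = 1 + max((ranks[p] for p in predecessors[node]), default=-1)
    return ranks


ranks = rank_data_nodes(s1_tasks)
print(sorted(ranks, key=ranks.get))  # ['d1', 'd2', 'd3']
```

With the tasks of s2 instead, d1, d2, and d3 would all get rank 0 and d4 rank 1, so the same data node can be ranked differently depending on which scenario provides the context.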

Avaiga / taipy