Interactions-HSG / grpcwot

A simple command line tool to generate a Thing Description from Protocol Buffers
GNU General Public License v3.0

feat: Basic Interaction Affordance Classification and CLI #14

Open S-Eibl opened 2 years ago

S-Eibl commented 2 years ago

The following changes are included in this PR:

iomz commented 2 years ago

I have several requests before discussing the content:

S-Eibl commented 2 years ago

Further documentation of this PR:

Message Parsing

Messages in the proto file consist of fields. These fields can have scalar value types or message types. This PR extends the list of supported scalar value types to the list provided here. If a field is of message type, it references another message defined in the proto file. Since the Walk pattern for scanning through messages provided by the emicklei/proto package does not allow for this cross-referencing, the references have to be stored in an intermediate slice in the builder (lm). This slice consists of tuples storing the parent message containing the field (pm), the field's type, which is also the name of the corresponding message (t), and the field's name (n). These values are needed to extend the parent message with the referenced message's structure afterwards. By iterating through the slice - after the walk over the messages has filled a map containing all messages as data schemas - these data schemas can be expanded by the fields that are still missing.

It is important to start with those messages that do not reference other messages; otherwise data structures would have to be filled in at different places in the resulting map, making it very complex and inefficient to search for gaps. The slice of referenced messages forms a graph that has to be traversed from the lowest levels upwards.

Messages can also be nested, and a nested message can be referenced from the outer scope through its parent message, which encloses it, as well as by its direct name from inside that scope. To cope with this problem, a message's name should be stored with its full path in the map of messages. Although a slice sounds suitable for this, using a slice as a map key is not efficient and would need further implementation work. Therefore the paths are stored as strings, in the same way they would be addressed from the global scope: all upper levels delimited by a dot, ending with the message's name itself. As this stores the exact path to a specific nested message, it becomes possible to search for the message closest to the scope of the field that references it. Knowing the scope of the field and having the messages stored together with their scope allows traversing up the field's path, searching for the message with the closest scope.
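
As an illustration of the reference-resolution step described above, here is a minimal, self-contained Go sketch. It is not the actual grpcwot builder code: the names dataSchema, messageRef and injectReferences are stand-ins for the builder's data schemas and the lm slice of (pm, t, n) tuples.

package main

import "fmt"

// dataSchema is a simplified stand-in for the WoT DataSchema built per message.
type dataSchema struct {
    Type       string
    Properties map[string]*dataSchema
}

// messageRef corresponds to one tuple in the intermediate slice (lm):
// the parent message (pm), the referenced message type (t) and the field name (n).
type messageRef struct {
    parent    string
    fieldType string
    fieldName string
}

// injectReferences copies the schema of each referenced message into the
// corresponding field of its parent message, after all messages have been parsed.
func injectReferences(schemas map[string]*dataSchema, refs []messageRef) {
    for _, r := range refs {
        parent, okParent := schemas[r.parent]
        child, okChild := schemas[r.fieldType]
        if okParent && okChild {
            parent.Properties[r.fieldName] = child
        }
    }
}

func main() {
    schemas := map[string]*dataSchema{
        "Test":       {Type: "object", Properties: map[string]*dataSchema{"test2": {Type: "integer"}}},
        "LinkedTest": {Type: "object", Properties: map[string]*dataSchema{"test1": {Type: "integer"}}},
    }
    refs := []messageRef{{parent: "Test", fieldType: "LinkedTest", fieldName: "linkedTest"}}
    injectReferences(schemas, refs)
    fmt.Println(schemas["Test"].Properties["linkedTest"].Properties["test1"].Type) // prints "integer"
}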

Automatic Interaction Affordance Classification

By defining some rules for a basic classification into properties, actions, or events, the RPC functions from the proto file can be provisionally classified. These rules look for get and set functions, with their corresponding request and return types, to find properties. After that, RPC functions with no request and only a return type are classified as events, and everything that has not been filtered is assumed to possibly be an action. In the case of properties there might be both a get and a set function. While walking through the RPC functions, if a function fulfills the basic rules to be a property, the already stored properties are scanned for a corresponding getter or setter function. If one is found, the two functions are stored together to form one property.
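
A rough sketch of such a rule in Go is shown below. This is not the actual check implemented in the builder; the rpcFunc type and the Get/Set prefix matching are assumptions used only to illustrate the idea.

package main

import (
    "fmt"
    "strings"
)

type rpcFunc struct {
    Name        string
    RequestType string // empty if the RPC takes no request payload
    ReturnType  string
}

// classify applies the basic rules: Get/Set prefixed functions become property
// candidates, functions with only a return type become event candidates, and
// everything else is assumed to be an action.
func classify(f rpcFunc) string {
    switch {
    case strings.HasPrefix(f.Name, "Get") || strings.HasPrefix(f.Name, "Set"):
        return "property"
    case f.RequestType == "" && f.ReturnType != "":
        return "event"
    default:
        return "action"
    }
}

func main() {
    fmt.Println(classify(rpcFunc{Name: "GetMode", ReturnType: "Mode"}))                            // property
    fmt.Println(classify(rpcFunc{Name: "StateChanged", ReturnType: "State"}))                      // event
    fmt.Println(classify(rpcFunc{Name: "MoveArm", RequestType: "MoveRequest", ReturnType: "Ack"})) // action
}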

Manual Confirmation

As the automatic classification only provides a basic classification, there might be wrong assignments. To handle these, the CLI user has the possibility to change the assignment of an interaction affordance. The program therefore prints out every single RPC function with its corresponding classification on the console, and the user can either agree with the assignment by pressing enter, or disagree and reassign the RPC function to a different interaction affordance class by typing (r) for property, (a) for action, or (e) for event. If the assignment is changed, the RPC function is shifted to another class. In the case of properties, this can affect the combination of two RPC functions (get/set).
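
A minimal sketch of such a confirmation loop could look as follows. The key bindings, prompt text, and the map of classifications are illustrative only and do not reproduce the CLI's actual implementation.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // provisional classification produced by the automatic step (illustrative data)
    classes := map[string]string{"GetMode": "property", "StateChanged": "event"}
    reader := bufio.NewReader(os.Stdin)
    for name, class := range classes {
        fmt.Printf("%s is classified as %s. Press enter to accept or type p/a/e to change: ", name, class)
        input, err := reader.ReadString('\n')
        if err != nil {
            break
        }
        switch strings.TrimSpace(input) {
        case "p":
            classes[name] = "property"
        case "a":
            classes[name] = "action"
        case "e":
            classes[name] = "event"
        }
    }
    fmt.Println(classes)
}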

Predefined Configuration

Entering or approving all RPC functions manually can become cumbersome, especially when proto files are larger, with a huge number of RPC functions. Therefore the program not only writes the resulting TD file, it also writes a JSON file containing the classification applied to the RPC functions. This file lists all functions together with the interaction affordance class they were assigned to and the name of the interaction affordance corresponding to that RPC function. This makes it easier to repeat the TD generation process from the proto file if necessary. The CLI can therefore be run with the -c flag, where the config file's location can be specified.
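
The entries of that JSON file could be modelled roughly like this in Go; the struct and JSON field names here are assumptions for illustration and are not necessarily the ones grpcwot actually writes.

package main

import (
    "encoding/json"
    "fmt"
)

// affordanceClassification pairs one RPC function with the interaction
// affordance class and the affordance name it was assigned to.
type affordanceClassification struct {
    Function        string `json:"function"`
    AffordanceClass string `json:"class"` // "property", "action" or "event"
    AffordanceName  string `json:"affordance"`
}

func main() {
    // reading a previously written classification config (illustrative content)
    data := []byte(`[
        {"function": "GetMode", "class": "property", "affordance": "Mode"},
        {"function": "StateChanged", "class": "event", "affordance": "StateChanged"}
    ]`)
    var config []affordanceClassification
    if err := json.Unmarshal(data, &config); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", config)
}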

Reduced Generation

The CLI also accepts the -r flag. With this flag it is possible to hand over a list, in JSON format, of the RPC functions that should be included in the Thing Description. When running in reduced mode, only those RPC functions will be scanned and converted into interaction affordances.
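
As a sketch only (the flag handling in the actual CLI may differ), reading such a list and filtering the RPC functions could look like this:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // contents of the file passed via -r: a JSON array of RPC function names
    reduced := []byte(`["GetMode", "SetPosition"]`)
    var wanted []string
    if err := json.Unmarshal(reduced, &wanted); err != nil {
        panic(err)
    }
    include := map[string]bool{}
    for _, name := range wanted {
        include[name] = true
    }
    // only the listed RPC functions are converted into interaction affordances
    for _, rpc := range []string{"GetMode", "SetPosition", "Reset"} {
        if include[rpc] {
            fmt.Println("converting", rpc)
        }
    }
}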

Output changes

The value passed in the -o flag is no longer the name and path of the output file (TD) itself. It should now be set to the directory where the Thing Description and the classification configuration json should be stored.

iomz commented 2 years ago

Further documentation of this PR:

Message Parsing

Messages in the proto file consist of fields. These fields can have scalar value types or message types. This PR extends the list of scalar value types available to the list provided here.

If a field is of message type, it references another message defined in the proto file.

I didn't understand what this means. Can you give me an example?

Since the Walk pattern for scanning through messages provided by the emicklei/proto package does not allow for this cross-referencing, the references have to be stored in an intermediate slice in the builder (lm).

This slice consists of tuples storing the parent message containing the field (pm), the field's type, which is also the name of the corresponding message (t), and the field's name (n).

Also an example?

These values are needed to extend the parent message with the referenced message's structure afterwards. By iterating through the slice - after the walk over the messages has filled a map containing all messages as data schemas - these data schemas can be expanded by the fields that are still missing.

What are the missing fields?

It is important to start with those messages that do not reference other messages; otherwise data structures would have to be filled in at different places in the resulting map, making it very complex and inefficient to search for gaps. The slice of referenced messages forms a graph that has to be traversed from the lowest levels upwards.

Also, please provide an example of the graph.

Messages can also be nested, and a nested message can be referenced from the outer scope through its parent message, which encloses it, as well as by its direct name from inside that scope.

To cope with this problem, a message's name should be stored with its full path in the map of messages.

What is and why is the problem?

Although a slice sounds suitable for this, using a slice as a map key is not efficient and would need further implementation work. Therefore the paths are stored as strings, in the same way they would be addressed from the global scope: all upper levels delimited by a dot, ending with the message's name itself. As this stores the exact path to a specific nested message, it becomes possible to search for the message closest to the scope of the field that references it. Knowing the scope of the field and having the messages stored together with their scope allows traversing up the field's path, searching for the message with the closest scope.

Example?

Automatic Interaction Affordance Classification

By defining some rules for a basic classification into properties, actions, or events, the RPC functions from the proto file can be provisionally classified. These rules look for get and set functions, with their corresponding request and return types, to find properties.

Does it scale? Or is it assumed that the source is strictly following a setter/getter naming convention? How does it look up? regex?

Example and test case?

After that, RPC functions with no request and only a return type are classified as events,

Example and test case?

and everything that has not been filtered is assumed to possibly be an action.

Example and test case?

In the case of properties there might be both a get and a set function. While walking through the RPC functions, if a function fulfills the basic rules to be a property, the already stored properties are scanned for a corresponding getter or setter function. If one is found, the two functions are stored together to form one property.

How are they scanned?

Manual Confirmation

As the automatic classification only provides a basic classification, there might be wrong assignments. To handle these, the CLI user has the possibility to change the assignment of an interaction affordance. The program therefore prints out every single RPC function with its corresponding classification on the console, and the user can either agree with the assignment by pressing enter, or disagree and reassign the RPC function to a different interaction affordance class by typing (r) for property,

Not (p)?

(a) for action, and (e) for event. If the assignment is changed, the RPC function is shifted to another class. In the case of properties, this can affect the combination of two RPC functions (get/set).

OK, I got this part. An interactive example?

Predefined Configuration

Entering or approving all RPC functions manually can become cumbersome, especially when proto files are larger, with a huge number of RPC functions. Therefore the program not only writes the resulting TD file, it also writes a JSON file containing the classification applied to the RPC functions. This file lists all functions together with the interaction affordance class they were assigned to and the name of the interaction affordance corresponding to that RPC function. This makes it easier to repeat the TD generation process from the proto file if necessary. The CLI can therefore be run with the -c flag, where the config file's location can be specified.

Why is it a json file? I can't propose a better alternative (e.g., a log file) either but was there any design decision to be recycle/reused later?

Reduced Generation

The CLI also allows the -r flag. With this flag it is possible to hand over a list of RPC functions that should be included in the Thing Description in json format. When running in reduced mode only those RPC functions will be scanned and converted into interaction affordances.

Otherwise what will be included? Example?

Output changes

The value passed in the -o flag is no longer the name and path of the output file (TD) itself. It should now be set to the directory where the Thing Description and the classification configuration json should be stored.

OK.

iomz commented 2 years ago

I have several requests before discussing the content:

  • don't reuse the branch name for the other PR, and make it descriptive.
  • see this comment on Stack Overflow: "regular file names are lower case, short, and without any sort of underscore or space". It is very unusual to see lower camel case. The file names for the tests should correspond to the package (i.e., there should be only two test files at the moment: builder_test.go and cmd/prototd/main_test.go).
  • in the test files, comments on what it is testing, why it is testing, and test cases for both fail and pass are missing.
  • the code in test_utils.go should reside either in the package or the corresponding test file.
  • as a baseline, check with the linter before committing changes:
14:34 iomz@helianthus:~/go/src/github.com/Interactions-HSG/grpcwot % golangci-lint run
configFileBasedClassification_test.go:10:6: `expectedAffClass_` is unused (deadcode)
type expectedAffClass_ struct {
     ^
configFileBasedClassification_test.go:16:6: `newEmptyExpectedAffClass_` is unused (deadcode)
func newEmptyExpectedAffClass_() *expectedAffClass {
     ^
messageParsing_test.go:78:6: `getBuilderDataSchemasAsJSON` is unused (deadcode)
func getBuilderDataSchemasAsJSON(b *builder) (j []byte) {
     ^
builder.go:174:2: S1006: should use for {} instead of for true {} (gosimple)
        for true {
        ^
builder.go:354:2: S1006: should use for {} instead of for true {} (gosimple)
        for true {
        ^
builder.go:261:20: S1019: should use make([]bool, len(b.lm)) instead (gosimple)
        a := make([]bool, len(b.lm), len(b.lm))
                          ^
builder.go:157:3: S1036: unnecessary guard around map access (gosimple)
                if _, ok := b.af.props[c.Name]; ok {
                ^
configFileBasedClassification_test.go:56:2: ineffectual assignment to err (ineffassign)
        err = json.Unmarshal(byteValue, &b.ac)
        ^

To start with, can you please work on these?

S-Eibl commented 2 years ago

Further documentation of this PR:

Message Parsing

Messages in the proto file consist of fields. These fields can have scalar value types or message types. This PR extends the list of scalar value types available to the list provided here.

If fields are of type message they reference to another message defined in the proto file.

I didn't understand what this means. Can you give me an example?


message Test {
  int32 test2 = 1;                  // field of type scalar value, in particular int32
  LinkedTest linkedTest = 2;        // field of type message, referencing the message LinkedTest as the type of the field
}

message LinkedTest {
  int32 test1 = 1;                  // field of type scalar value, in particular int32
}

> 
> > Since the Walk pattern for scanning through messages provided by the [emicklei/proto package](https://github.com/emicklei/proto) does not allow for this cross-referencing, the references have to be stored in an intermediate slice in the builder (lm).
> 
> > This slice consists of tuples storing the parent message containing the field (pm), the field's type, which is also the name of the corresponding message (t), and the field's name (n).
> 
> Also an example?
>
The stored messages, in the form of DataSchemas, for the previous example would look like this before the injection is done:

{ "Test": { "type": "object", "properties": { "test2": { "type": "integer" } } }, "LinkedTest": { "type": "object", "properties": { "test1": { "type": "integer" } } } }

The slice lm for this example contains one tuple `{pm: "Test", t: "LinkedTest", n: "linkedTest"}` to store the information that message "Test" has a field of type "LinkedTest" with name "linkedTest" which refers to another message. This information is sufficient to inject the DataSchema of "LinkedTest" into the field "linkedTest" in the message "Test" after all messages were parsed to DataSchemas. After the injection the data schemas look like this:

{ "Test": { "type": "object", "properties": { "test2": { "type": "integer" }, "linkedTest": { "type": "object", "properties": { "test1": { "type": "integer" } } } } }, "LinkedTest": { "type": "object", "properties": { "test1": { "type": "integer" } } } }


> 
> > These values are necessary to extend the parent message with the referenced message's structure afterwards. By iterating through the slice - after the walk over the messages has filled a map containing all messages as data schemas - these data schemas could be expanded by the missing fields.
> 
> What are the missing fields?

I described fields with non-scalar value types as missing fields at this stage, because when messages are converted to DataSchemas these fields are not added to the DataSchema right away. In this step, the "missing or left out fields" depend on other messages that may still have to be parsed, according to their position in the proto file. Those fields will be added in the next step. Therefore the slice of tuples (lm) contains the information about where and which fields have to be added to specific DataSchemas.
> 
> > It is important to start with those messages that do not reference other messages; otherwise data structures would have to be filled in at different places in the resulting map, making it very complex and inefficient to search for gaps. The slice of referenced messages forms a graph that has to be traversed from the lowest levels upwards.
> 
> Also, please provide an example of the graph.

The following graph shows only messages and not data schemas. This is not 100% correct, as messages are transformed to DataSchemas before the insertion is done, but it's easier to read this way.
![image](https://user-images.githubusercontent.com/77973550/166931419-4d1ed719-5cf2-4d13-ab8d-19cfaf584985.png)
With this simple example containing 4 messages and 3 tuples in lm, there are different possible orders in which the tuples can be processed. Following the slice's order, the conversion process would look like this:
Step1: Resolving message Test1:
![image](https://user-images.githubusercontent.com/77973550/166931843-24bbe44c-4227-4f03-bc23-4fe8577f39eb.png)
Step2: Resolving message Test2:
![image](https://user-images.githubusercontent.com/77973550/166932083-490558bf-c64d-4ba0-9d34-ac20ac879135.png)
Following this order, in step 1 a new entry had to be added to lm, as message Test1 holds a reference to Test2 and Test1 is copied and injected into Test; in total, four injections are necessary. The easier way is to start with the lowest level, i.e. a message that does not occur as parent message (pm) in the slice of references (lm):
Step1: Resolving message Test2:
![image](https://user-images.githubusercontent.com/77973550/166933118-f7cd1779-24ea-4513-ace8-44613afef8b6.png)
Step2: Resolving message Test1:
![image](https://user-images.githubusercontent.com/77973550/166933193-5d10fd70-c4d2-4049-a3ac-b9ccf9e04dfa.png)
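
The "lowest level first" idea can be sketched as follows; this is an illustration with made-up names (messageRef, resolutionOrder), not the algorithm as implemented in the builder.

package main

import "fmt"

type messageRef struct {
    parent    string
    fieldType string
    fieldName string
}

// resolutionOrder orders the references so that messages which do not
// reference other still-unresolved messages are injected first.
func resolutionOrder(refs []messageRef) []messageRef {
    remaining := append([]messageRef(nil), refs...)
    var ordered []messageRef
    for len(remaining) > 0 {
        progress := false
        for i, r := range remaining {
            // r.fieldType is safe to inject only if it no longer appears
            // as the parent of an unresolved reference.
            isParent := false
            for _, other := range remaining {
                if other.parent == r.fieldType {
                    isParent = true
                    break
                }
            }
            if !isParent {
                ordered = append(ordered, r)
                remaining = append(remaining[:i], remaining[i+1:]...)
                progress = true
                break
            }
        }
        if !progress { // a cycle: fall back to the remaining order
            ordered = append(ordered, remaining...)
            break
        }
    }
    return ordered
}

func main() {
    refs := []messageRef{
        {parent: "Test", fieldType: "Test1", fieldName: "t1"},
        {parent: "Test", fieldType: "Test2", fieldName: "t2"},
        {parent: "Test1", fieldType: "Test2", fieldName: "t2"},
    }
    // Test2 is injected first, then Test1, so every schema is copied only once.
    fmt.Println(resolutionOrder(refs))
}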

> 
> > Messages can also be nested, and a nested message can be referenced from the outer scope through its parent message, which encloses it, as well as by its direct name from inside that scope.
> 
> > To cope with this problem, a message's name should be stored with its full path in the map of messages.
> 
> What is and why is the problem?

Considering the following messages:

message M1 {
  message M2 {
  }
  message M4 {
    M2 t = 1;
  }
}
message M2 {
}
message M3 {
  M2 t = 1;
}

For the field of type `M2` in `M3` the message `M2` from the outer scope should be selected. For the field of type `M2` in `M4` the type should refer to the message `M1.M2`. The problem was that if the DataSchemas are stored in a map with only the message's name as index, there would be a conflict between `M2` and `M1.M2`. Therefore the whole path of nesting has to be stored as the index, and if a field of a nested message refers to a message, the message with the correct name and the closest scope has to be selected. The term "problem" was not ideal; I should have used something like "circumstance".
> 
> > Although a slice sounds suitable for this, using a slice as a map key is not efficient and would need further implementation work. Therefore the paths are stored as strings, in the same way they would be addressed from the global scope: all upper levels delimited by a dot, ending with the message's name itself. As this stores the exact path to a specific nested message, it becomes possible to search for the message closest to the scope of the field that references it. Knowing the scope of the field and having the messages stored together with their scope allows traversing up the field's path, searching for the message with the closest scope.
> 
> Example?

message M1 {
  message M2 {
  }
  message M3 {
    message M4 {
      M2 t = 1;
    }
  }
}
message M2 {
}

When searching for the field type `M2` in `M4`, the search should identify `M1.M2` as the correct message type, as the name fits and it is closest to M4's scope. The search should therefore proceed in the following order:
Search for `M2` in `M1.M3.M4`, i.e. check whether the map of DataSchemas contains the key `M1.M3.M4.M2`: this is not the case.
Search for `M2` in `M1.M3` => `M1.M3.M2`: still not found.
Search for `M2` in `M1` => `M1.M2`: this identifies the correct message/DataSchema.
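
A compact Go sketch of this lookup, assuming the map keys are the dot-delimited message paths (the map values are simplified to booleans here instead of DataSchemas):

package main

import (
    "fmt"
    "strings"
)

// resolveMessage searches for typeName starting at the field's scope and
// walking up one level at a time until the global scope is reached.
func resolveMessage(schemas map[string]bool, fieldScope, typeName string) (string, bool) {
    parts := strings.Split(fieldScope, ".")
    for i := len(parts); i >= 0; i-- {
        key := strings.Join(append(parts[:i:i], typeName), ".")
        if schemas[key] {
            return key, true
        }
    }
    return "", false
}

func main() {
    schemas := map[string]bool{
        "M1":       true,
        "M1.M2":    true,
        "M1.M3":    true,
        "M1.M3.M4": true,
        "M2":       true,
    }
    // The field of type M2 inside M1.M3.M4 resolves to M1.M2, not the global M2.
    fmt.Println(resolveMessage(schemas, "M1.M3.M4", "M2")) // M1.M2 true
}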
> 
> > ### Automatic Interaction Affordance Classification
> > By defining some rules for a basic classification into properties, actions or events the RPC functions from the proto file can be provisionally classified. These rules look for get and set functions with their corresponding request and return type to find properties.
> 
> Does it scale? Or is it assumed that the source is strictly following a setter/getter naming convention? How does it look up? regex?

Actually the classification assumes that the source follows the setter/getter naming convention. I was already thinking about extending or modifying these check functions. It might be interesting to provide the characteristics dynamically, or to store the user's decisions when they change a classification, in order to identify common criteria. Maybe we could discuss these ideas in an issue.
> 
> Example and test case?
> 
> > After that, RPC functions with no request and only a return type are classified as events,
> 
> Example and test case?
> 
> > and everything which has not been filtered is assigned to possibly be an action.
> 
> Example and test case?

Oh, I'm very sorry, I now see that I forgot to include the test files. There are examples and test cases for this behaviour.
> 
> > In case of properties there might be a get and a set function. While walking through the RPC functions, if a function fulfills basic rules to be a property, the already stored properties are scanned, if there is a corresponding getter or setter function. In this case they are stored together to form one property.
> 
> How are they scanned?

There is a map of properties that is filled during the walk through the RPC functions. As long as we assume that the RPC functions follow the naming conventions, we can assume that, for example, an RPC function `setState` would target a property `State`, and if there is an entry in the map for the key `State`, it is possible to check whether the setter fits the getter stored for this property.
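
Sketched in Go (with illustrative names, not the builder's actual ones), the lookup could work like this:

package main

import (
    "fmt"
    "strings"
)

type property struct {
    Name   string
    Getter string // RPC function that reads the property, "" if not seen yet
    Setter string // RPC function that writes the property, "" if not seen yet
}

// addRPC derives the property name from the RPC name and merges the getter
// and setter for the same property into one entry of the map.
func addRPC(props map[string]*property, rpcName string) {
    var propName string
    switch {
    case strings.HasPrefix(rpcName, "Get"):
        propName = strings.TrimPrefix(rpcName, "Get")
    case strings.HasPrefix(rpcName, "Set"):
        propName = strings.TrimPrefix(rpcName, "Set")
    default:
        return // not a property candidate
    }
    p, ok := props[propName]
    if !ok {
        p = &property{Name: propName}
        props[propName] = p
    }
    if strings.HasPrefix(rpcName, "Get") {
        p.Getter = rpcName
    } else {
        p.Setter = rpcName
    }
}

func main() {
    props := map[string]*property{}
    addRPC(props, "GetState")
    addRPC(props, "SetState") // merged into the same property "State"
    fmt.Printf("%+v\n", *props["State"])
}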
> 
> > ### Manual Confirmation
> > As automatic classification already introduced a basic classification there might be wrong assertions. To handle those wrong assertions the cli's user has the possibility to change the assignment of an interaction affordance. The program therefor prints out every single RPC function with its corresponding classification on the console and the user can either agree with the assignment by pressing enter or disagree and reasign the RPC function to a different interaction affordance class by typing in (r) for property,
> 
> Not (p)?

Thank you, yes, that was a typo
> 
> > (a) for action, and (e) for event. If the assignment is changed, the RPC function is shifted to another class. In the case of properties, this can affect the combination of two RPC functions (get/set).
> 
> OK, I got this part. An interactive example?

The following interaction affordances are already classified according to specific criteria. If you want to change the classification for a specific affordance please enter

Predefined Configuration

Why is it a json file? I can't propose a better alternative (e.g., a log file) either but was there any design decision to be recycle/reused later?

It was a simple and lightweight solution, as JSON marshalling is already used in the application and the JSON structure fulfills the requirements for storing these relations, but looking for alternatives might be interesting.

Reduced Generation

The CLI also allows the -r flag. With this flag it is possible to hand over a list of RPC functions that should be included in the Thing Description in json format. When running in reduced mode only those RPC functions will be scanned and converted into interaction affordances.

Otherwise what will be included? Example?

If the -r flag is not set, the program will include every RPC function provided in a valid proto file. An example for a reduced configuration file would look like this:

[
  "GetMode",
  "SetPosition"
]

This forces the program to build a Thing Description containing only the affordances generated for the RPC functions "GetMode" and "SetPosition" with corresponding DataSchemas.

Output changes

The value passed in the -o flag is no longer the name and path of the output file (TD) itself. It should now be set to the directory where the Thing Description and the classification configuration json should be stored.

OK.

iomz commented 2 years ago

I'm splitting this super long thread.

iomz commented 2 years ago

This slice consists of tuples storing the parent message containing the field (pm), the field's type, which is also the name of the corresponding message (t), and the field's name (n).

Also an example?

The stored messages in form of DataSchemas for the previous example before the injection is done would look like this:

What do you mean by "the injection" in the first place?

iomz commented 2 years ago
{
    "Test": {
        "type": "object",
        "properties": {
            "test2": {
                "type": "integer"
            }
        }
    },
    "LinkedTest": {
        "type": "object",
        "properties": {
            "test1": {
                "type": "integer"
            }
        }
    }
}

The slice lm for this example contains one tuple {pm: "Test", t: "LinkedTest", n: "linkedTest"} to store the information that message "Test" has a field of type "LinkedTest" with name "linkedTest" which refers to another message. This information is sufficient to inject the DataSchema of "LinkedTest" into the field "linkedTest" in the message "Test" after all messages were parsed to DataSchemas. After the injection the data schemas look like this:

I believe the name "tuple" in your code is misleading. Also, the property names "pm", "t", "n", and several other variables severely degrade the readability of the code; it takes others a very long time to interpret what they are. Can you please give more semantically understandable names in your code?

iomz commented 2 years ago

What are the missing fields?

I described fields with non-scalar value types as missing fields at this stage, because when messages are converted to DataSchemas these fields are not added to the DataSchema right away. In this step, the "missing or left out fields" depend on other messages that may still have to be parsed, according to their position in the proto file. Those fields will be added in the next step. Therefore the slice of tuples (lm) contains the information about where and which fields have to be added to specific DataSchemas.

So, basically those message-type fields that need to be "filled out" later by your program? I think "missing" is misleading, as they are not actually missing from the original proto, but only during an intermediate step of your design. You see, if you use terms that are not defined anywhere, it can confuse others.

iomz commented 2 years ago

Also, please provide an example of the graph.

The following graph shows only messages and not data schemas. This is not 100% correct, as messages are transformed to DataSchemas before the insertion is done, but it's easier to read this way. image With this simple example containing 4 messages and 3 tuples in lm, there are different possible orders in which the tuples can be processed. Following the slice's order, the conversion process would look like this: Step1: Resolving message Test1: image Step2: Resolving message Test2: image Following this order, in step 1 a new entry had to be added to lm, as message Test1 holds a reference to Test2 and Test1 is copied and injected into Test; in total, four injections are necessary. The easier way is to start with the lowest level, i.e. a message that does not occur as parent message (pm) in the slice of references (lm): Step1: Resolving message Test2: image Step2: Resolving message Test1: image

So, by the "graph" you meant the "dependency" among the message fields? The parser starts with primitive messages that only contains scalar fields, then other messages that uses those primitive messages already scanned, and continue until all of the messages in the namespace can be linked to each other.

I think the generalization is a good thing to do, but my question is, with the given time limit, isn't it more useful for your thesis to test only on xArm7's proto file? (You are not creating a new efficient link resolution algorithm)

iomz commented 2 years ago

To cope with this problem, a message's name should be stored with its full path in the map of messages.

What is and why is the problem?

Considering the following messages:

message M1 {
  message M2 {
  }
  message M4 {
    M2 t = 1;
  }
}
message M2 {
}
message M3 {
  M2 t = 1;
}

For the field of type M2 in M3 the message M2 from the outer scope should be selected. For the field of type M2 in M4 the type should refer to the message M1.M2. The problem was that if the DataSchemas are stored in a map with only the message's name as index, there would be a conflict between M2 and M1.M2. Therefore the whole path of nesting has to be stored as the index, and if a field of a nested message refers to a message, the message with the correct name and the closest scope has to be selected. The term "problem" was not ideal; I should have used something like "circumstance".

Ok, I understood it as another wording issue.

"For the field of type M2 in M3 the message M2 from the outer scope should be selected" – selected by who?

"The problem was that if the DataSchemas are stored in a map with only the messages name as index, there would be a conflict between M2 and M1.M2" – now I understood why giving the whole path, but isn't it a corner case and also, simply a poor design of the proto that you don't really need to care about for now?

iomz commented 2 years ago

Does it scale? Or is it assumed that the source is strictly following a setter/getter naming convention? How does it look up? regex?

Actually the classification assumes that the source follows the setter/getter naming convention. I was already thinking about extending or modifying these check functions. It might be interesting to provide the characteristics dynamically, or to store the user's decisions when they change a classification, in order to identify common criteria. Maybe we could discuss these ideas in an issue.

OK, let's work on a separate PR. Why don't you already open an issue?

iomz commented 2 years ago

Oh, I'm very sorry, I now see that I forgot to include the test files. There are examples and test cases for this behaviour.

Please, see the guidelines for languages of your project. You tend to name everything in Pascal case (esp. file names), but ProtoBuf should be in snake case: https://developers.google.com/protocol-buffers/docs/style#file_structure

iomz commented 2 years ago

How are they scanned?

There is a map of properties that is filled during the walk through the RPC functions. As long as we assume that the RPC functions follow the naming conventions, we can assume that, for example, an RPC function setState would target a property State, and if there is an entry in the map for the key State, it is possible to check whether the setter fits the getter stored for this property.

Related to the above issue candidate for naming convention, I personally don't like this setter/getter name-based classification. As we are discussing it within this thread itself, what happens if one SDK implements APIs in a PascalName and another does it with snake_case? What does the first letter capitalization (or lower case) mean? Do we care about the taxonomy (e.g., State) in the code corresponding to an important concept? How do we maintain the semantic integrity between the different IDLs? ...

If you implement a rule in the code that depends on setter/getter names for properties, I suspect this would end up in unnecessary/unwanted abstraction during the IDL translation. To this end, I like the other two approaches (interactive/config-based).

iomz commented 2 years ago

Otherwise what will be included? Example?

If the -r flag is not set, the program will include every RPC function provided in a valid proto file. An example for a reduced configuration file would look like this:

[
  "GetMode",
  "SetPosition"
]

This forces the program to build a Thing Description containing only the affordances generated for the RPC functions "GetMode" and "SetPosition" with corresponding DataSchemas.

Why is it useful? Isn't it just better to filter those affordances AFTER generating a full TD by a simple JSON parser (i.e., separation of concerns)?

iomz commented 2 years ago

@S-Eibl When you are done with the changes, please request a review on the top right of this page. But since the code already includes changes that refer to many things at the same time, expect some time to review...

Please eliminate short-names from the code at least. I think those reassignments

    mN := m.Name
    p := m.Parent

are also not necessary and severely degrade the readability.