Enhancement draft: record / struct support in Structorizer

codemanyak commented 7 years ago

As already mentioned in other issues, a support for heterogeneous data structures with named components (called "record" in Pascal-like languages or "struct" in C-like languages) is proposed. The support shall comprise

- Executor,
- Code import,
- Code export.

The record concept is intended to work in a similarly incremental way as arrays do in Structorizer (i.e. you may add components at run time).

Access syntax should be similar to C and Java, i.e. component names are to be appended to the record variable name, separated by a dot, e.g.: date.month. Component names are to comply with identifier syntax (a sequence of letters, digits, and underscores, not beginning with a digit).
An assignment foo.bar <- 17 is only allowed ~~if variable foo hasn't existed before or~~ if it is a record variable already.
- ~~If variable "foo" hasn't been defined before, then an assignment foo.bar <- 17 is to create variable foo as a record variable containing an integer component bar.~~
- If variable foo had been existing as record variable then either the value of existing component bar is updated ~~or a new component bar is added~~.
- If foo is a constant then any assignment to a component is illegal.
Reading access to a component foo.bar is only legal if foo is a record variable with an (initialized!) component bar.
Initialization expressions might look like this (component names MUST always be given): today <- {year: 2017, month: 6, day: 26}. If it makes syntax analysis more feasible, a prefix keyword like record (EDIT: or rather the name of the defined type) might be prescribed (i.e. record{year: 2017, month: 6, day: 26}).
Such an initialization expression is one of the only two ways to establish a constant record, as in const beginOfEra <- {year: 1970, month: 1, day: 1}.
Now, for assignments among record variables (e.g. yesterday <- today), the following rules seem sensible:
1. The target variable will be overridden by the source variable, ~~no matter what structure or value it had had before~~.
2. Both variables refer to the same object afterwards (this ~~might get difficult to guarantee with the Executor interpreter, but it~~ should meet the same expectations the array assignment provides).
3. If the source variable is known to be constant then the target variable will NOT share its value but get a (mutable) copy.
4. If the target is an existing constant then the assignment is illegal.
5. If the target is a new constant then it will be set with an immutable copy of the source record. (This is the second of the only two ways to define a constant record.)
Records (maybe initializer expressions or variables) are not iterable and cannot be used as item suppliers of a FOR-IN loop.
Ideally, the components of a record may be of any type (scalar, string, arrays, records).

The internal implementation would sensibly be done by key-value pairs, i.e. as a map where the component names are the keys. This means the user cannot conclude the exact physical order and number of the components.
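
The assignment rules listed above can be illustrated with a minimal Python sketch. It assumes records are held as plain dicts; the helper names `make_const` and `assign` are hypothetical and do not reflect Structorizer's actual implementation. Copies are shallow here, so nested records are not covered.

```python
from types import MappingProxyType

def make_const(record):
    """Return an immutable copy of a record (rule 5); hypothetical helper."""
    return MappingProxyType(dict(record))

def assign(source):
    """Value that a non-constant target variable would receive from `source`.

    Rule 2: a mutable source is shared by reference.
    Rule 3: a constant source yields a mutable copy instead.
    """
    if isinstance(source, MappingProxyType):
        return dict(source)
    return source

today = {"year": 2017, "month": 6, "day": 26}
yesterday = assign(today)        # shared: both names refer to one object
yesterday["day"] = 25            # also visible through `today`

begin_of_era = make_const({"year": 1970, "month": 1, "day": 1})
start = assign(begin_of_era)     # mutable copy of the constant
start["year"] = 1980             # leaves begin_of_era untouched
```

Writing to a component of `begin_of_era` raises a `TypeError`, which corresponds to the rule that any assignment to a component of a constant is illegal.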

codemanyak commented 7 years ago

The decisive flaw of the draft outlined above is that it makes sensible code generation for typed languages practically impossible. Because a variable could be overwritten with a value of a completely different type, and because a subroutine might add further components to a record variable passed in as argument (call by reference!), there would be no chance to identify the exact record structure of a variable by syntactical (static) analysis. This is light years worse than with arrays. But what's the way out here? To force a strict explicit declaration with some record type definition, and to prevent any re-typing assignment and any runtime addition of components? Seems that way, unfortunately.

GitMensch commented 7 years ago

This sounds like a very good plan. And I think a "fixed" record layout (only available if explicitly declared) is both "structured" and helps Executor and the code generators.

GitMensch commented 7 years ago

> The internal implementation would sensibly be done by key-value pairs, i.e. as a map where the component names are the keys. This means you cannot conclude the exact physical order and number of the components.

With the changed rule "records need to be declared beforehand" this shouldn't be an issue any more when using a LinkedHashMap.
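
As a rough illustration of this point: an insertion-ordered map keeps the declared component order regardless of how an initializer lists its values. The sketch below uses a Python dict (insertion-ordered since Python 3.7) as a stand-in for Java's LinkedHashMap; `DATE_COMPONENTS` and `new_record` are made-up names for this example.

```python
# Declared component order of a hypothetical record type "Date".
DATE_COMPONENTS = ("year", "month", "day")

def new_record(declared, **values):
    """Create a record whose keys follow the declared order.

    Components missing from the initializer are stored as None.
    """
    return {name: values.get(name) for name in declared}

# The initializer order differs from the declaration order.
today = new_record(DATE_COMPONENTS, day=26, year=2017, month=6)
print(list(today))  # ['year', 'month', 'day']
```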

codemanyak commented 7 years ago

Once it is clear that a declaration is inevitable, the next question is how type compatibility is to be defined: by name or by structure? With a named type compatibility approach (like in Pascal) we must introduce type definitions.

Without (named) type definitions, i.e. with a merely structural approach (like in COBOL), complex types will have to be constructed for every variable again and again. And not even C allows an assignment between variables of unnamed struct types, even if they are structurally congruent. So this is not actually a viable alternative to type definitions: On export, it's no problem to derive code that doesn't know type definitions (like COBOL) from type definitions, in contrast to the other way round. What would be the implications of a type equivalence model by name for Structorizer, now?
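
The difference between the two equivalence models can be sketched in Python as follows. The record types `Date` and `Version` and both comparison helpers are invented for this illustration only; they are not part of Structorizer.

```python
from dataclasses import dataclass, fields

@dataclass
class Date:
    year: int
    month: int
    day: int

@dataclass
class Version:
    # Structurally congruent to Date, but a different named type.
    year: int
    month: int
    day: int

def same_by_name(a, b):
    """Name equivalence: both values must belong to the same named type."""
    return type(a) is type(b)

def same_by_structure(a, b):
    """Structural equivalence: same component names and types, in order."""
    def signature(record):
        return [(f.name, f.type) for f in fields(record)]
    return signature(a) == signature(b)

d = Date(2017, 6, 26)
v = Version(2017, 6, 26)
```

Under name equivalence an assignment between `d` and `v` would be rejected, although `same_by_structure(d, v)` holds.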

We would not allow unnamed record declarations.
Components of a record type may be untyped but only if the respective components don't have record structure.
If procedure parameters may have user-defined types then type definitions must either be locatable above Root level (i.e. outside diagrams) or in includable diagrams (cf #389).
A location above Root level (Structorizer setting?) leaves the problem unsolved where to store this information between sessions.
If the latter approach (in Includables) is preferred therefore, then the import (include) mechanism would have to be changed radically: Instead of being done by modified CALL elements on execution, includes would have to be statically attributed to the ROOT itself such that their impact may involve the routine signature. Instead of placing import CALLs at the beginning of or somewhere else in a diagram, the Root element would have to be equipped with an editable list of diagram names to be included. Possibly this would always have been the better approach for #389, anyway.
The next question is whether type definitions are to be special instruction elements or also attributes of Root elements. Here again the attribution to Root elements seems to be the cleaner approach. But in contrast to includes, the scope of type definitions would not include the routine signature; they are assumed to be local.
This induces the next decision: should type definitions just be placed in the text area of Root elements (just in lines following the signature) or in a new additional text attribute? I propose a new attribute, which of course also requires a new editor design. This also has backward compatibility in mind because until now, line breaks in Root texts (i.e. even within signatures) have been regarded as irrelevant whitespace.
The drawing of Root elements would then have to consider both the include list (to be placed above the signature?) and the type definitions. Maybe a Structorizer setting could be introduced to suppress the presentation of this extra information.

What would be the consequences for Executor?

Qualified variable names are only valid if the path can be verified according to a previous declaration and the associated record type definition.
Assignments among variables are only valid if none of them has record structure or both are declared to the same (named) record type. An assignment of a record variable is by reference if none of the involved variables is a constant.
Initializer expressions do not define record types. They must name the components (as in the examples of the issue description), the order of the component doesn't play a role, it's not necessary that all components be initialized, but no undeclared component name may occur.
Reading access to a declared component is legal; if the component was uninitialized then the result will be undefined (possibly null).
The handling of constant records is as specified in the first discarded draft.
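
A minimal sketch of how these Executor rules might be checked, assuming a record carries a reference to its named type. The classes `RecordType` and `Record` and the functions `read` and `assignable` are hypothetical and only mirror the rules stated above.

```python
class RecordType:
    """Named record type with a fixed set of declared components."""

    def __init__(self, name, components):
        self.name = name
        self.components = tuple(components)

    def initialize(self, **values):
        """Build a record from a named initializer.

        Order is irrelevant and components may be omitted,
        but undeclared component names are rejected.
        """
        unknown = set(values) - set(self.components)
        if unknown:
            raise KeyError(f"undeclared in {self.name}: {sorted(unknown)}")
        return Record(self, {c: values.get(c) for c in self.components})


class Record:
    def __init__(self, rtype, values):
        self.rtype = rtype
        self.values = values


def read(record, component):
    """Reading a declared component is legal; uninitialized yields None."""
    if component not in record.rtype.components:
        raise KeyError(f"{record.rtype.name} has no component {component}")
    return record.values[component]


def assignable(target, source):
    """Record assignment requires the same named type on both sides."""
    return target.rtype is source.rtype


Date = RecordType("Date", ["year", "month", "day"])
today = Date.initialize(day=26, year=2017)   # month stays uninitialized
```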

GitMensch commented 7 years ago

Because the type definitions may become quite a long list, and because someone may want to "outsource" or remove some of them easily (or even keep alternatives during the design phase and deactivate them by disabling their instructions): please leave the type definitions as instructions. The same instructions would be usable in IMPORT diagrams and normal (sub)programs and may be mixed with "normal" declarations (otherwise all kinds of declarations would have to be moved, which isn't even possible because of backward compatibility...). The rest sounds good.

codemanyak commented 7 years ago

For the type definitions, I propose the following syntax variants, which are close enough to Pascal, C, and Basic (with "type" as a keyword similar to "var" and "as") and to the parameter lists of subroutine diagrams (untyped components might be tolerated while they are not meant to be used for records themselves):

type MyType = record{ comp1, comp2: int; comp3: double; comp4 }

or

type MyType = record{ comp1, comp2 as int; comp3 as double; comp4 }

or

type MyType = struct{ int comp1, comp2; double comp3; comp4[;] }

The above specifications are to be regarded as equivalent. The semicolon after the last component (required in C and Java) should be optional here.
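
To make the claimed equivalence concrete, here is a small Python sketch of a parser that maps all three variants to the same component list. It is an illustration under simplifying assumptions (type names are single identifiers, so array types are not handled) and is not Structorizer's actual parser; `parse_type_definition` is a made-up name.

```python
import re

HEADER = re.compile(r"type\s+(\w+)\s*=\s*(record|struct)\s*\{(.*)\}\s*$")

def parse_type_definition(text):
    """Return (type name, [(component, type or None), ...])."""
    match = HEADER.match(text.strip())
    if match is None:
        raise ValueError("not a record type definition: " + text)
    name, keyword, body = match.groups()
    components = []
    for group in body.split(";"):
        group = group.strip()
        if not group:
            continue  # tolerates the optional trailing semicolon
        if keyword == "record":
            # Pascal/Basic style: names first, then ":" or "as" and the type.
            parts = re.split(r"\s*:\s*|\s+as\s+", group, maxsplit=1)
            names = parts[0]
            ctype = parts[1].strip() if len(parts) == 2 else None
        else:
            # C style: an optional type identifier precedes the names.
            typed = re.match(r"(\w+)\s+(.+)$", group)
            if typed:
                ctype, names = typed.group(1), typed.group(2)
            else:
                ctype, names = None, group
        for comp in names.split(","):
            components.append((comp.strip(), ctype))
    return name, components

variants = [
    "type MyType = record{ comp1, comp2: int; comp3: double; comp4 }",
    "type MyType = record{ comp1, comp2 as int; comp3 as double; comp4 }",
    "type MyType = struct{ int comp1, comp2; double comp3; comp4; }",
]
parsed = [parse_type_definition(v) for v in variants]
```

All three entries of `parsed` come out identical, with `comp4` recorded as untyped.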

GitMensch commented 7 years ago

I'm a little bit confused. Why add a type definition which declares a record? I think the following would be usable.

A) plain type definition:

type MyType = { int comp1, char[55] text; double comp3; field[;] }

B) declaring a record using this type definition (not initialized / initialized; maybe removing struct/record completely [implied by using a type] and use "var" there, too):

var somevar as int = 55
struct MyType MyStruct
struct MyType MyStruct2 {somevar, "123", 123456489, "some string"}

C) declaring a record with internal type definition only applied to this record (not initialized / initialized; maybe removing struct/record completely [implied by using a type] and use "var" there, too):

var somevar as int = 55
struct MyStruct { int comp1, char[55] text; double comp3; field[;] }
struct MyStruct2 { int comp1 = somevar, char[55] text = "123"; double comp3 = 123456489; "some string"[;] }

...

Not sure about using type at all, maybe just using record and struct?

codemanyak commented 7 years ago

@GitMensch

> A) plain type definition: type MyType = { int comp1, char[55] text; double comp3; field[;] }

This is exactly the same as the third syntax variant of my proposal, just without the keyword record or struct (which I think are useful for readability and instant understandability) and with a wrong separator between the first two components (should be a semicolon, since they are of different types).

> B) declaring a record using this type definition (not initialized / initialized; maybe removing struct/record completely [implied by using a type] and use "var" there, too):
>
> var somevar as int = 55
> struct MyType MyStruct
> struct MyType MyStruct2 {somevar, "123", 123456489, "some string"}

My proposed (and, in the first case, already implemented) syntax would be:

var somevar as int <- 55
var myStruct: MyType

or

var MyStruct as MyType

as mere declaration, and

MyType myStruct2 <- {comp1: somevar, text: "123", comp3: 123456789, field: "some string"}

or

var myStruct2: MyType <- {comp1: somevar, text: "123", comp3: 123456789, field: "some string"}

in case of a typed assignment or initialised declaration. If the variable has already been declared, then a simple assignment would look like this:

myStruct2 <- {comp1: somevar, text: "123", comp3: 123456789, field: "some string"}

I don't regard it as helpful if every reference to a record/struct type must be marked with a struct keyword (like in original C). I rather cling to the more general idea of a defined type name that may mean anything (a record, an array, some scalar stuff like in Pascal, C++, Java etc.) as long as it has been defined properly.

> C) declaring a record with internal type definition only applied to this record (not initialized / initialized; maybe removing struct/record completely [implied by using a type] and use "var" there, too)

Okay with the replacement of "struct" by "var" in THESE positions (see B). But I thought I had made clear that this (very C-like, besides) implicit declaration style (anonymous types) may work in Executor but would confront the code generators with unsolvable type tracking problems. The problem arises with subsequent assignments among variables, parameter passing etc., which cannot actually be tracked through alternatives, loops and subroutines by the generators or would require structure comparison with weak component types. To allow these syntactic variants would mean to prohibit assignments of the entire variable to another one, because otherwise the type would no longer be confined to the original record. I would not like to impose such a difficult-to-understand "underprivileged" kind of variable. Your last example looks convenient but melts together variable declaration, implicit type definition, and component initialization. It is not even allowed in C, by the way. (Structorizer should not end like COBOL in my eyes.) I would like to adhere to more separate concepts like the ones known from Pascal. I think Pascal is a good guideline for structograms. I simply wanted to avoid an END keyword and modified the record type definition draft by using braces instead.

GitMensch commented 7 years ago

I see. Did you commit the code already?

codemanyak commented 7 years ago

No, I haven't. I'm still working on the necessary changes of the type map design. And I couldn't spend so much time on it since I had got a pile of other work to do, recently. But I'll continue as soon as possible.

GitMensch commented 7 years ago

@codemanyak Is there an update in sight? Is the target "finish Milestone 3.27 before your vacation" still there (and does it include a roughly working import of COBOL sources)?

codemanyak commented 7 years ago

I still hope so.

codemanyak commented 7 years ago

A first prototype supporting record types involves Executor and Analyser (including Import), possibly with some flaws or bugs. It has just been committed and can be tested. I changed the record initializer syntax specification such that the type name is to be used as immediate prefix of the opening brace. Generators and Parsers are going to be addressed next. Here are some example diagrams.

typedeftest423 typedeftest423a

DateImport.zip should be renamed to DateImport.arrz: DateImport.zip

codemanyak commented 7 years ago

After some corrections, a new, record-based version of a binary search tree program is working (rename file "BinSearchTree423.zip" to "BinSearchTree423.arrz" in order to load it):

BinSearchTree423.zip

The figures show two of the four contained diagrams to give an impression:

binsearchtreenode423 showbinsearchtree423-1

GitMensch commented 7 years ago

Looks quite fine. Do you see anything other than the code-generators/parsers to do for being able to close this issue?

codemanyak commented 7 years ago

Well, to be honest: I haven't managed to code this enhancement as cleanly and modularly as I think it ought to be. The need to integrate it into the existing code induced some foul compromises. At some point it will be necessary, though, to fundamentally redesign the many ad-hoc syntax analysis patches as a well-structured low-redundancy syntax toolbox. That would have been too big an issue for now. So it's likely that various irregularities, limitations, and deficiencies will show in practical use. Apart from that, it's indeed mostly the code generators / parsers that are to be done. I will work on them one by one this month, starting with Pascal and C. COBOL parser / generator are likely to be the last ones I'll pick.

codemanyak commented 7 years ago

Code generator tasks (Pascal accomplished today):

[x] Pascal generator
[x] C generator
[x] C++ generator
[x] C# generator
[x] Java generator
[x] Oberon generator
[ ] PHP generator
[x] Python generator
[ ] Perl generator
[x] Bash generator (?)
[ ] KSH generator (?)
[ ] COBOL generator

Code import tasks:

[x] Pascal parser
[x] C parser
[x] COBOL parser

codemanyak commented 7 years ago

My first Pascal export draft didn't dare to produce structured constant definitions and made some efforts to circumvent them. So this can be simplified a lot.

codemanyak commented 7 years ago

Pascal export revised, now converts structured Structorizer constants to structured Pascal constants.

codemanyak commented 7 years ago

Adaptation of CGenerator done (first approach).

codemanyak commented 7 years ago

Adaptation of C++ generator, CGenerator and Explorer (FOR-IN loops over arrays of records) mended.

codemanyak commented 7 years ago

Java export enhanced for record types, several minor fixes with input and output instruction export, and declaration handling.

codemanyak commented 7 years ago

CParser enabled to cope with most typedefs and struct definitions. CGenerator handling of struct types also revised such that export and import now work in a complementary way.

codemanyak commented 7 years ago

Python generator enabled to export record type definitions and record initializers.

codemanyak commented 7 years ago

Type definitions and variable declarations as well as variable access via fully qualified names are now generated based on CobTools. Still not addressed is the association of array indices with the correct hierarchy level for the accessor and assignment strings.

codemanyak commented 7 years ago

Oberon generator enabled to export records.

GitMensch commented 7 years ago

@codemanyak I'm a little bit puzzled by the generation and the "%num" parts - What do they mean?

Sample:

       01 ZV-REC-IO.
           03 ZV-stuff.
               05 ZV-stuff-a PIC 9(08).
           03 ZV-DATA-IO.
              05 ZV-GEMKZ    PIC X(05).
              05 ZV-ART      PIC 9(01).
              05 ZV-INSTITUT PIC 9(01).
              05 ZV-ZW       PIC 9(02).

           move WS-GEMKZ to ZV-GEMKZ
           move 3 to ZV-INSTITUT
           move 1 to ZV-ART

with the result

[screenshot]

codemanyak commented 7 years ago

@GitMensch The index placeholders "[%1]" shouldn't be there in this example. They are only to be generated if an OCCURS clause occurred or an index variable was associated. They are then intended to be matched against index or subscript expressions. So they should never be seen by a user's eye. Where is your code snippet from?

GitMensch commented 7 years ago

The code snippet is from real code and was generated from the bugfix branch. Can you reproduce it with this part only (if not, I'll try to get a minimal reproducible version tomorrow)?

codemanyak commented 7 years ago

Please note that the malfunction you reported here is a mere COBOL import issue (so it's misplaced here).

Moreover, I can't reproduce it without additional context (I used the following code and obtained a sensible diagram):

TEST423_badIndices.cbl test423_bad_index

Though until next week I will hardly find any time to work on these issues.

GitMensch commented 7 years ago

I've created an independent issue with sample code. Can you please check if you can reproduce it?
