fesch / Structorizer.Desktop

Structorizer is a little tool which you can use to create Nassi-Schneiderman Diagrams (NSD).
https://structorizer.fisch.lu
GNU General Public License v3.0

Add import from COBOL #354

Closed GitMensch closed 6 years ago

GitMensch commented 7 years ago

I've found minimal documentation on how to do the export - but is there programmer's documentation on how to add a new programming language to the export?

It looks like import is currently only possible from Pascal, so I assume that import is even harder to implement, is it?

codemanyak commented 7 years ago

Well, even the export functionality has become pretty complex meanwhile. I might try to formulate a more or less simplified programmer's guide, but that will take a little while. The import feature is indeed still a lot harder, in particular if the source language knows a lot of special syntactic or functional gimmicks which cannot easily be transformed into the fundamental element types. Pascal is more or less straightforward, but still requires a good parser which works over a given grammar. But just imagine all the things a programmer might do with the C-typical FOR loops or with pointers etc. And some aspects, like complex data type constructions, must be ignored even on import from Pascal. So with other languages there will just be more losses. Within certain limits it might still work, though. Do you have specific languages in mind? Possibly we might do it as a "joint venture"?

GitMensch commented 7 years ago

I thought about adding a COBOL im/export function. Obviously an import always has some limits, but as I know of two different commercial tools doing "COBOL code to something" (one of them creating an NSD), I see no reason why it shouldn't be possible. A "joint venture" sounds reasonable: I should be able to do the parsing parts if I have a documented sample for another language and someone helping with the "syntactic sugar".

codemanyak commented 7 years ago

COBOL, the old dinosaur? Well, why not... Just to start with: the code import is currently based on (an older version of) the GOLDParser, configured with a Pascal grammar. So you can get everything you need to know to set up a viable COBOL grammar from the documentation there. The parser class deriving the NSD structure from a Pascal parse is lu.fisch.structorizer.parsers.D7Parser (which has been pimped up with some more general Structorizer configuration stuff for historical reasons).

For the generator, I started to write a "howto". You will see the rough outline is rather straightforward. The more intriguing details (sections 4 through 7) are still to come, though. The howto file will eventually be placed in the source tree under ...structorizer.generators as howto.txt. So you could already start with the fundamental stuff and then pass it over to me for the fine tuning of the export-option-aware details and to find the best way to perform certain tricky syntax conversions. I guess the COBOL expertise is your part, so I might ask you for specific aspects then?

codemanyak commented 7 years ago

Oh, I just saw that the GOLDParser project already offers a COBOL 85 grammar for download... This should make things a lot easier.

GitMensch commented 7 years ago

Specific export aspects: so far I only see the includes; they should be generated as COPY + includename + . and may need a GUI entry. But those includes contain either variables - which I assume have to be defined in the NSD anyway - or program code, which needs to be defined in the NSD to be referenced, so this may not be that useful an option. If it is added and it is possible to include any copybook, it may be useful to say where in the generated source they are placed. Something like "WS: include1, include2; LS: include3; PD: include4". Thoughts?

I suggest starting with the export since you have the howto already (and yes: placing it in the source tree is definitely a good idea). Can you do the first import parts in the meantime and tell me where to kick in?

GitMensch commented 7 years ago

BTW: COBOL 85 may be a nice start, but apart from the multitude of extensions that each compiler has, there were new COBOL standards in 2002 and 2014 (this dinosaur is actually quite alive ;-)

I have no clue how to get from the grammar file grm to the compiled cgt - do we actually need this for Structorizer?

In general: there are more GOLDParser grammars available - would they allow an import, or won't it be that easy?

codemanyak commented 7 years ago

I have no clue how to get from the grammar file grm to the compiled cgt - do we actually need this for Structorizer?

For the downloadable grammars, the cgt file is already contained. For self-written or modified grammars the GOLDbuilder (available as a command-line executable or as part of an application) would do the conversion. With the compiled grammar tables, another command-line tool, GOLDprog.exe, creates an engine skeleton in a choosable target programming language using a template. This works quite fine. Then the difficult part begins: one must make sense of the reduction steps of the parser and compose Structorizer elements out of them. I haven't tried this (but I'm going to now), i.e. I don't know for sure whether the program skeleton created by the current conversion tool can work with the GOLDParser version incorporated in Structorizer or if we need to replace the latter with a newer one. I guess I will know soon. But I'll start with the C language. There I should know what I'm doing...

In general: there are more GOLDParser grammars available - would they allow an import, or won't it be that easy?

There we go.

codemanyak commented 7 years ago

C parsing works. I will just have to generate the NSD from it. When this is done, the next step will be to adapt the Java engine template to Structorizer's needs such that further import projects may start with an established workflow where we can concentrate on the mere diagram synthesis from the reduction tree. I'm optimistic.

GitMensch commented 7 years ago

Sounds good - I suggest tracking "Add import for more languages" in this issue (which will mostly be done by you, I'll assist with COBOL if needed) and creating a new one for "export to COBOL" where I'll do the work with your assistance.

Are you OK with this?

codemanyak commented 7 years ago

Okay with me.

codemanyak commented 7 years ago

An early prototype with ANSI-C import can be downloaded from codemanyak/Structorizer.Desktop master...

GitMensch commented 7 years ago

A nice enhancement would be drag-and-drop support for import - drag and drop of "nsd" files works already, but dropping a C file (I assume the same happens with Pascal) onto the GUI just doesn't do anything.

codemanyak commented 7 years ago

The uploaded version was also defective on do-while loops and switch statements. I'm just fixing it. I will then look into the const problem. It would be helpful if you could upload the two C files for me. Astonishingly, the grammar contains the keyword const as a modifier and char as a type - surprising because otherwise the grammar rather reflects a very early C version (more or less 1973 code). W.r.t. the drag and drop support: in theory it should have worked for Pascal files, and I will definitely try to make it work for all file extensions registered with the import plugins.
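
For illustration, per-extension drop handling in Swing could look roughly like this (a minimal sketch only, not Structorizer's actual code; openNsdFile, importSourceFile and the extension set are hypothetical placeholders for whatever the import plugins register):

import java.awt.datatransfer.DataFlavor;
import java.awt.dnd.DnDConstants;
import java.awt.dnd.DropTarget;
import java.awt.dnd.DropTargetAdapter;
import java.awt.dnd.DropTargetDropEvent;
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.swing.JComponent;

class ImportDropSupport {
    // Hypothetical: extensions collected from the registered import plugins
    private static final Set<String> IMPORT_EXTENSIONS =
            new HashSet<>(Arrays.asList("pas", "c", "cob", "cbl"));

    /** Attaches a drop handler that routes dropped files by their extension. */
    static void install(JComponent target) {
        new DropTarget(target, new DropTargetAdapter() {
            @Override
            public void drop(DropTargetDropEvent evt) {
                try {
                    evt.acceptDrop(DnDConstants.ACTION_COPY);
                    @SuppressWarnings("unchecked")
                    List<File> files = (List<File>) evt.getTransferable()
                            .getTransferData(DataFlavor.javaFileListFlavor);
                    for (File f : files) {
                        String name = f.getName().toLowerCase();
                        String ext = name.substring(name.lastIndexOf('.') + 1);
                        if (ext.equals("nsd")) {
                            openNsdFile(f);          // hypothetical existing NSD handler
                        } else if (IMPORT_EXTENSIONS.contains(ext)) {
                            importSourceFile(f);     // hypothetical: run the code import
                        }
                    }
                    evt.dropComplete(true);
                } catch (Exception ex) {
                    evt.dropComplete(false);
                }
            }
        });
    }

    private static void openNsdFile(File f)      { /* placeholder */ }
    private static void importSourceFile(File f) { /* placeholder */ }
}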

codemanyak commented 7 years ago

Ah, thank you for the source files. In a way you were right with the const problem. The grammar doesn't accept const modifiers in function result types: you may declare variables with const, but not functions. To the parser logic, the parenthesis therefore appeared to be the wrong symbol. With the first piece of code the problem was that only one modifier is allowed in front of the variable type: either static or const, not both. switch and do while work pretty fine now (I will upload the new version soon), but the next show-stopper is that the grammar doesn't know initialization expressions with braces. I had written a makeshift workaround for one-level single-line initializers, but the nested multi-level initializers for structs and arrays kill the parsing. Unfortunately the old parser engine, the source code of which is part of Structorizer, doesn't cope with modified compiled grammars, even though I exported them into the old v1.0 cgt format. Maybe we will have to replace the GOLDparser version with the most recent release, but this is only available as a jar file. I haven't found a source repository so far.

codemanyak commented 7 years ago

Bugs wrt CASE elements (switch) and REPEAT loops (do while) mended. Eventually I found the GOLDParser version by Ralph Iden on GitHub. Will test against this. Would be fine if we could enhance the grammar to achieve a more practical language set.

GitMensch commented 7 years ago

Yes, changing a grammar is obviously a much better option than fixing the resulting problems via workarounds (pre-parsing before running the grammar, partly adding information back afterwards). What GitHub repository do you refer to?

codemanyak commented 7 years ago

What GitHub repository do you refer to?

Ralph Iden's GOLDEngine

GitMensch commented 7 years ago

Question: do the current parsers, or Structorizer's import in general, need a full program? It would be nice to be able to just import a single C function (even if the resulting NSD wouldn't be complete, as all includes and static vars would be missing). If needed, I can tweak the COBOL import to allow importing single sections once the general import is ready.

codemanyak commented 7 years ago

The code import should work for single functions, too, and also for files containing several functions. More exactly: whether the import parsers of Structorizer allow the import of single functions depends on the grammar they rely on. The Pascal grammar only allows importing PROGRAM, UNIT, PACKAGE, or LIBRARY files because the grammar start symbol says so. Unfortunately I can't make effective changes to the D7Grammar because it contains so many ambiguities that the build process (the grammar table compilation) with the GOLDbuild tool doesn't work - not even for the original grammar, neither with GOLDParser version 1.0 nor 5.0. So we will have to stick to the legacy compiled grammar file as is. Consequently, I had to write a little workaround in the file preprocessing: it simply embeds the bare function in a dummy unit, such that the import works. In C it's no problem at all since functions, procedures, and programs are syntactically all the same. The only postprocessing task is to identify the main routine in order to make a program diagram out of it.
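
For illustration, such a preprocessing step could look roughly like this (a minimal sketch only - the detection logic and the dummy-unit template are assumptions, not the actual D7Parser code):

/**
 * Sketch of the described trick: if the Pascal source contains only a bare
 * function/procedure, embed it in a dummy UNIT so that the grammar's start
 * symbol (PROGRAM/UNIT/PACKAGE/LIBRARY) can be reached at all.
 */
public static String embedInDummyUnit(String sourceCode) {
    String lower = sourceCode.toLowerCase();
    boolean hasTopLevel = lower.contains("program ") || lower.contains("unit ")
            || lower.contains("package ") || lower.contains("library ");
    if (hasTopLevel) {
        return sourceCode;                 // already a complete compilation unit
    }
    return "unit DUMMYUNIT;\n"
         + "interface\n"
         + "implementation\n"
         + sourceCode + "\n"
         + "end.\n";
}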

GitMensch commented 7 years ago

Yes, C programs/libraries are very similar to a function - they just possibly gather many functions. A program/library can start with anything (most times an #include or a comment), with a program containing a function (normally not the first one) named main (or WinMain).

Question: if you have multiple functions in a file (for example the linked, relatively simple cobcrun.c), do you build multiple NSDs (one for the program including the static variables and the main, one for each function included), maybe even auto-insert them in the arranger? This would be marvelous!

And yes: Importing a single function makes no big difference. In any case some of the variables are likely hidden within a #include.

But if you want to only convert a single block instead of a function you may start with assigning a value directly, or with a CASE. Would this still work?

For COBOL it likely won't (the language has a quite strict header with separate code DIVISIONs), but I'm sure I can make it work (it actually works with the GnuCOBOL compiler when relaxed mode is on and, obviously, no variables are used - but that part can be ignored when parsing an incomplete piece of code into an NSD).

codemanyak commented 7 years ago

Question: if you have multiple functions in a file (for example the linked, relatively simple cobcrun.c), do you build multiple NSDs (one for the program including the static variables and the main, one for each function included), maybe even auto-insert them in the arranger? This would be marvelous!

This is exactly the way it works. Hence it IS marvellous. :smile:

But if you want to only convert a single block instead of a function you may start with assigning a value directly, or with a CASE. Would this still work?

No, it wouldn't. To allow this you would either have to enclose it in a dummy function or provide a separate parser with a reduced grammar. Btw., on export you will also automatically get an enclosing program or function in the code. The GOLDparser will not produce a syntax tree (not even a partial one) if the reductions don't lead to the sentence symbol. It will just fail.

GitMensch commented 7 years ago

What do you think about doing a minimal check before the GOLDparser is started and auto-inserting the code into a dummy function when necessary? I'm really keen to see the C import working; just to get more realistic: can you please guess how long you think you'll need for this?

codemanyak commented 7 years ago

That was easy for Pascal but is (more) difficult for C. You can have lots of global stuff before you get to a function definition. And one would have to parse, or at least check, all these lines to be sure there is no function definition. I'd rather postpone this.

GitMensch commented 7 years ago

I'm fine with postponing this, as the user workaround of wrapping the code in a minimal function is not nice but quite easy to do - just mention it in the doc.

codemanyak commented 7 years ago

Well, I think I can present a very advanced C import now (branch codemanyak/Structorizer.Desktop/master). (Maybe some project configuration files will have to be adapted to get it running.) The parsing error display is slightly improved and got a button to copy its content to the clipboard, as requested. Here is a tested C source derived from your example with only slight modifications such that it passes the syntax check. I enhanced the C grammar a little to allow array/struct initializers (the ones with braces) and a single void in parameter lists. One strange limitation of the grammar I haven't managed to lift so far: it does not accept user-defined type names!

cobcrun_Issue354.zip

In the code (parsers subfolder), you will also find the generated skeleton for the COBOLParser.

GitMensch commented 7 years ago

Thank you very much! The parsing error display has indeed improved quite a bit 👍

[...] (edited out later when moved to #409)

I'll likely start inspecting the COBOL parser more next week; these are the most important questions so far (I hope to see the answers end up in src\lu\fisch\structorizer\parsers\howto, too :-) )

codemanyak commented 7 years ago

As you seem to be involved in COBOL development, I assume you are familiar with grammars, parsers and compilers.

What documentation do you use for the GOLDparser grammar format?

All the information you need for GOLD Parser can be found on the GOLDParser webpage, though navigation is not quite straightforward there. The GOLDParser grammar (grm file) is written in EBNF. The preceding lines defining the lexical rules are (in line with the Chomsky type 3 nature of the lexical level) a sort of regular expression. The documentation is around here. Unlike with yacc, the grammar rules are quite legible. A child could formulate a generating grammar without difficulties. Unfortunately it's not about language generation (where multiple syntax trees for the same sentence are no real problem) but about parsing. The rules of the grammar must adhere to certain, not so obvious conventions such that a stack machine can make a clear decision on every token read. This means we must read our rules bottom-up and understand why and where the stack machine might get lost in deciding which way to go. One peeked token, ideally a terminal (look-ahead 1, remember), should be sufficient to disambiguate the input. To achieve this by design of the grammar is anything but trivial, even for very compact languages. A single - apparently obvious - modification may turn out obnoxious and we get drowned in conflict and error messages. Well, back to your questions.

How are tokens in the grammar translated to the Java constants for the parser?

First the grammar rules (the grm file) are inverted by a simple command-line tool (GOLDbuild), which constructs the decision tables for the parser from them. This results in the egt file (if we are lucky; otherwise we get dozens of conflict reports and no output file). The states and patterns in the tables are still linked to the rules. The terminals and rules are associated with table indices. The second command-line tool, GOLDprog, derives from the tables and the rule structures or token names more or less mnemonic constants in the target java file. This is where the template file comes in: it is the Java skeleton of the intended parser class, where specific markers specify the syntax according to which the references to the tables are generated into the code.

What do we (manually?) need to change in the language specific parser after we changed and compiled the grammar? I guess it is only about adjusting the SymbolConstants and RuleConstants, correct (where do we get them from)?

A change in the grammar means repeating the entire process xyz.grm -> xyz.egt -> xyz.java. This only takes seconds. The other good news is that most of the constants will keep their names (not necessarily their values!) if the corresponding rules didn't change, since the constant names follow the rule structure, typically the head and the contained terminals. The indices may change widely, but as good programmers we don't use hard-coded literals, so little grammar changes require only little code changes. Practically, in most cases it is indeed sufficient to copy the two blocks of constants from the re-generated GOLDprog output (xyz.java) into the meanwhile grown structogram synthesizer XYZParser.java (no need to overwrite it and start from scratch again!). Did this shed some light on it?
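
For illustration only, such a generated constants block and its use in the diagram synthesis might look roughly like this (the identifiers, index values and method shape are invented for the example; the real names are whatever GOLDprog derives from the grammar rules):

// Sketch only - identifiers and index values are invented for illustration.
interface RuleConstants {
    int PROD_FUNCDECL_LPAREN_RPAREN       = 42;
    int PROD_WHILESTM_WHILE_LPAREN_RPAREN = 57;
}

class DiagramSynthesizerSketch {
    // Dispatch on the symbolic rule constants rather than on raw indices, so a
    // regenerated grammar only requires replacing the constants block above.
    void buildElementFor(int ruleTableIndex) {
        switch (ruleTableIndex) {
        case RuleConstants.PROD_WHILESTM_WHILE_LPAREN_RPAREN:
            // create a While element from the condition sub-reduction
            break;
        case RuleConstants.PROD_FUNCDECL_LPAREN_RPAREN:
            // open a new (sub-)diagram for the function
            break;
        default:
            // descend into the child reductions
            break;
        }
    }
}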

The tools GOLDbuild and GOLDprog may be obtained from the Download page, at least as binaries for Windows. (This was good enough for me; a package for Linux / Mac etc. seems not so easy to obtain.) On the Builder documentation page you find a short synopsis of how to work with them. It's pretty straightforward, but although told otherwise, the results weren't usable with the old GOLD Parser version formerly used in Structorizer.

How do we add a completely new grm->egt + parser (I guess adding the grm, then doing the compilation, then copy the templates, then do the adjusting mentioned before [and add the code of course])

This is quite straightforward. Place the egt file in the structorizer/parsers directory (it is a resource that will be loaded by the parser base class, therefore the reference in the code ("%Name%" in the template) will have to be adapted or maintained). The parser file goes in the same folder, and for completeness, auto-documentation and optimal maintenance the grammar files should also reside there. What's left is to register the parser class in structorizer/gui/parsers.xml.

Should the preprocessing resolve includes (in the case of COBOL COPY file. statements) from a given list of directories (something a language specific preparser normally does)?

This depends. Resolving #includes in C would mean copying large chunks of syntactically tricky header files into the code, which would rather not improve anything. The parser doesn't check whether certain identifiers are declared; it's quite sufficient to detect that it's an identifier. So why bother and worsen the situation? With #define directives it's different: without resolving them, the code may be syntactically compromised. Hence, as a first step, I resolve simple defines, not considering macros. This may follow later. I just drop all preprocessor lines in the preparation phase. (So I'm a little puzzled how a preprocessor directive like #ifndef could irritate the lexer - it should never even see it at all! The grammar states explicitly that preprocessor directives are not subject of the grammar - nor are they for C compilers: the preprocessor is supposed to have resolved them all when it passes the file over.)

How to set rules for the import from "outside"? The main two parts are: options for the preprocessing (in the COBOL case it would be: options for reference-format [fixed-form or free-form, for the former: code area start/end - everything before/after is kind of a comment], inline comment marker [there are vendor specific extensions...] and in general the directory list for includes mentioned before)

I'm not familiar with COBOL, so this sounds really complicated to me - and may actually be. So it will definitely be a good idea to start with a language subset (it seems that COBOL 85 is fit for GOLD Parser) and then to extend it in small steps.

Last but not least, your error.syntax example - it is quickly explained: if you follow the function declaration rule (<Func Decl>), then you go via <Func ID> to <Type> or void or ID. In contrast to variable declarations, no modifier (<Mod>) is involved when you further descend on <Type>. And it doesn't make much sense to declare a function prototype as external; usually you will obtain it from a header. const doesn't make much sense either, since Kernighan/Ritchie didn't even know a keyword const. static is also of relatively little use for functions, they are global anyway. Nor do register, auto, and volatile make much sense in front of a function definition. Hence sacrificing them may have seemed a good chance to keep language-inherent ambiguity at a low level, which otherwise might quickly spoil the grammar.

Again, the general documentation page helps to get a clue. It's comparatively easy with GOLD Parser. With lex and yacc it was worse; without any tool it's really hell.

codemanyak commented 7 years ago

Once more to the C problem (52: extern char* __fileselect» (int,char*,char*,char*);). Why was the marker in front of the opening parenthesis instead of before the modifier? Because the parser shifted, i.e. decided that the occurrence of a modifier is evidence for a global variable declaration - a function declaration was already ruled out according to the simplified grammar. And then the parenthesis wasn't expected; instead, all the symbols listed were expected - to form an array declaration, to end the declaration, or to continue it by listing further ids.

The other problem you reported has nothing to do with the parser. Apparently I used a defective regular expression to resolve the defines. The shards of the regular expression are still visible in the error message: (.*?\W)NONOPTION_P (argv[cob_optind][0] != '-' || argv[cob_optind][1] ==(\W.*?)

But hey, it's a macro definition! So the parentheses in the substitution pattern meddled with the regular expression syntax. I should have been aware that macros may still occur even if I try my best to ignore them...
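
As a minimal sketch of how such simple-define substitution can be made robust against regex metacharacters (a standalone example with an invented, shortened macro body - not the actual preprocessing code):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleDefineResolver {
    /** Replaces whole-word occurrences of simple (parameterless) #defines. */
    public static String resolveDefines(String code, Map<String, String> defines) {
        for (Map.Entry<String, String> def : defines.entrySet()) {
            // Pattern.quote neutralises regex metacharacters in the macro name;
            // Matcher.quoteReplacement neutralises '$' and '\' in the macro body,
            // so neither side can corrupt the regular expression machinery.
            Pattern name = Pattern.compile("\\b" + Pattern.quote(def.getKey()) + "\\b");
            code = name.matcher(code).replaceAll(Matcher.quoteReplacement(def.getValue()));
        }
        return code;
    }

    public static void main(String[] args) {
        Map<String, String> defines = new LinkedHashMap<>();
        defines.put("NONOPTION_P", "(argv[optind][0] != '-')");  // invented, shortened body
        System.out.println(resolveDefines("if (NONOPTION_P) break;", defines));
    }
}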

I'm so glad we haven't rolled out this version yet... :wink: And I'm really grateful for such an intensive crash test. Thank you very much.

With this fixed, the next error again shows an unexpected modifier in an enum declaration; according to the grammar it must always start with "enum":

error.syntax in file "D:\SW-Produkte\Structorizer\tests\Issue354\cobgetopt.c"
Preceding source context:
  79:      expect this.
  80:      RETURN_IN_ORDER is an option available to programs that were written
  81:      to expect options and other ARGV-elements in any order and that care about
  82:      the ordering of the two.  We describe each non-option ARGV-element
  83:      as if it were the argument of an option with character code 1.
  84:      Using '-' as the first character of the list of option characters
  85:      selects this mode of operation.
  86:      The special argument '--' forces an end of option-scanning regardless
  87:      of the value of 'ordering'.  In the case of RETURN_IN_ORDER, only
  88:      '--' can cause 'getopt' to return -1 with 'optind' != ARGC.  */
  89:   static enum » {

Expected: Id

codemanyak commented 7 years ago

Okay, the hotfix is online. And with it some little enhancements: the logical inversion of IF elements (#367) and some new file attributes (author, dates, #372).

GitMensch commented 7 years ago

Git pull leads to an undefined type isParenthesized in Generator.java and Diagram.java. Is this a local problem (I'm new to Git...) or is something missing? The only unstaged change is the GUI changelog (which seems to be generated and should therefore be removed from version control and added to .gitignore, shouldn't it?)

Edit: After restarting Eclipse the build now works ?!? Nonetheless, the question about the changelog persists.

GitMensch commented 7 years ago

The grammar is simply wrong about the modifiers. For static enum {} var see http://stackoverflow.com/questions/4971436/c-what-does-static-enum-mean.

GitMensch commented 7 years ago

Thank you for the howto part; it would be nice to have it placed in-source under src\lu\fisch\structorizer\parsers for later reference.

codemanyak commented 7 years ago

Of course I know that the grammar is defective and that you generally can apply storage classes to functions as well. And of course we will have to enhance the grammar step by step. I just had the faint hope that it might be good enough for a first version of a big enhancement. But you are right: if the restrictions and limitations are too obvious then it would turn against us and frustration might dominate.

W.r.t. the temporary compilation problem - it's often Eclipse that needs a directory refresh after source code changes or even folder structure modifications have taken place outside of Eclipse's control. It caches a lot of information and may get out of sync. W.r.t. the changelog.txt: Do we talk about the source (src) or the bin directory?

GitMensch commented 7 years ago

Do we talk about the source (src) or the bin directory?

About the bin directory, see #381.

GitMensch commented 7 years ago

Should the preprocessing resolve includes (in the case of COBOL COPY file. statements) from a given list of directories (something a language specific pre-parser normally does)?

This depends. Resolving #includes in C would mean copying large chunks of syntactically tricky header files into the code, which would rather not improve anything.

I'd say it is useful even for C if not all header files are parsed, but the user has the option to set a list of directories for the specific import and only headers found there are parsed (i.e. not system headers, but the ones that belong to the program/library being imported).

You actually gave two reasons for this yourself:

The parser doesn't check whether certain identifiers are declared; it's quite sufficient to detect that it's an identifier. So why bother and worsen the situation?

If I understood this correctly, we can include a variable definition as an instruction. Doing so as a post-processing step for all variables we detect (which must be collected in the first run) and can't resolve to a type would likely be helpful (reason 1).

With #define directives it's different: without resolving them, the code may be syntactically compromised. Hence, as a first step, I resolve simple defines [...]

#defines actually very often come from an application-specific header file (the most widespread example is #include "config.h"). Not processing this file (at least for #defines) would often lead to an NSD that isn't useful, as important parts aren't included (reason 2).

I suggest adding an import dialogue that opens when importing a file, presenting the user with the file name, the importer that will be used (maybe as a drop-down field; this removes the need to have multiple menu entries for the import, too) and importer-specific attribute fields (the specific importer has to register them in its constructor), with the likely common "parse headers from these directories" having an attribute type of IMP_ATTR_DIRECTORY_LIST. (COBOL would additionally include IMP_ATTR_CHECKBOX with the text "use free-form reference-format" and an additional IMP_ATTR_INPUT with the text "inline comment symbol".)

codemanyak commented 7 years ago

It is completely okay to analyse what should be done to arrive at a flawless and universal product. But right now this is going to be the ultimate overkill. I'm not really inclined to re-invent preprocessors; that tends towards a century's task... (or requires at least a full-time job for some months, I'm afraid). An alternative would be to employ the original preprocessors. But this would force us to implement docking points for a wide variety of compiler products with all their specific directory configuration etc., at least gcc, VisualStudio etc. In theory feasible, but hardly with sensible effort, I'd say - at least not as a hobby project. It is sort of wishful thinking to expect Structorizer to be capable of transforming e.g. the complete Linux source code without errors in a single pass. That's not realistic. (The grammar used is still so degenerate that it simply doesn't accept type ids in the code - this alone is a real challenge. A compiled grammar cannot simply add some new identifiers as keywords during runtime.)

GitMensch commented 7 years ago

A compiled grammar cannot simply add some new identifiers as keywords during runtime.

Hm, does the C++ grammar have the same issue? The way we do this in GnuCOBOL (lex/yacc based) is to have something along the lines of <type> there and define

type:
  char
| int
| word
  {check_if_word_is_type(word)}
;

The error itself (if there is any, because "word" is not a previously defined type) comes from the function; the parser accepts any word there. And Structorizer doesn't need to do that syntax check.

It is completely okay to analyse what should be done to come to a flawless and universal product. But just now this is going to be the ultimate overkill. I'm not really inclined to re-invent preprocessors. This tends to a century's task... (or requires at least a full-time job for some months, I'm afraid).

An alternative would be to employ the original preprocessors. But this would force us to implement docking points for a wide variety of compiler products with all their specific directory configuration etc., at least gcc, VisualStudio etc. In theory feasible but hardly with sensible efforts, I'd say.

I actually thought about using the pre-processor shipped with the compiler, too, but while this may be a good idea for COBOL (where you normally use system libraries only via subprogram CALL and most programs don't have hundreds of system includes), it may not be for a typical C application.

In any case we can simply document the issue: "the current import code only supports a minimal subset of the C pre-processor; you may run the C source through your compiler's pre-processor (for example gcc -E) and import the pre-processed source to compare the results". This is perfectly fine - I wouldn't include the pre-processor compiler run in Structorizer, as it would be a big effort to get this right and it would still force the user to check every compiler setting (including runtime variables) and copy it.

If I understand you correctly you currently don't want to add pre-processing of #include - can I still ask you to

- replace the multiple "Import from xyz" menu entries by a single "Import from source" entry
- add an import dialogue, which is always opened on import with a drop down control for choosing the language, pre-set depending on the file extension of the chosen file
- add an option in the generator to register additional settings for the generator (they don't need to be used and if you don't want to do so for C others may kick in and add the directories for processing #include later - and allow settings like "reference-format" for the COBOL importer)?

codemanyak commented 7 years ago

@GitMensch

Hm, does the C++ grammar have the same issue?

I simply haven't checked, because I don't have an adapted C++ grammar for GOLD Parser and have obviously not had the time to hack in and adapt the grammar from the C++ standard myself...

The way we do this in GnuCOBOL (lex/yacc based) is to have something along the lines of <type> there and define...

I just tried the obvious thing: to allow an ID among the alternatives for <Type>. But this immediately caused a flood of conflict and error messages. I will have to check meticulously for the exact positions of the conflicts. Maybe the implicit int type declaration granted by the grammar (i.e. if you omit a type specification, the parser assumes an int type) plays a key role here. I'm considering lifting this "feature" to overcome the trouble. But I'm afraid there might have been a reason why the authors of this GOLD Parser "ANSI C" grammar neglected the occurrence of user-defined type names. I don't know; I still hope I can fix it in the grammar without too many compromises on other aspects.

In any case we can simply document the issue: "the current import code only supports a minimal subset of the C pre-processor; you may run the C source through your compiler's pre-processor (for example gcc -E) and import the pre-processed source to compare the results". This is perfectly fine - I wouldn't include the pre-processor compiler run in Structorizer, as it would be a big effort to get this right and it would still force the user to check every compiler setting (including runtime variables) and copy it.

I totally agree. This limited approach would keep the project at an affordable level.

If I understand you correctly you currently don't want to add pre-processing of #include.

Correct. Not now at least.

replace the multiple "Import from xyz" menu entries by a single "Import from source" entry

Why not. With file dropping and the command-line mode it has already dawned on me that it is rather superfluous to let the user choose the language when it can be decided from the file name extension. So it would only be consistent to simplify the menu as well. The only sense it seemed to make - giving an overview of which languages are available (still by far not all for which an export is offered) - is certainly no justification for an awkward user interface.

add an import dialogue, which is always opened on import with a drop down control for choosing the language, pre-set depending on the file extension of the chosen file

Makes sense, lest we completely deprive the user of control...

add an option in the generator to register additional settings for the generator (they don't need to be used and if you don't want to do so for C others may kick in and add the directories for processing #include later - and allow settings like "reference-format" for the COBOL importer)

This is the complicated part, because I can only think of an unstructured approach here for all the languages to come :wink:.... Would just a text area (per language) be good enough here (for now) that leaves it to the specific parser to make sense of the respective plain text?

codemanyak commented 7 years ago

After some first considerations, I think I might do the following:

  1. get back to the single hard-coded menu item "Import -> Source Code..."
  2. get all parsers listed in the parsers.xml, add them all as file filters to the FileChooser and leave it to the latter to select whatever.
  3. We can ask the FileChooser for the file filter last used. Then it ought to be clear.
  4. The selection might possibly be heterogeneous, however, if the general file filter was in use. So what? At this point it might not be sensible anymore to ask the user which parser(s) (s)he wants to use... It will be an automatic per-file decision, driven by the filename extension. If the outcome isn't satisfying then the files may be selected via a specific filter, so there will be no further doubt about the favourite parser. This would work even in the case that there are extension mapping conflicts among the parsers.

There could be an automatic analysis of file extension conflicts among the parsers, though. Via some button, Structorizer might also show the mapping generated from the listed parser plugins.
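
For illustration, points 2-4 with the standard Swing classes might look roughly like this (a sketch only; the hard-coded filter list and the decideParserFor helper are placeholders for what would actually be read from parsers.xml):

import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.filechooser.FileNameExtensionFilter;

public class ImportChooserSketch {
    public static void chooseAndImport(java.awt.Component parent) {
        JFileChooser chooser = new JFileChooser();
        chooser.setMultiSelectionEnabled(true);
        // One filter per parser plugin; in Structorizer these would come from parsers.xml.
        chooser.addChoosableFileFilter(new FileNameExtensionFilter("Pascal files", "pas", "dpr"));
        chooser.addChoosableFileFilter(new FileNameExtensionFilter("ANSI-C files", "c"));
        chooser.addChoosableFileFilter(new FileNameExtensionFilter("COBOL files", "cob", "cbl"));
        chooser.setAcceptAllFileFilterUsed(true);   // the "general" file filter mentioned above

        if (chooser.showOpenDialog(parent) == JFileChooser.APPROVE_OPTION) {
            // Point 3: the filter last in use can be queried via chooser.getFileFilter().
            for (File f : chooser.getSelectedFiles()) {
                // Point 4: per-file decision, driven by the filename extension.
                System.out.println("Would import " + f + " with " + decideParserFor(f));
            }
        }
    }

    private static String decideParserFor(File f) {
        String name = f.getName().toLowerCase();
        if (name.endsWith(".pas") || name.endsWith(".dpr")) return "D7Parser";
        if (name.endsWith(".c"))                            return "CParser";
        if (name.endsWith(".cob") || name.endsWith(".cbl")) return "COBOLParser";
        return "unknown";
    }
}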

codemanyak commented 7 years ago

New menu strategy for code import implemented, ready for download.

This will be my last coding activity for the next two weeks. Will take an off-time.

GitMensch commented 7 years ago

add an option in the generator to register additional settings for the generator (they don't need to be used and if you don't want to do so for C others may kick in and add the directories for processing #include later - and allow settings like "reference-format" for the COBOL importer)

This is the complicated part, because I can only think of an unstructured approach here for all the languages to come 😉....

Given that we already have parsers.xml for telling the GUI which parsers to show, I'm changing my original request of letting the parser decide which options to show. The parser is of course free to read or not read a configuration value, but the configuration values that exist can be set for the parser. Something along the lines of:

<?xml version="1.0" encoding="ISO-8859-1"?>
<plugins>
    <!-- a plugin needs to have a title and a class -->
    <plugin title="Pascal" class="lu.fisch.structorizer.parsers.D7Parser" />
    <!-- a plugin may have optional attributes like an icon -->
    <plugin title="ANSI-C" class="lu.fisch.structorizer.parsers.CParser" icon="c.png" />
    <plugin title="COBOL" class="lu.fisch.structorizer.parsers.COBOLParser" icon="cob.png">
        <!-- a plugin may have parsing options to choose from, needs at least a type attribute, values can be resolved only for options with a name attribute -->
        <option type="text" name="includes" help="List of directories where copybooks are resolved from"/>
        <!-- the type attribute is used for minimal validation and for defining the control to draw -->
        <option type="number" name="tab-length" title="tab length" default="8"/>
        <option type="checkbox" name="free-form" help="use free-form reference-format, otherwise fixed-form reference-format is used"/>
        <!-- depending on the type an option may have sub-items -->
        <option type="radiogroup" name="sub-diagram" title="generation of sub diagrams" default="1">
            <option type="item" value="0" title="no sub diagrams" />
            <option type="item" value="1" title="sub diagram for sections" />
            <option type="item" value="2" title="sub diagram for paragraphs" />
        </option>
    </plugin>
</plugins>

with something like

   // hypothetical convenience getters provided by the parser base class
   // (Java has no out parameters, so the values are returned rather than set)
   String  pSIncludes   = getParserSetting("includes", "");
   int     pSTabLength  = getParserSetting("tab-length", 8);
   boolean pSFreeForm   = getParserSetting("free-form", false);
   int     pSSubDiagram = getParserSetting("sub-diagram", 1);

Maybe "hide" these settings in the Import Dialog by putting them behind a "Parser Options" button (de-activated if the parser has no option defined in parsers.xml) and showing a dialogue window when it is pressed.
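
For illustration, reading such option declarations with the standard DOM API could look roughly like this (a sketch under the assumptions of the proposal above; the class and method names are hypothetical, not existing Structorizer code):

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PluginOptionReader {
    /** Collects the option declarations of the plugin with the given title. */
    static List<Element> readOptions(InputStream parsersXml, String pluginTitle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(parsersXml);
        List<Element> options = new ArrayList<>();
        NodeList plugins = doc.getElementsByTagName("plugin");
        for (int i = 0; i < plugins.getLength(); i++) {
            Element plugin = (Element) plugins.item(i);
            if (!pluginTitle.equals(plugin.getAttribute("title"))) continue;
            NodeList opts = plugin.getElementsByTagName("option");
            for (int j = 0; j < opts.getLength(); j++) {
                options.add((Element) opts.item(j));   // carries type/name/title/default attributes
            }
        }
        return options;
    }
}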

Would just a text area (per language) be good enough here (for now) that leaves it to the specific parser to make sense of the respective plain text?

If the approach above is too much for now, an unstructured text that needs to be split by the parser (as is done in the generators for the standard includes) is not as nice as the one above but would work, too.

GitMensch commented 7 years ago

New menu strategy for code import implemented

I see - it is good; it would be even better if the import went one level up (it is currently Import -> Import from Code..., but as we don't have anything else there it makes sense to place this one level higher).

This will be my last coding activity for the next two weeks. Will take an off-time.

Thank you very much for the things you've done so far. I think I can work on the COBOL importer soon (and define the settings already but hardcode their value for now).

codemanyak commented 7 years ago

I see - it is good; it would be even better if the import went one level up (it is currently Import -> Import from Code..., but as we don't have anything else there it makes sense to place this one level higher).

I actually thought about it but then decided against due to the following considerations:

  1. Symmetry reason (with export),
  2. I'm thinking about other import sources, in particular from other structogram editor file formats (some time later perhaps)
  3. I added a new key binding Ctrl-Shift-I (in analogy to Ctrl-Shift-X for the favourite code export), so there will be little need to use the menu at all.
  4. Preserves the status quo (well, that's a joke).

codemanyak commented 7 years ago

Given that we already have parsers.xml for telling the GUI which parsers to show, I'm changing my original request of letting the parser decide which options to show. The parser is of course free to read or not read a configuration value, but the configuration values that exist can be set for the parser.

Well, the parsers.xml file isn't visible to the user; it's just an internal resource file. That means it is hardly suited for user customisation. For the (Structorizer-internal) configuration of the appearance and customisable properties of the parsers, however, it might be perfect. Possibly I misunderstood your original suggestions: my answer only had in mind how the user might specify some language-specific peculiarities where Structorizer would be too neutral to understand them. I think using the parser configuration in parsers.xml to offer meaningful structured configuration is a really good idea and definitely worth a try. (But not today.) Btw, I just initiated a bugfix version 3.26-05. The "big bang" with the import enhancements won't be delivered before we have solved this and some already-reported problems (type ids in the C grammar etc.).

GitMensch commented 7 years ago

It is good that we found common ground on the options the parsers (may) use via parsers.xml. Have a good time off.

GitMensch commented 7 years ago

Just a note when you're back: I like the new dialogue

The source file of "x" is ambiguous. Please select an import language: [...]

But before the dialogue is called it would be good to check whether the file actually exists (just enter "X" and press OK). If it doesn't exist you get the null pointer or the message noted in #384.

If you abort the dialogue a message is shown that shows the path name without backslashes:

Code import for file "C:usersmeDocumentsX" cancelled.

I've done some fixes to the COBOL-85 grammar and started a heavily extended COBOL 2002 + extensions version, wrote a little COBOL pre-parser and can now read in some COBOL code - but to be usable without programs having to be changed in multiple places, the grammar needs a complete rewrite. After putting nearly 6 hours into the current grammar I'd say this would be at least two weeks of full-time work - waaaay too much. GOLDParser ships a bison/yacc import; if we could import GnuCOBOL's parser.y (the importer currently can't do this) I'd guess the time for the necessary parser changes would be around 2 days. I've asked the author about the import problems and will report if/when I get an answer. Nonetheless, the extended COBOL-85 grammar and the minimal preparser may be something we could place in the source tree.

GitMensch commented 7 years ago

BTW: Some minor things to the CParser:

-               else if (!strLine.startsWith("#")) {
+               else if (!strLine.trim().startsWith("#")) {   // preprocessor directives start at the first non-space character [trim() cuts tabs, too, which is fine]

Some things to the C grammar

codemanyak commented 7 years ago

long long literals like 1LL aren't accepted; the easiest solution is likely to change the decimal literal to DecLiteral = [123456789]{digit}*('LL')? or to define a new kind of literal and add it where necessary

Well, there's a lot more to it: long literals (with a single 'L' suffix) - not only for decimal constants -, unsigned literals (with 'u' suffix), also in combination, float literals with 'f' or 'F' suffix, etc. The type names "[unsigned] long long" hadn't been supported either (without which it doesn't make much sense to enable long long literals), and so on. Anyway we should define which ANSI-C version we want to refer to (C73, C99, ...?). The grammar found on the GOLD Parser homepage suggested it would reflect the original Kernighan-Ritchie C but that wasn't true.

GitMensch commented 7 years ago

Anyway we should define which ANSI-C version we want to refer to (C73, C99, ...?). The grammar found on the GOLD Parser homepage suggested it would reflect the original Kernighan-Ritchie C but that wasn't true.

C99 would be a good start. I'm going to update the COBOL ANSI 85 version (it now passes at least two test programs from the National Institute of Standards [the official COBOL 85 test suite]), but I think we shouldn't use it. Instead I suggest we add my newer one, a starting implementation of COBOL 2014. Note: I won't finish the COBOL 2014 grammar because it needs a complete rewrite - an import from the GnuCOBOL parser would be much better, and if we get that import working I'll come back and put some effort into it. The minimal preparser needed is finished, too; should I create a fork of your branch to add the ANSI 85/2014 grammars and the preparser, plus the merge request?

codemanyak commented 7 years ago

Okay, long long type and literals (as well as unsigned literals and float literals) added to C grammar.