
Cobol-REKT Banner

Cobol-REKT (Cobol Reverse Engineering KiT)


Elevator Pitch

Cobol-REKT is an evolving toolkit of capabilities helpful for reverse engineering legacy Cobol code. These capabilities range from building flowcharts of the underlying code to translating COBOL into a hybrid Intermediate Representation (graph- and instruction-based) potentially suitable for transpilation to other languages.

Engineers can pick components from this library to incorporate into their reverse engineering analysis workflows, or into more cohesive products suitable for wider audiences such as analysts.

This is also why Cobol-REKT does not have a UI (though there may be one in the future for showcase purposes).

An important aspirational aim of this library is to serve as a testbed for some of the following:

In addition, the library includes several reusable algorithms commonly used for analysis in compiler middleware toolchains.

Backlog

You can see the current backlog here.

Contents

Introduction

Cobol-REKT is a continually evolving, actively maintained collection of capabilities helpful for reverse engineering legacy Cobol code. The following is a representative list of the capabilities currently available:

Demo App (Very Early - WIP)

This app is ultimately intended only to showcase the capabilities of the library; it is currently in a nascent state and not ready for use. Nevertheless, the screenshot below shows the user navigating the syntax-highlighted Intermediate Representation source, with the IR Syntax Tree shown in the graph view. The top right pane shows the currently selected node's information, and the bottom right pane shows the available projects.

Demo App Early

Philosophy

Cobol-REKT is more of a library of useful things intended to be embedded in more formal reverse engineering workflows/pipelines than a standalone tool (though you can certainly use it as such). Many of the higher-level wrappers are merely sensible defaults; you are encouraged to modify them to suit your needs.

Note also that OpenAI integration is attached only to specific features; much of the functionality of this toolkit is based on deterministic analysis.

The toolkit consists of Java components, which mostly handle parsing and ingestion of Cobol sources, export of artifacts, and control flow analyses, and Python components, which carry out some of the other analyses (including LLM-assisted summaries).

In addition, the support for Neo4J allows you to build custom graph analyses as per your needs. I expect most of the dynamic analysis to happen through Python (using Neo4J or NetworkX); hence the Java component tries to unlock as much of the ingested data as possible in different formats.

Major Dependencies

Reverse Engineering Use Cases

Some reverse engineering use cases are listed below. Descriptions of the capabilities which support these use cases are provided in later sections.

Roadmap

Flowchart Generation

This capability allows the engineer to transform Cobol source (or part of it) into a flowchart. The flowchart stays true to the source but omits syntactic noise, producing a detailed flow of logic through the program. The capability uses Graphviz to generate the flowchart images. The flowchart can be generated by specifying the DRAW_FLOWCHART task.

NOTE: You need to have Graphviz installed to generate flowcharts. Specifically, the dot command should be available on your path.
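For example, a command along the following lines (paths are illustrative) draws section-wise flowcharts via the run command:

java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl --commands="BUILD_BASE_ANALYSIS DRAW_FLOWCHART" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath /path/to/dialect-idms.jar --reportDir /path/to/report/dir --generation=SECTION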

Example flowchart of the program test-exp.cbl

Flowchart

Interpreting the Flowchart

The dotted lines indicate elements nested inside a node. So, in the above example, after the first beige "Processing" block, there is a node which leads to an IF statement through the dotted line. What happens inside the IF statement can be understood by "stepping into" this dotted line. The normal flow after the IF statement finishes can be continued by returning to the node where the dotted line originates.

You can also generate Mermaid flowcharts for embedding in other Markdown files. You can use the EXPORT_MERMAID task for this. Currently, this supports only section-wise generation of Mermaid charts.
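For example (paths illustrative):

java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl --commands="BUILD_BASE_ANALYSIS EXPORT_MERMAID" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath /path/to/dialect-idms.jar --reportDir /path/to/report/dir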

Mermaid Flowcharts

Parse Tree Generation

This allows the engineer to produce the parse tree of Cobol source. This is suitable for use in further static analysis, transformation (into a control flow tree, for example), and inputs to other systems (informed chunking for an LLM, ingestion into a graph database for further exploration, etc.). See Reverse Engineering Use Cases for more examples.

Most of the capabilities are already present in the Che4z library. Some new grammars have been added. They are:

This capability can be used by specifying the WRITE_RAW_AST task.

Control Flow Generation

This capability allows the engineer to produce a control flow tree for the Cobol source. This can be used for straight-up visualisation (the flowchart capability actually uses the control flow tree behind the scenes), or more dynamic analysis through an interpreter. See SMOJOL (SMol Java-powered CobOL Interpreter) for a description of how this can help.

The CFG generation is part of the FLOW_TO_NEO4J task.

Note that this is not the same control flow model which is used in the transpiler tasks. For that, see Experiments in Transpilation.

Neo4J Integration

Both the AST and the Control Flow Graph can be injected directly into Neo4J. The injected AST can be in the same format as the SMOJOL interpreter's AST, or the bare parse tree generated by ANTLR. The Control Flow Graph is in the SMOJOL AST format.

When generating the AST and CFG, the library allows configuring them to be the same, i.e., the same nodes are reused for creating both AST and CFG connections. For example, in the screenshot below, the same graph has CONTAINS (AST relationship), FOLLOWED_BY (CFG relationship), and MODIFIES/ACCESSES (data structure relationships).

This provides a rich, unified view of the entire program for analysis, without having to jump between multiple disconnected views of the source code.

This can be done by specifying the INJECT_INTO_NEO4J task.

Unified AST-CFG-Data Graph
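As an illustration of the kind of query this unified graph enables, the following sketch uses the official Neo4J Python driver (connection details are placeholders) to walk AST containment and control flow edges attached to the same nodes:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# CONTAINS is an AST edge, FOLLOWED_BY a CFG edge; both attach to the same nodes.
query = """
MATCH (parent)-[:CONTAINS]->(node)-[:FOLLOWED_BY]->(successor)
RETURN parent, node, successor LIMIT 25
"""
with driver.session() as session:
    for record in session.run(query):
        print(record)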

LLM-augmented Analysis

Depth-First Summarisation

The OpenAI integration can be leveraged to summarise nodes in a bottom-up fashion (i.e., summarise leaf nodes first, then use those summaries to summarise the parent nodes, and so on).
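The sketch below illustrates the idea; the node and llm interfaces are hypothetical, not the toolkit's actual API:

def summarise_bottom_up(node, llm):
    # Post-order traversal: children are summarised before their parent,
    # and the parent's summary is built from the child summaries.
    child_summaries = [summarise_bottom_up(child, llm) for child in node.children]
    prompt = f"Summarise this COBOL AST fragment: {node.text}"
    if child_summaries:
        prompt += f"\nSummaries of its children: {child_summaries}"
    node.summary = llm(prompt)
    return node.summary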

The following diagram shows the AST, the Control Flow Graph, and the data structures graph. The yellow nodes are the summary nodes (generated through an LLM) attached to the AST (to provide explanations) and the data structures (to infer domains).

ast-cfg-structs-graph

Building Glossaries (ALPHA)

The toolkit supports building glossaries of variables given the data structures in a program. This capability is provided through Python in the smojol_python component. To use this facility, start by exporting the data structures to JSON, through the JAR, like so:

java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="BUILD_BASE_ANALYSIS WRITE_DATA_STRUCTURES" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir

This will generate a JSON file in /path/to/report/dir. After this, you can run:

cd smojol_python
python -m src.llm.glossary_builder.main /path/to/report/dir/program-data.json out/glossary.md

This will generate the glossary in out/glossary.md. Integrating other out-of-band data sources is a work in progress.

Building Capability Maps (ALPHA)

The toolkit supports extracting a capability map from the paragraphs of a source. For this, you need to generate the AST in Neo4J, as well as the data structures JSON. You can do this via:

java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="BUILD_BASE_ANALYSIS FLOW_TO_NEO4J WRITE_DATA_STRUCTURES" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir

After this, you will want to extract the paragraph capabilities, like so:

python -m src.llm.capability_extractor.paragraph_capabilities /path/to/data/structures/json /paragraph/capabilities/output/path ../paragraph/variables/explanation/output/path

This will generate the capabilities in /paragraph/capabilities/output/path. At this point, you may need to clean parts of the output manually, if some entries do not correspond to a comma-separated list of domain terms (efforts to eliminate this manual process are in progress).

The final step is to actually generate the capability map:

python -m src.llm.capability_extractor.capabilities_graph /paragraph/capabilities/output/path

Capability Map Extraction Screenshot

This will take a little time, depending upon the number of paragraphs and their sizes. At the end, it will generate a dendrogram visualisation, as well as the capability map in Neo4J, as illustrated below (for a 10000+ line COBOL program).

Capability Map Dendrogram

Capability Map Neo4J

Data Dependency Graph

This capability connects records which modify other records with a FLOWS_INTO relation. The dependencies traced include variables which are used in expressions, as well as freestanding literals. It also generates connections based on REDEFINES clauses. Below is an example of a set of record dependencies from a program.

record-dependencies-graph

Comments Integration

This capability connects comments to the nearest subsequent node with a HAS_COMMENT connection. This works for comments in the PROCEDURE division and all data structures. Comments before copybooks are connected to the main program node. Any comments which cannot be attached to found nodes end up being connected to the main program node.

This can be done by specifying the ATTACH_COMMENTS task. Note that for the comment nodes to appear in the graph, the FLOW_TO_NEO4J task must appear after the ATTACH_COMMENTS task.
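For example, a command whose task order respects this constraint (paths illustrative):

java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl --commands="BUILD_BASE_ANALYSIS ATTACH_COMMENTS FLOW_TO_NEO4J" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath /path/to/dialect-idms.jar --reportDir /path/to/report/dir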

The example below shows all node-comment sets for a reasonably large program.

comment-nodes

SMOJOL (SMol Java-powered CobOL) Interpreter

The interpreter is a natural extension to building the parse tree for a Cobol source. Since syntax nodes are temporally ordered, it is possible to build an execution tree which covers all possible flows. This is the basis for flowchart generation, and also the basis for a tree-walk interpreter. The interpreter sets up the AST which is a projection of the parse tree more suitable for execution. Parts of the interpreter convert specific nodes in the AST into more suitable forms at runtime (for example, expression evaluation).

The primary motivation for the interpreter is to be able to simulate the execution of programs (or fragments of programs) in a sandboxed environment where the engineer needn't worry about fulfilling dependencies required to run the code in a true mainframe environment. Rather, they can inject these dependencies (values of variables, for example) as they see fit, to perform their true task: namely, performing control flow analysis.

The interpreter can run in two modes:

Current Capabilities of the Interpreter

Planned Capabilities of the Interpreter

Example interpreter session demonstrating breakpoints, stack traces, and record inspection

Interpreter Session

Integration with Neo4J

The interpreter also supports injecting a complete execution path through the program into Neo4J. The screenshot below shows the execution trace of a reasonably complex program.

Execution Trace Graph

GraphML Export

You can export to GraphML from Neo4J pretty easily. If you don't have Neo4J set up though, the toolkit allows exporting the following entities to the GraphML format (with all interconnections) as a single supergraph, so that they can be consumed further downstream for analysis:

The screenshot below shows a sample program's unified model exported to GraphML and loaded through the yEd Graph Editor.

yEd Screenshot of GraphML Exported Data Structures

To export the unified graph to GraphML from Neo4J, you can run the following command:

CALL apoc.export.graphml.all("<export.graphml>", {})

The file will be in the import directory inside the directory where the current database files are stored (in Neo4J Desktop).
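If you want to consume the export outside Neo4J, the GraphML file loads straightforwardly into other tools; for example, a minimal NetworkX sketch (file path illustrative):

import networkx as nx

g = nx.read_graphml("export.graphml")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")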

Analysis through JGraphT

In addition to writing to Neo4J and leveraging its data science capabilities to analyse the graph(s), the library also embeds JGraphT, a powerful library of graph algorithms. The JGraphTBuilder class converts the unified model (AST, CFG, Data Structures) into a DirectedPseudograph (because there can be both loops and parallel edges between two nodes), for subsequent analysis through the JGraphT API.

Custom analyses are a work in progress. The COMPARE_CODE task, for example, uses the JGraphT library.

Useful Analyses through plain Python

1) Analyse static value assignments to variables

This is useful when you are looking for the range of values which are assigned to a record in a program. You will need to execute the WRITE_RAW_AST task first, like so:

java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl hello.cbl --commands="BUILD_BASE_ANALYSIS WRITE_RAW_AST" --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir /Users/asgupta/code/smojol/smojol-test-code --dialectJarPath ./che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --reportDir out/report --generation=PROGRAM

Once you have the AST file, you can run the analysis like so (making sure first that you are in the smojol_python directory):

python -m src.analysis.variable_static_values /path/to/ast/json --output=/path/to/output

If you omit the --output flag, it will simply print out the results.

Analysis through NetworkX

If you export the Unified Model to JSON, you can import it into NetworkX quite easily. The unified_model_to_networkx script lets you ingest the JSON and create both the full NetworkX graph and the in-memory Python equivalent of the Unified Model. You're free to extract specific parts of the model through convenience functions. The following code extracts the AST, CFG, and the data structures separately. You can customise extraction for your use case; take a look at any one of those convenience methods for guidance.

import json

# The extract_* helpers are the convenience functions from the
# unified_model_to_networkx script mentioned above.
with open(input_path, 'r') as file:
    unified = json.load(file)
    ast_x, _, _ = extract_ast(unified)
    cfg_x, _, _ = extract_cfg(unified)
    ds_x, _, _ = extract_ds(unified)

There are currently two tasks implemented using NetworkX.

Code Similarity

TODO... See similarity.py.

Code Pattern Recognition

It is very easy to do pattern matching in Neo4J, so you might choose to use those facilities instead (see Analysis through Neo4J). However, pattern matching is also possible directly in NetworkX, using subgraph isomorphisms.

You will need to construct the subgraph pattern through code, which is not as elegant as writing Cypher queries, but it does work. See pattern_matcher.py for an example.
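The following is a minimal sketch using NetworkX's subgraph isomorphism machinery; the node_type attribute is illustrative (not the toolkit's actual schema), and cfg_x is the CFG extracted earlier:

import networkx as nx
from networkx.algorithms import isomorphism

# Pattern: a MOVE statement whose successor is a CALL statement.
pattern = nx.DiGraph()
pattern.add_node("move", node_type="MOVE")
pattern.add_node("call", node_type="CALL")
pattern.add_edge("move", "call")

def same_type(candidate_attrs, pattern_attrs):
    return candidate_attrs.get("node_type") == pattern_attrs.get("node_type")

matcher = isomorphism.DiGraphMatcher(cfg_x, pattern, node_match=same_type)
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)  # maps nodes of cfg_x to nodes of the pattern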

Analysis through Neo4J

Code Pattern Recognition

You can match patterns pretty easily through Cypher. See neo4j_pattern_matcher.py for an example of how to match a call pattern, i.e., a bunch of sequential MOVE statements, followed by a CALL statement. The patterns being developed are documented in neo4j_pattern_queries.py.

You can find some useful Neo4J-based analysis queries in Analysis.

Control Flow Analysis and Transpilation Experiments

Most of the tasks in this category are meant to be used as part of a larger analysis workflow, and thus do not have any filesystem outputs. All the analyses use JGraphT's DefaultDirectedGraph for representing and manipulating graph structures.

Also note that most of the tasks under this category are not specific to COBOL, and can be used for analysing control flowgraphs derived from any language.

Exposing a Basic Transpilation Model (Instructions and Basic Blocks)

This target exposes a basic transpilation model which is not tied to COBOL syntax. It uses only assignments, loops, conditions, and jumps to represent most of COBOL syntax. The result may still not be well-structured, because of arbitrary GOTOs. This will be the input for further control flow analysis tasks.

The model currently consists of the following:

For example, if we have an EVALUATE statement like the following:

        EVALUATE TRUE ALSO TRUE
              WHEN SCALED + RESULT < 10 ALSO INVOICE-AMOUNT = 10
                MOVE "CASE 1" TO SOMETHING
              WHEN SCALED + RESULT > 50 ALSO
                INVOICE-AMOUNT = ( SOMETEXT + RESULT ) / SCALED
                MOVE "CASE 2" TO SOMETHING
              WHEN OTHER
                MOVE "CASE OTHER" TO SOMETHING
            END-EVALUATE

Then, the following is an example of the text representation of the transpiler tree of the above statement (formatted for clarity):

if(and(eq(primitive(true), lt(add(ref('SCALED'), ref('RESULT')), primitive(10.0))), eq(primitive(true), eq(ref('INVOICE-AMOUNT'), primitive(10.0))))) 
 then 
{
    CODE_BLOCK: CODE_BLOCK: set(ref('SOMETHING'), value(primitive("CASE 1"))) 
}

else 
{
    if(and(eq(primitive(true), gt(add(ref('SCALED'), ref('RESULT')), primitive(50.0))), eq(primitive(true), eq(ref('INVOICE-AMOUNT'), divide(add(ref('SOMETEXT'), ref('RESULT')), ref('SCALED')))))) 
     then 
    {
         CODE_BLOCK: CODE_BLOCK: set(ref('SOMETHING'), value(primitive("CASE 2"))) 
    }

    else 
    {
         CODE_BLOCK: CODE_BLOCK: set(ref('SOMETHING'), value(primitive("CASE OTHER"))) 
    }
}

As another example, let's take a PERFORM INLINE statement which looks like so:

PERFORM TEST BEFORE VARYING SOME-PART-1 FROM 1 BY 1
UNTIL SOME-PART-1 > 10
AFTER SOME-PART-2 FROM 1 BY 1 UNTIL SOME-PART-2 > 10
    DISPLAY "GOING " SOME-PART-1 " AND " SOME-PART-2
END-PERFORM.

The intermediate representation looks like the following:

loop[loopVariable=ref('SOME-PART-1'), initialValue=primitive(1.0), maxValue=NULL, terminateCondition=gt(ref('SOME-PART-1'), primitive(10.0)), loopUpdate=primitive(1.0), conditionTestTime=BEFORE] 
{
    loop[loopVariable=ref('SOME-PART-2'), initialValue=primitive(1.0), maxValue=NULL, terminateCondition=gt(ref('SOME-PART-2'), primitive(10.0)), loopUpdate=primitive(1.0), conditionTestTime=BEFORE] 
    {
        CODE_BLOCK: print(value(primitive("GOING ")), value(ref('SOME-PART-1')), value(primitive(" AND ")), value(ref('SOME-PART-2')))
    }
}

You can view the formatted output of any node (program root or otherwise) by using the TranspilerTreeFormatter's format() method. The following is the formatted IR representation of simple-goto.cbl.

BLOCK [ProcedureDivisionBodyContext/T1] {
  {
  }
  BLOCK [S] {
    placeholder: S SECTION
    {
      BLOCK [SA1] {
        placeholder: SA1
        {
          print(value(primitive("ABCD")))
        }
        {
          if (gt(ref('WS-NUM1'), primitive(10.0)))
          {
            jump(loc(SZ1), [NULL])
          }
          else
          {
            print(value(primitive("<= 10")))
          }
        }
        {
          print(value(primitive("SA1-1")))
          print(value(primitive("SA1-2")))
        }
      }
      BLOCK [SZ1] {
        placeholder: SZ1
        {
          print(value(primitive("ENDING...")))
        }
      }
    }
  }
}

The screenshot below shows a part of an example transpiler model flowgraph.

Part of an Example Transpiler Model CFG

See TranspilerBuildMain.java for an example.

The BuildTranspilerFlowgraphTask creates the intermediate AST, instructions, and the Basic Block tree.

In addition, this task can accept a set of flow hints. These flow hints allow the engineer to explicitly specify section/paragraph names which are not targets of fallthrough flows. Without these hints, some programs can yield irreducible control flowgraphs, which require more complicated resolution come transpilation time. See Reducibility Test for more details.

Details of the Intermediate Transpiler Tree

Basic Blocks (Experimental Feature)

Basic Blocks are useful for analysing the flow of the code without worrying about its specific computational details. They are also useful (and this is the more pertinent use case here) for rewriting / transpiling potentially unstructured COBOL code (code with possibly arbitrary GOTOs) into a structured form / language (i.e., without GOTOs).

Exposing basic blocks is done through the BuildBasicBlocksTask task. Note that this task does not actually output any artifacts, because it is intended for more internal analysis and transpilation (if I get to it at some point). Each BasicBlock contains a list of straight-line TranspilerInstructions.

Note that if you use the BuildTranspilerFlowgraphTask task, Basic Blocks are automatically generated for you.

Reducibility Testing

[TODO: Write about reducibility]

1. Reducibility Testing using T1-T2 Transforms

Reducibility is tested using interval analysis, specifically using the repeated T1-T2 transform method.

See TranspilerInstructionIntervalAnalysisMain.java for an example.

2. Reducibility Testing using DJ Graphs

A second technique for testing reducibility follows the method outlined in [Sreedhar-Gao-Lee, 1996].

Improper Loop Detection

1. Improper Loop Heuristic using Strongly Connected Components

Strongly Connected Components in a flowgraph represent the most general representation of looping constructs. Proper SCC's have only one node which can serve as the entry point for any incoming edge from outside the SCC. These are natural loops. Having multiple entry points implies that there are arbitrary jumps into the body of the loop from outside it, which makes the loop improper and, consequently, the graph irreducible.

It is important to note that even if no improper SCC's are detected, it does not imply that the flowgraph is reducible. See the flowgraph built in counterExample() in ReducibleFlowgraphTest for an example of such pathological graphs.

Proper SCC's are a necessary condition for a reducible flowgraph, but not a sufficient condition. The sufficient condition is that no strongly connected subgraph be improper. However, SCC's are maximal strongly connected subgraphs, which means they can contain improper strongly connected subgraphs inside them, which is why the distinction is important.

This is, however, a good test which can surface loop-specific reducibility problems. The test is done using the IrreducibleStronglyConnectedComponentsTask task.

Strongly Connected Components are detected using JGraphT's built-in implementation of Kosaraju's algorithm for finding SCC's.
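For intuition, here is a minimal NetworkX sketch of the multiple-entry-point check (the toolkit itself implements this in Java over JGraphT):

import networkx as nx

def improper_sccs(g: nx.DiGraph):
    # Yield SCCs which have more than one entry point from outside the SCC.
    for scc in nx.strongly_connected_components(g):
        if len(scc) < 2:
            continue  # a single-node SCC has at most one entry point
        entry_points = {v for u, v in g.in_edges(scc) if u not in scc}
        if len(entry_points) > 1:
            yield scc, entry_points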

2. Reducible and Irreducible Loop Body Detection

The technique of testing flowgraph reducibility in the previous section also extends to finding all loop bodies, both reducible and irreducible. See LoopBodyDetectionTask.

This technique works by first building the Dominator-Join graph (DJ graph), then identifying reducible and irreducible loop bodies in reverse tree depth order, so that inner loops are detected first and folded into single abstract nodes.

Irreducible loop bodies are found by identifying back edges which are also Cross-Join edges, and then using an SCC algorithm to find SCCs in the subgraph induced by all nodes at that level or deeper.

Dominator Analysis

Several tasks need to be run to perform dominator analysis.

See DominatorAnalysisMain.java for an example.

Reaching Conditions (aka, "How did I get here?")

Given a graph slice, a start node, and a sink node, this calculates the actual conditions that need to be satisfied to reach the sink node. For example, assume we have a program like the following:

       IDENTIFICATION DIVISION.
       PROGRAM-ID.    STOPRUN.
       AUTHOR.        MOJO
       DATE-WRITTEN.  SEP 2024.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
            01 WS-NUM1 PIC 9(9) VALUE 5.
       PROCEDURE DIVISION.
       S SECTION.
       SA1.
           IF WS-NUM1 > 10
            THEN
                GO TO SA2.
           STOP RUN.
       SA2.
           IF WS-NUM1 = 201
            THEN
                DISPLAY "IT IS DONE".
        STOP RUN.

If we run the task on this flowgraph, with the first instruction as the source, and the DISPLAY statement as the sink node, the reaching condition for the statement will be given as:

...
ENTER: CODE_BLOCK: print(value(primitive("IT IS: and(eq(ref('WS-NUM1'), primitive(201.0)), gt(ref('WS-NUM1'), primitive(10.0)))
...

To be more precise, the reaching conditions of all the nodes in the Depth-First Search tree rooted at the start node are calculated.
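The following is a hedged sketch of the idea (not the toolkit's implementation); it assumes an acyclic slice whose edges carry a cond attribute for the guarding branch condition:

import networkx as nx

def reaching_conditions(g: nx.DiGraph, start):
    # The reaching condition of a node is the OR, over its incoming edges, of
    # (the predecessor's reaching condition AND the condition on that edge).
    conds = {start: "true"}
    for node in nx.topological_sort(g):
        if node == start:
            continue
        clauses = [f"and({conds[p]}, {g.edges[p, node].get('cond', 'true')})"
                   for p in g.predecessors(node) if p in conds]
        if clauses:
            conds[node] = clauses[0] if len(clauses) == 1 else f"or({', '.join(clauses)})"
    return conds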

A few important notes about this:

AST Refactoring to eliminate GO TOs

This is an experimental technique based on Taming Control Flow: A Structured Approach to Eliminating Goto Statements to eliminate GO TOs by restructuring the AST.

The TreeSmith class provides capabilities for the following:

Let's consider the simple-goto.cbl program. Eliminating the GO TO in this code gives us the following code:

BLOCK [ProcedureDivisionBodyContext/T1] {
  {
  }
  BLOCK [S] {
    placeholder: S SECTION
    {
      BLOCK [SA1] {
        placeholder: SA1
        {
          print(value(primitive("ABCD")))
        }
        {
          if (gt(ref('WS-NUM1'), primitive(10.0)))
          {
            set(ref('SOME'), primitive(true))
            if (not(primitive(true)))
            {
            }
            else
            {
            }
          }
          else
          {
            print(value(primitive("<= 10")))
          }
          set(ref('SOME'), primitive(true))
          if (not(primitive(true)))
          {
          }
          else
          {
          }
        }
        set(ref('SOME'), primitive(true))
        if (not(primitive(true)))
        {
          {
            print(value(primitive("SA1-1")))
            print(value(primitive("SA1-2")))
          }
        }
        else
        {
        }
      }
      if (not(primitive(true)))
      {
      }
      else
      {
      }
      BLOCK [SZ1] {
        placeholder: SZ1
        {
          print(value(primitive("ENDING...")))
        }
      }
    }
  }
}

NOTE: This is a work in progress. Variables which are used to descope jumps are currently not named uniquely.

Here are some points to note regarding this technique:

This is probably best used for smaller scale refactorings, like eliminating jumps inside a section. For larger scale refactorings, a different approach (potentially based on Automatic compiler restructuring of COBOL programs into a proc per paragraph model - Patent US5778232A (Expired 2016)) might give better results.

Running against AWS Card Demo

The library has been tested (validation, flowchart generation, AST generation, Unified Model generation) against the AWS Card Demo codebase. To run it against that codebase, do the following:

Now run your commands as usual.

Developer Guide

How to Build

The toolkit uses JDK 21 features, so you'll need the appropriate JDK set up.

Run: mvn clean install.

The Checkstyle step is only applicable for the Eclipse Cobol parser project. You can skip the Checkstyle targets with:

mvn clean verify -Dcheckstyle.skip=true

You can skip the tests as well, using:

mvn clean verify -Dmaven.test.skip=true

For generating flowcharts, you will need to set up Graphviz on your box; see the site for OS-specific installation instructions.

Before using any of the Python components, you will need to install dependencies, like so:

cd smojol_python
pip install -r requirements.txt

CLI Usage

The individual functionalities in the Java component can be invoked using different commands. Further tasks / commands will be added.

Command: run

This command encapsulates almost all the tasks that you are likely to run. The descriptions of the various commands are listed below.

For example, if you wanted to run all of the above, you could run the following command:

java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl hello.cbl --commands="BUILD_BASE_ANALYSIS WRITE_FLOW_AST FLOW_TO_NEO4J FLOW_TO_GRAPHML WRITE_RAW_AST DRAW_FLOWCHART WRITE_CFG" --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir /Users/asgupta/code/smojol/smojol-test-code --dialectJarPath ./che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --reportDir out/report --generation=PROGRAM

Passing the validation flag (--validate or -v) skips running all tasks, and simply validates whether the source is syntactically correct. This is non-strict validation, i.e., invalid variable references are reported, but do not cause failure.

Specifying the --permissiveSearch flag matches file names more permissively, i.e., case-insensitive matching, and optional matching of files with .cbl extensions.

The help text is reproduced below (obtained by adding --help):

Usage: app run [-hpvV] [-d=<dialect>] [-dp=<dialectJarPath>]
               [-f=<flowchartOutputFormat>] [-g=<flowchartGenerationStrategy>]
               -r=<reportRootDir> -s=<sourceDir> -c=<commands>
               [-c=<commands>]... -cp=<copyBookDirs>[,<copyBookDirs>...]
               [-cp=<copyBookDirs>[,<copyBookDirs>...]]... [<programNames>...]
Implements various operations useful for reverse engineering Cobol code
      [<programNames>...]    The programs to analyse
  -c, --commands=<commands>  The commands to run (BUILD_BASE_ANALYSIS,
                               FLOW_TO_NEO4J, FLOW_TO_GRAPHML, WRITE_RAW_AST,
                               DRAW_FLOWCHART, WRITE_FLOW_AST, WRITE_CFG,
                               ATTACH_COMMENTS, WRITE_DATA_STRUCTURES,
                               BUILD_PROGRAM_DEPENDENCIES, COMPARE_CODE,
                               EXPORT_UNIFIED_TO_JSON, EXPORT_MERMAID)
      -cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
                             Copybook directories (repeatable)
  -d, --dialect=<dialect>    The COBOL dialect (COBOL, IDMS)
      -dp, --dialectJarPath=<dialectJarPath>
                             Path to dialect .JAR
  -f, --flowchartOutputFormat=<flowchartOutputFormat>
                             Format of the flowchart output (PNG, SVG)
  -g, --generation=<flowchartGenerationStrategy>
                             The flowchart generation strategy. Valid values
                               are PARAGRAPH, SECTION, PROGRAM, and NODRAW
  -h, --help                 Show this help message and exit.
  -p, --permissiveSearch     Match filename using looser criteria
  -r, --reportDir=<reportRootDir>
                             Output report directory
  -s, --srcDir=<sourceDir>   The Cobol source directory
  -v, --validate             Only run syntax validation on the input
  -V, --version              Print version information and exit.

Command: dependency (WIP)

This command is used to trace the inter-program dependencies starting from a root program. To run this, use something like:

java -jar smojol-cli/target/smojol-cli.jar dependency if-test.cbl --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir /Users/asgupta/code/smojol/smojol-test-code --dialectJarPath ./che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --export=out.json

Specifying the --neo4j flag injects those dependencies into Neo4J, while setting a path through --export will export them to JSON. The screenshot below shows a very simple dependency graph generated from if-test.cbl (in the smojol-test-code directory). Dynamic dependencies (resolved through variables and expressions) are resolved in a very simple heuristic fashion for now: the code looks back up to 5 instructions before the control flow statement to see if there are any literal assignments to the variable holding the call target. If it finds one, that literal value is used.

simple-inter-program-dependency-graph

The help text for this command is reproduced below:

Usage: app dependency [-hnpV] [-d=<dialect>] [-dp=<dialectJarPath>]
                      -s=<sourceDir> [-x=<exportPath>] -cp=<copyBookDirs>[,
                      <copyBookDirs>...] [-cp=<copyBookDirs>[,
                      <copyBookDirs>...]]... <programName>
Implements various operations useful for reverse engineering Cobol code
      <programName>          The program to analyse
      -cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
                             Copybook directories (repeatable)
  -d, --dialect=<dialect>    The COBOL dialect (COBOL, IDMS)
      -dp, --dialectJarPath=<dialectJarPath>
                             Path to dialect .JAR
  -h, --help                 Show this help message and exit.
  -n, --neo4j                Export to Neo4J
  -p, --permissiveSearch     Match filename using looser criteria
  -s, --srcDir=<sourceDir>   The Cobol source directory
  -V, --version              Print version information and exit.
  -x, --export=<exportPath>  Export path

Command: validate

As an alternative to using the --validate flag, you can use the validate command explicitly, like so:

java -jar smojol-cli/target/smojol-cli.jar validate minimum.cbl test-exp.cbl --srcDir smojol-test-code/ --copyBooksDir "/path/1/copybooks,/path/1/copybooks" --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --output out/validation.json

If you specify a path through --output, the results will be written to the corresponding file as JSON.

By default, this is non-strict validation, i.e., invalid variable references are reported, but do not cause failure. If the --strict flag is also specified, invalid variable references in the program will cause the task to fail.

The help text is reproduced below:

Usage: app validate [-hptV] [-d=<dialect>] [-dp=<dialectJarPath>]
                    [-o=<outputPath>] -s=<sourceDir> -cp=<copyBookDirs>[,
                    <copyBookDirs>...] [-cp=<copyBookDirs>[,
                    <copyBookDirs>...]]... [<programNames>...]
Validates the candidate COBOL code
      [<programNames>...]    The programs to analyse
      -cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
                             Copybook directories (repeatable)
  -d, --dialect=<dialect>    The COBOL dialect (COBOL, IDMS)
      -dp, --dialectJarPath=<dialectJarPath>
                             Path to dialect .JAR
  -h, --help                 Show this help message and exit.
  -o, --output=<outputPath>  Validation results output path
  -p, --permissiveSearch     Match filename using looser criteria
  -s, --srcDir=<sourceDir>   The Cobol source directory
  -t, --strict               Force strict validation, verify all variable
                               usages are valid
  -V, --version              Print version information and exit.

Command: interpret

To run the interpreter, use the interpret command, like in the example below. Most of the options overlap with other commands.

java -jar smojol-cli/target/smojol-cli.jar interpret test-exp.cbl --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir "/Users/asgupta/code/smojol/smojol-test-code" --dialectJarPath /Users/asgupta/code/smojol/che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect COBOL --resolveTactic=CONSOLE

The resolveTactic parameters are as below:

The help text for the interpret command is reproduced below.

Usage: app interpret [-hpV] [-d=<dialect>] [-dp=<dialectJarPath>]
                     -s=<sourceDir> [-t=<resolutionTactic>] -cp=<copyBookDirs>[,
                     <copyBookDirs>...] [-cp=<copyBookDirs>[,
                     <copyBookDirs>...]]... <programName>
Interprets the COBOL source
      <programName>          The program to analyse
      -cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
                             Copybook directories (repeatable)
  -d, --dialect=<dialect>    The COBOL dialect (COBOL, IDMS)
      -dp, --dialectJarPath=<dialectJarPath>
                             Path to dialect .JAR
  -h, --help                 Show this help message and exit.
  -p, --permissiveSearch     Match filename using looser criteria
  -s, --srcDir=<sourceDir>   The Cobol source directory
  -t, --resolveTactic=<resolutionTactic>
                             The condition resolution strategy (YES, NO,
                               CONSOLE, EVAL)
  -V, --version              Print version information and exit.

Programmatic Usage

NOTE: The API is under active development, and may be subject to change.

The simplest way to invoke tasks through the API is using CodeTaskRunner, like so:

String programName = "implicit-loop.cbl";
Map<String, List<AnalysisTaskResult>> result = new CodeTaskRunner("/path/to/src",
        "/path/to/report",
        ImmutableList.of(new File("/path/1/to/cpy"),
                new File("/path/2/to/cpy"),
                new File("/path/3/to/cpy")),
        "/path/to/dialect-idms.jar",
        LanguageDialect.IDMS,
        new FullProgram(FlowchartOutputFormat.MERMAID),
        new UUIDProvider(),
        new OccursIgnoringFormat1DataStructureBuilder(),
        new ProgramSearch(),
        new LocalFilesystemOperations())
        .runForPrograms(ImmutableList.of(BUILD_BASE_ANALYSIS, BUILD_TRANSPILER_FLOWGRAPH),
                ImmutableList.of(programName));
List<AnalysisTaskResult> results = result.get(programName);
TranspilerFlowgraph transpilerFlowgraph = ((AnalysisTaskResultOK) results.get(1)).getDetail();
TranspilerNode tree = transpilerFlowgraph.transpilerTree();

The above performs the base analysis and then the actual analysis we are interested in, namely, building the transpilation model. A lot of dependencies need to be specified as of now; simpler defaults will be added going forward.

Depending upon the number of tasks invoked, the result will contain a list of AnalysisTaskResult objects, which can be either AnalysisTaskResultOK or AnalysisTaskResultError. You can use them to determine what you want to do.

Please note that BUILD_BASE_ANALYSIS will always be the first task to run, before any of the other tasks, whether specified or not. Thus, the results of any actual analysis will always start from the second element.

This invocation uses some specific conventions when deciding where to output file artifacts under the report-dir directory. If you want more fine-grained control of the location of output artifacts, you can use the SmojolTasks class, which gives you more configurability in exchange for having to provide more detailed specifications.

NOTE: For all analyses, specifying the OccursIgnoringFormat1DataStructureBuilder class is preferable to prevent extra noise that can be generated when creating arrays for structures using OCCURS clauses. However, the DefaultFormat1DataStructureBuilder should be specified when running the interpreter, because that will require the correct number of elements in array data structures.

Programmatic examples are provided in the following classes.

Logging Settings

You can specify a custom logging settings file by adding the -Djava.util.logging.config.file option. If not specified, a default logging.properties will be loaded, with INFO as the default level.
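For example (path illustrative):

java -Djava.util.logging.config.file=/path/to/logging.properties -jar smojol-cli/target/smojol-cli.jar run ...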

Catalogue of Reusable Algorithms and Data Structures

This is a list of algorithms written from scratch, for reference or reuse. All of them use JGraphT for representing graph structures.

A Note on Copyright

Caveats

Known Issues

References

The rest of this file is mostly technical notes for my personal documentation.

Valid Type Specifications for External Zoned Decimal and Alphanumeric

| Sym-1 / Sym-2     | S (Sign) | P (Left) | P (Right) | V (Decimal Point) | X (Alphanumeric) | 9 (Number) |
|-------------------|----------|----------|-----------|-------------------|------------------|------------|
| S (Sign)          | -        | X        | X         | X                 | -                | -          |
| P (Left)          | -        | X        | -         | X                 | -                | X          |
| P (Right)         | -        | -        | X         | -                 | -                | -          |
| V (Decimal Point) | -        | -        | X         | -                 | -                | X          |
| X (Alphanumeric)  | -        | -        | -         | -                 | X                | X          |
| 9 (Number)        | -        | -        | X         | X                 | X                | X          |

Control Flow Notes

Use the following command to build the Graphviz flowchart:

dot -Kdot -v5 -Gsize=200,200\! -Goverlap=scale -Tpng -Gnslimit=4 -Gnslimit1=4 -Gmaxiter=2000 -Gsplines=line dotfile.dot -oflowchart-level5.png
