antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.

Not getting whitespace in extracted C code. #3800

Open yash745-deloitte opened 12 months ago

yash745-deloitte commented 12 months ago

Hi all,

I have been trying to extract C code using this C Grammar.

However, I am running into a problem with whitespace: it is missing from the extracted code. A sample of the extracted output is given below:

Code:-

Imports: []

Variables: ['inta=5;', 'structPerson{charname[50];intage;floatheight;};', 'intx=10;']

Functions: ['main()', 'car()']

Function Implementations: ['intmain(){intx=10;printf("Hello, world!");return0;}', 'voidcar(){printf("Chain kuli ki man kuli");}']

Struct Declarations: ['structPerson{charname[50];intage;floatheight;}']


The Python code I am using for the extraction above is given below:

```python
from antlr4 import *
from cGrammarListener import cGrammarListener
from cGrammarParser import cGrammarParser
from cGrammarLexer import cGrammarLexer


class CDetailsListener(cGrammarListener):
    def __init__(self):
        self.imports = []
        self.variables = []
        self.functions = []
        self.function_implementations = []
        self.struct_variable = []

    def enterPreprocessorDirective(self, ctx):
        self.imports.append(ctx.getText())

    def enterDeclaration(self, ctx):
        self.variables.append(ctx.getText())

    def enterFunctionDefinition(self, ctx):
        self.functions.append(ctx.declarator().getText())
        self.function_implementations.append(ctx.getText())

    def enterStructOrUnionSpecifier(self, ctx):
        self.struct_variable.append(ctx.getText())


def extract_details_from_c_code():
    lexer = cGrammarLexer(FileStream("C/sample.c"))
    stream = CommonTokenStream(lexer)
    parser = cGrammarParser(stream)

    # Parse the code and generate a parse tree
    tree = parser.compilationUnit()

    # Initialize the listener and traverse the parse tree
    listener = CDetailsListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)

    # Access the extracted details
    imports = listener.imports
    variables = listener.variables
    functions = listener.functions
    function_implementations = listener.function_implementations
    struct_variable = listener.struct_variable

    return imports, variables, functions, function_implementations, struct_variable


# Example usage
imports, variables, functions, function_implementations, struct_variable = extract_details_from_c_code()

print("Imports:")
print(imports)

print("\nVariables:")
print(variables)

print("\nFunctions:")
print(functions)

print("\nFunction Implementations:")
print(function_implementations)

print("\nStruct Declarations:")
print(struct_variable)
```

If anyone has a workaround for this issue, please respond; it would be a great help. If you have any further questions, please feel free to ask.

kaby76 commented 12 months ago

GetText() (getText() in the Python runtime) returns the text of a node in the parse tree. Off-channel tokens, and chars in the input char buffer that aren't part of any token, are not added to the text that GetText() reconstructs. Tokens on channel HIDDEN (=1) never appear in the parse tree, so they don't appear in the reconstructed text either. Read the code for GetText() in the runtime: it reconstructs the text of a tree node by recursively calling the method on all children and returning the concatenation of the results, which leaves out the off-channel tokens.
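To make the behaviour described above concrete, here is a minimal sketch that does not use the ANTLR runtime at all. `Tok`, `Node`, `build_tree` and `get_text` are hypothetical stand-ins invented for illustration; they only mimic the idea that the tree holds default-channel tokens and that the node text is the concatenation of its leaves.

```python
from dataclasses import dataclass, field
from typing import List, Union

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1


@dataclass
class Tok:
    """Hypothetical stand-in for a lexer token."""
    text: str
    channel: int = DEFAULT_CHANNEL


@dataclass
class Node:
    """Hypothetical stand-in for a parse tree node."""
    children: List[Union["Node", Tok]] = field(default_factory=list)

    def get_text(self) -> str:
        # Concatenate the text of every child, depth first.
        return "".join(
            c.get_text() if isinstance(c, Node) else c.text
            for c in self.children
        )


def build_tree(tokens: List[Tok]) -> Node:
    # The parser only consumes default-channel tokens,
    # so only those can become leaves of the tree.
    return Node([t for t in tokens if t.channel == DEFAULT_CHANNEL])


tokens = [
    Tok("int"), Tok(" ", HIDDEN_CHANNEL), Tok("a"), Tok(" ", HIDDEN_CHANNEL),
    Tok("="), Tok(" ", HIDDEN_CHANNEL), Tok("5"), Tok(";"),
]
tree = build_tree(tokens)
print(tree.get_text())  # inta=5;
```

The whitespace tokens still exist in the token list; they are simply never reachable from the tree, which is why the node text comes out as `inta=5;`.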

You will need to write your own code to reconstruct the text you want. This is easy because you can get the token indices of the leftmost and rightmost leaf nodes of the subtree. Then, you could get the text one of two ways:

(a) Write a for-loop that goes through each token on the token stream between start and end and prints out its text, including tokens on channel HIDDEN.

(b) For some grammars, "skip" is used. Skipped tokens don't appear on the token stream at all, so you will need to work with the char buffer itself. But, you don't have that problem with this grammar, and most grammars in grammars-v4 have been adjusted to not use "skip".
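The two approaches above can be sketched without the ANTLR runtime. Everything here (`Tok`, `lex`, `text_from_tokens`, `text_from_chars`) is a hypothetical stand-in written for illustration, with a toy regex lexer in place of the generated one. In the Python runtime the corresponding values should be `ctx.start` and `ctx.stop` (the first and last tokens of a rule context), their `tokenIndex`, `start` and `stop` attributes, and the token list held by the `CommonTokenStream`, but check those names against the runtime version you have installed.

```python
import re
from dataclasses import dataclass
from typing import List

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1


@dataclass
class Tok:
    """Hypothetical stand-in for a lexer token."""
    text: str
    channel: int
    token_index: int  # position in the token list
    start: int        # offset of the first char, inclusive
    stop: int         # offset of the last char, inclusive


def lex(source: str) -> List[Tok]:
    # Toy lexer: whitespace runs go to the hidden channel,
    # words and single punctuation chars go to the default channel.
    tokens = []
    for i, m in enumerate(re.finditer(r"\s+|\w+|[^\w\s]", source)):
        text = m.group()
        channel = HIDDEN_CHANNEL if text.isspace() else DEFAULT_CHANNEL
        tokens.append(Tok(text, channel, i, m.start(), m.end() - 1))
    return tokens


def text_from_tokens(tokens: List[Tok], first: Tok, last: Tok) -> str:
    # (a) Join every token between the two indices, hidden ones included.
    return "".join(t.text for t in tokens[first.token_index:last.token_index + 1])


def text_from_chars(source: str, first: Tok, last: Tok) -> str:
    # (b) Slice the raw input; this also works when whitespace was skipped
    # and therefore never reached the token list.
    return source[first.start:last.stop + 1]


source = "int a = 5;\nint x = 10;"
tokens = lex(source)

# The leftmost and rightmost leaves of the first declaration are the
# first and fifth default-channel tokens: "int" and ";".
visible = [t for t in tokens if t.channel == DEFAULT_CHANNEL]
first, last = visible[0], visible[4]

print(text_from_tokens(tokens, first, last))  # int a = 5;
print(text_from_chars(source, first, last))   # int a = 5;
```

In a listener, the same idea would replace each `ctx.getText()` call with a helper that takes the context's first and last tokens. Approach (b) is the more general of the two, since it depends only on character offsets.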

So, this isn't a bug in Antlr, nor in the C grammar. But you're not the only one who has "discovered" this problem. This is one of the things I don't care for in Antlr. In my Antlr Toolkit, Trash, I rewrite all the trees to include "off-channel" text in the tree. That allows for a cleaner way to query the tree using XPath expressions, and to modify it using XQuery.