chipsalliance / verible

Verible is a suite of SystemVerilog developer tools, including a parser, style-linter, formatter and language server
https://chipsalliance.github.io/verible/
Other
1.25k stars 195 forks source link

[WIP][Discussion] Standalone preprocessor tool design #1358

Open karimtera opened 1 year ago

karimtera commented 1 year ago

Standalone Preprocessor Tool Design

Introduction

The goal of this thread is to discuss and put the outlines of the standalone preprocessor tool.

The preprocessor tool should fully support the directives described in 2018 SV-LRM, and provide APIs that can be easily used by other Verible tools.

This proposal is a continuation on the already built pseudo-preprocessing tool.

Flags description

Using the Abseil flags library allows an easy access to flags and their arguments.

Flags Type Default Description State
--strip_comments bool false This replaces comments with equal number of whitespaces, to preserve the same byte offsets. DONE
--generate_variants bool false This generates all the possible variants for conditionals. DONE
--multiple_cu bool true In case of true multiple files are passed to the tool, each one of them will be compiled in a separate compilation unit, thus declarations scopes will end by the end of each file (this is the default behavior), otherwise all files will be compiled in the same compilation unit. DONE

Positional arguments

Those mainly are the SV files to preprocess, defines and search-directory paths for includes. Any of the following positional argument can be passed one or more times.

Argument Description State
Specifies a SV file to be preprocessed. DONE
+define+\[=\] Defines one or more macros same as `define foo text DONE
+incdir+\ Specifies the directory to search for files included with `include "file_name" DONE

(IN-PROGRESS)

tgorochowik commented 1 year ago

These parameters look good to me. @hzeller @fangism please comment if you have some additional ideas.

One thing that we should consider early on is how to handle conditionals in the code, here are some notes from my discussion with @mglb

Consider following code:

module mod_name (input bus_a, output bus_b
`ifdef USE_PG_PIN
    , input pg_pin
`endif
    );
endmodule

To parse this into a tree, Verible could preprocess the code and generate two sources: one for condition being true, one for condition being false. Both sources would be parsed into alternate trees. The trees would be then merged to share common branches. Alternate branches would be wrapped in a dedicated node. The preprocessor lines (and inactive in a given variant conditional code blocks) could be treated like comments for parsing and formatting purposes.

What is needed

Note that the integration with linter and formatter can be done at a later step, but we have to keep in mind that this is the ultimate goal so we have to think about how to make that as easy as possible with the implementation we're doing

tgorochowik commented 1 year ago

And in general, the standalone tool should take raw systemverilog sources as input and ouput preprocessed code (only one branch of the conditional code based on the defines, macros should be resolved etc).

I suggest we start working on this one by one, e.g: pick the first feature you want to work on - I think conditionals are a great candidate - scan the tree for all conditional statements, create the trees as suggested above (or suggest an alternative solution), as the last step, look up all the defines (you can start with defines provided from the commandline and store info from `define XXX as the next step) and output only the selected branch in the final code.

karimtera commented 1 year ago

Hello All,

I am currently working on conditionals feature in the standalone tool PR #1360, I am facing a problem, and would like to know what you think about my solution before implementing it.

Problem: The verilog::VerilogPreprocess class has its only interface (public) method ScanStream which operates on verible::TokenStreamView that doesn't contain any white-space characters.

Using this method allows the tool to correctly choose the matching branch if a conditional is encountered, however it makes it hard to preserve the original white-space characters to output the SV source code as it was before preprocessing.

This problem will be even harder when ExpandMacro method is merged into Verible.

Solution:

  1. Adding a verilog::VerilogPreprocess::Config field called bypass_whitespaces that allows verilog::VerilogPreprocess to operate on token stream-views containing white-space characters by adding this behavior to methods that handle each token (not so smart? but works).
  2. Having the original SV code token stream-view (with white-spaces), and the preprocessed token steram-view, a two pointers algorithm can be implemented to solve this situation (smarter, but would be a problem in case of expanded macros with arguments).

Do you have any suggestions? Also, is there any existing utility that can help in this situation?

Thanks

fangism commented 1 year ago

Hello All,

I am currently working on conditionals feature in the standalone tool PR #1360, I am facing a problem, and would like to know what you think about my solution before implementing it.

Problem: The verilog::VerilogPreprocess class has its only interface (public) method ScanStream which operates on verible::TokenStreamView that doesn't contain any white-space characters.

Using this method allows the tool to correctly choose the matching branch if a conditional is encountered, however it makes it hard to preserve the original white-space characters to output the SV source code as it was before preprocessing.

This problem will be even harder when ExpandMacro method is merged into Verible.

Solution:

  1. Adding a verilog::VerilogPreprocess::Config field called bypass_whitespaces that allows verilog::VerilogPreprocess to operate on token stream-views containing white-space characters by adding this behavior to methods that handle each token (not so smart? but works).
  2. Having the original SV code token stream-view (with white-spaces), and the preprocessed token steram-view, a two pointers algorithm can be implemented to solve this situation (smarter, but would be a problem in case of expanded macros with arguments).

Do you have any suggestions? Also, is there any existing utility that can help in this situation?

Thanks

Hello! If you need to preserve whitespaces between tokens, but are only given the non-whitespaces tokens, you can always "restore" the gaps by constructing a string_view that spans the gaps between tokens with ranges like prev.end(), next.begin().

karimtera commented 1 year ago

And in general, the standalone tool should take raw systemverilog sources as input and ouput preprocessed code (only one branch of the conditional code based on the defines, macros should be resolved etc).

I suggest we start working on this one by one, e.g: pick the first feature you want to work on - I think conditionals are a great candidate - scan the tree for all conditional statements, create the trees as suggested above (or suggest an alternative solution), as the last step, look up all the defines (you can start with defines provided from the commandline and store info from `define XXX as the next step) and output only the selected branch in the final code.

I added a new command for the tool generate-variants which builds the control flow tree, and traverse it simply by dfs to generate and output all possible variants.

Example:

`define BAZ 1
`define MAZ 1
`ifdef BAZ
  module foo(); endmodule
  `ifdef MAZ
    module moo(); endmodule
  `else
    module not_moo(); endmodule
  `endif
  module foo2(); endmodule 
`else
  module not_foo(); endmodule
`endif
module m(); endmodule
Output: ````` Variant number 1: (#359: "module") (#293: "not_foo") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "m") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") Variant number 2: (#359: "module") (#293: "foo") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "not_moo") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "foo2") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "m") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") Variant number 3: (#359: "module") (#293: "foo") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "moo") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "foo2") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") (#359: "module") (#293: "m") (#40: "(") (#41: ")") (#59: ";") (#336: "endmodule") `````

These variants are ready to be passed to the parser without any needed modification.

Thanks,

karimtera commented 1 year ago

Hello All,

I am currently working on conditionals feature in the standalone tool PR #1360, I am facing a problem, and would like to know what you think about my solution before implementing it.

Problem: The verilog::VerilogPreprocess class has its only interface (public) method ScanStream which operates on verible::TokenStreamView that doesn't contain any white-space characters.

Using this method allows the tool to correctly choose the matching branch if a conditional is encountered, however it makes it hard to preserve the original white-space characters to output the SV source code as it was before preprocessing.

This problem will be even harder when ExpandMacro method is merged into Verible.

Solution:

  1. Adding a verilog::VerilogPreprocess::Config field called bypass_whitespaces that allows verilog::VerilogPreprocess to operate on token stream-views containing white-space characters by adding this behavior to methods that handle each token (not so smart? but works).

  2. Having the original SV code token stream-view (with white-spaces), and the preprocessed token steram-view, a two pointers algorithm can be implemented to solve this situation (smarter, but would be a problem in case of expanded macros with arguments).

Do you have any suggestions? Also, is there any existing utility that can help in this situation?

Thanks

Hello!

If you need to preserve whitespaces between tokens, but are only given the non-whitespaces tokens, you can always "restore" the gaps by constructing a string_view that spans the gaps between tokens with ranges like prev.end(), next.begin().

Sorry, I didn't understand your idea, can you elaborate more?

Thanks

fangism commented 1 year ago

Hello All,

I am currently working on conditionals feature in the standalone tool PR #1360, I am facing a problem, and would like to know what you think about my solution before implementing it.

Problem: The verilog::VerilogPreprocess class has its only interface (public) method ScanStream which operates on verible::TokenStreamView that doesn't contain any white-space characters.

Using this method allows the tool to correctly choose the matching branch if a conditional is encountered, however it makes it hard to preserve the original white-space characters to output the SV source code as it was before preprocessing.

Hello! If you need to preserve whitespaces between tokens, but are only given the non-whitespaces tokens, you can always "restore" the gaps by constructing a string_view that spans the gaps between tokens with ranges like prev.end(), next.begin().

Sorry, I didn't understand your idea, can you elaborate more?

Between any two tokens A, B that belong to the same memory buffer (which is true in Verible), you can construct a string_view of the characters between them, using one of these constructors:

https://en.cppreference.com/w/cpp/string/basic_string_view/basic_string_view https://github.com/abseil/abseil-cpp/blob/master/absl/strings/string_view.h

It appears the two-iterator range constructor isn't available until C++20, so something like this will work:

  // assuming A is before B in the same memory buffer
  absl::string_view text_between_A_and_B(A.end(), std::distance(A.end(), B.begin()));
mglb commented 1 year ago

The range between tokens can contain preprocessor directives and inactive code, so it can't be used in all cases. Example:

····module foo(); endmodule/*1*/

`ifdef MAZ
····module moo(); endmodule
`endif
····/*2*/module foo2(); endmodule 

For a variant where MAZ is not defined, /*1*/ and /*2*/ are consecutive tokens separated by 2 line breaks and 4 spaces, but the text between them is not just a whitespace. To make things worse, part of the whitespace is attached (in the text) to /*1*/, and the other part to /2/.


  1. Adding a verilog::VerilogPreprocess::Config field called bypass_whitespaces that allows verilog::VerilogPreprocess to operate on token stream-views containing white-space characters by adding this behavior to methods that handle each token (not so smart? but works).

Do you mean you want to add a new token type (token_enum value in TokenInfo) that represents whitespace? Or just include whitespace as a part of existing tokens? If the former, why does it need a config? It sounds good, whitespace tokens could probably be handled similarly to comments. If the latter, it could work if handled in other places (might be a lot of work though).

  1. Having the original SV code token stream-view (with white-spaces), and the preprocessed token steram-view, a two pointers algorithm can be implemented to solve this situation (smarter, but would be a problem in case of expanded macros with arguments).

I'm not sure I understand how would it work. Could you elaborate?

karimtera commented 1 year ago

Hello @mglb,

Thanks for your feedback!

Do you mean you want to add a new token type (token_enum value in TokenInfo) that represents whitespace? Or just include whitespace as a part of existing tokens? If the former, why does it need a config? It sounds good, whitespace tokens could probably be handled similarly to comments. If the latter, it could work if handled in other places (might be a lot of work though).

I meant the latter, and yes It's a lot of work. But, I am still figuring out how to do your first suggestion.

  1. Having the original SV code token stream-view (with white-spaces), and the preprocessed token steram-view, a two pointers algorithm can be implemented to solve this situation (smarter, but would be a problem in case of expanded macros with arguments).

I'm not sure I understand how would it work. Could you elaborate?

I meant to have a loop with 2 pointers (initially pointing to the beginning of both streams), let's assume: ptr_ws is pointing to the stream with white-spaces, while ptr_pp is pointing to the preprocessed stream.

The idea was to:

This is okay just when the non white-space tokens remain unchanged in both streams, but in case of an expanded macros, It might be a challenge.

karimtera commented 1 year ago

Hi @fangism, @mglb,

I added a comment in PR #1360 that describes my approach to the white-space problem we discussed.

I would appreciate if you can take a look, Thanks!

mglb commented 1 year ago

I added a comment in PR #1360 that describes my approach to the white-space problem we discussed.

I would appreciate if you can take a look, Thanks!

I'll look at it closely tomorrow

hzeller commented 1 year ago

BTW, for finding some interesting examples and 'good to evil' uses of the verilog preprocessor, always good to have a look at Wilson's paper https://veripool.org/papers/Preproc_Good_Evil_SNUGBos10_paper.pdf It might give some ideas for test-cases. (Wilson is the person behind Verilator).

karimtera commented 1 year ago

I read it when I was getting to know the project, didn't benefit from it a lot then. I will take another look now.