Add a new '--ignore-directive <directive>' command line option.

udif commented 1 year ago

Some tools have non-standard compiler directives. In order to be able to filter those specific directives in legacy code, but not to miss undefined macros or spelling mistakes, a new --ignore-directive <directive> command line option was added.

Here is an example:

Test code:

~/git/slang/build$ cat ~/v/d.v
`dummy_directive
`dummy_directive 1

module top;
`dummy_directive2
endmodule

No flags:

~/git/slang/build$ bin/slang  ~/v/d.v
Top level design units:
    top

../../../v/d.v:1:1: error: unknown macro or compiler directive '`dummy_directive'
`dummy_directive
^
../../../v/d.v:2:1: error: unknown macro or compiler directive '`dummy_directive'
`dummy_directive 1
^
../../../v/d.v:2:18: error: expected member
`dummy_directive 1
                 ^
../../../v/d.v:5:1: error: unknown macro or compiler directive '`dummy_directive2'
`dummy_directive2
^

Build failed: 4 errors, 0 warnings

New flag:

~/git/slang/build$ bin/slang --ignore-directive dummy_directive  --ignore-directive dummy_directive2 ~/v/d.v
Top level design units:
    top

Build succeeded: 0 errors, 0 warnings

udif commented 1 year ago

Regrading the code, I added a new constructor for StringTable<> and merged the lookup of the static directives table in LexerFacts.cpp with a dynamic table built by the command line options. I did this in order to minimize the runtime impact of looking at two lists, one of legal directives, and one of directives to be ignored. As a result, not using this function should have zero impact on runtime.

udif commented 1 year ago

rebase due to unrelated CI failure on macos (maybe due to not being on master when releasing the PR).

udif commented 1 year ago

According to the github action logs, build actually broke due to https://github.com/MikePopoloski/slang/commit/642ef985fae186818b8bfce4ba1d5ddfab8a4955

MikePopoloski commented 1 year ago

The build error should be fixed now, please rebase.

I think this could be done much more simply by having the ignore list in the Preprocessor instead, and looking at it separately in handleTopLevelMacro() if we fail to look up a macro usage. You could then check if the "macro usage" is instead one of your ignored directives and consume tokens if so. That way you don't need to modify StringTable, create new syntax kinds, etc. This doesn't add any overhead unless a macro usage is not found, which is either an error case and so not expected in valid code or it's a usage of this new feature.

udif commented 1 year ago

rebased due to CI failure fix.

udif commented 1 year ago

I've pushed a new version that works at the preprocessor level as suggested. It passes the same manual tests appearing in the top post. The only remaining issue is how to handle the ignored directives. My intent was to save them as trivia, but I wasn't sure exactly how to do this, as the original code returned a nullptr instead of a trivia in a token when the macro was not recognized.

udif commented 1 year ago

I fixed locally all the trivial changes (comments, casts), but I still need to go over the Trivia and Token classes and member functions, to better understand them.

udif commented 1 year ago

I'm still working on this. What blocks me at the moment is the code to get a string_view of the line being skipped (after the skipped directive).

I'm trying different approaches as you can see in the code below (full of debug prints):

        if (options.ignoreDirectives.find(directive.valueText().substr(1)) !=
            options.ignoreDirectives.end()) {
            auto start_offset = peek().location().offset();
            std::cout << "Skipped directive\n";
            const char* start =currentToken.rawText().data();
            std::cout << "start loc:"  << start_offset << std::endl;
            while (currentToken.kind != TokenKind::EndOfFile && peekSameLine()) {
                std::cout << "consume:" << currentToken.toString() << std::endl;
                consume();
            }
            const char* end = peek().rawText().data();
            auto end_offset = currentToken.location().offset();
            std::cout << "size: " << (end-start) << std::endl;
            std::cout << "end loc:" << end_offset << std::endl;
            return std::make_pair(nullptr, Trivia(TriviaKind::SkippedTokens, string_view(start, end - start)));
        }

I can get the text offset in the buffer via peek().location().offset() (peek() is used as currentToken is sometimes null), but I can't get the buffer start address. I can get it via DiagnosticClient::getSourceLine but I need a pointer to an object of this class. I also tried peek().rawText().data() but it seems to give a pointer to an arbitrary pointer (temporary?), not a pointer into the source buffer, as subtracting the end pointer from the start pointer gives a random size, sometimes negative.

What is the simplest way to get the text between two Token objects? (assuming they are from the same source buffer)

MikePopoloski commented 1 year ago

rawData() should return a pointer into the source buffer (not a temporary). Possibly you're not looking at the right buffer start pointer?

Regardless though you shouldn't need this; for SkippedToken trivia you can use the Trivia() constructor that takes a list of tokens skipped instead of trying to compute the raw text (which doesn't in general work anyway since the tokens that come from, say, a macro expansion will reside in a different buffer).

udif commented 1 year ago

What is the simplest way to check if the Trivia nodes are added correctly? I noticed the rewriter tool doesn't take any command line arguments so I can't pass any --ignore-directive flag there. The slang JSON dump don't list those either.

MikePopoloski commented 1 year ago

Other than the testing stuff though this looks good to me.

udif commented 1 year ago

OK, I have this test:

TEST_CASE("Unknown but ignored directive") {
    auto& text = R"(
`unknown_pragma xyz abc 123
)";
    PreprocessorOptions ppOptions;
    ppOptions.ignoreDirectives.emplace("unknown_pragma");

    Bag options;
    options.set(ppOptions);

    auto result = preprocess(text, options);
    CHECK(result == "\n");

    REQUIRE(diagnostics.size() == 0);

    auto tree = SyntaxTree::fromText(text, SyntaxTree::getDefaultSourceManager(), "", "", options);
    CHECK(SyntaxPrinter::printFile(*tree) == text);

It is passing as long as the skipped preprocessor directive has no arguments. When it has arguments (like in this example), I get a SIGSEGV.

slang::parsing::Token::location(const slang::parsing::Token * this) (\home\udif\git\slang\source\parsing\Token.cpp:274)
slang::parsing::Trivia::getExplicitLocation(const slang::parsing::Trivia * this) (\home\udif\git\slang\source\parsing\Token.cpp:122)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, slang::parsing::Token token) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:70)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, const slang::syntax::SyntaxNode & node) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:99)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, const slang::syntax::SyntaxTree & tree) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:105)
slang::syntax::SyntaxPrinter::printFile[abi:cxx11](slang::syntax::SyntaxTree const&)(const slang::syntax::SyntaxTree & tree) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:117)
CATCH2_INTERNAL_TEST_244() (\home\udif\git\slang\tests\unittests\PreprocessorTests.cpp:2221)
...

Apparently, it is looking for an "explicit location" of the trivia in info->location.

udif commented 1 year ago

Bottom line, your suggestion works. Before trying that, I spent an hour writing down why this shouldn't matter.

1 Hour earlier... I made a small change to the code above, replacing ignore_tokens with a different std::vector<Token> built using a copy constructor, as well as breaking it into two lines so I can more easily examine the return value in the debugger.

            else {
                auto rval = std::make_pair(nullptr, Trivia(TriviaKind::SkippedTokens, std::vector<Token>(ignore_tokens)));
                return rval;
            }

I then checked rval.second.tokens.ptr and rval.second.tokens.len then compared it to actualArgs.second.tokens.ptr and actualArgs.second.tokens.len in the calling function, handleMacroUsage:

    const std::pair<MacroActualArgumentListSyntax*, Trivia>& actualArgs = handleTopLevelMacro(
        directive);

The result was that the ptr & len in both cases was identical, and was heap allocated (stack addresses start at 0x8000000000000000 grow downwards, while for heap I see 0x555555... ) I then removed the copy-constructor I added above, and used ignore_tokens directly as before. what I see is that ignore_tokens is built on the stack, but it's data array , &ignore_tokens[0] is allocated on the heap (which is expected, given that it needs to be able to grow). Following this back one level up, actualArgs.second.tokens.ptr is holding the same heap address as &ignore_tokens[0].

Back to present time... So even in the original case, the Token* pointer in Trivia is pointing to a heap-allocated object, so why did replacing it with SmallVectorSized and calling it's copy method made a difference?

MikePopoloski commented 1 year ago

To be more clear about my previous comment and why this doesn't work -- yes, the vector allocates its memory on the heap, but when the vector itself goes out of scope its destructor is called and the memory is freed. The Trivia object takes a span<> pointing to the memory containing the skipped tokens -- once the memory is freed that pointer is no longer valid. Running a copy constructor doesn't help you because the copied object will be destroyed at the end of that line of code and again the pointed to memory will be invalid.

SmallVector works here because the copy() method copies the memory into the provided allocator object, which has an appropriate lifetime and will keep it around for as long as needed. This is achievable without SmallVector of course by just using the allocator object directly and doing the copy yourself but this way is more convenient.

udif commented 1 year ago

OK, I missed that.

When is the memory allocated by the bump allocated freed? Looking around at the code, I assume that all memory allocated inside the preprocessor using the alloc object, is bulk-freed at the end of preprocessing when called by driver::runPreprocessor or at the end of parse tree creation if called by SyntaxTree::create ?

I've also tried to do the same with new (which means the object would be lost since no one would free it), only to find out there is no way to do copy initialization as part of object allocation by new.

Given that this is already fixed and committed, Is there anything else blocking this PR?

MikePopoloski commented 1 year ago

The BumpAllocator lives inside the SyntaxTree that owns all the memory for that tree. When it gets destroyed all memory related to that tree gets freed (including preprocessor stuff, tokens, syntax nodes, etc).

MikePopoloski / slang

Add a new '--ignore-directive <directive>' command line option. #620