Closed udif closed 1 year ago
Regrading the code, I added a new constructor for StringTable<>
and merged the lookup of the static directives table in LexerFacts.cpp with a dynamic table built by the command line options.
I did this in order to minimize the runtime impact of looking at two lists, one of legal directives, and one of directives to be ignored. As a result, not using this function should have zero impact on runtime.
rebase due to unrelated CI failure on macos (maybe due to not being on master when releasing the PR).
According to the github action logs, build actually broke due to https://github.com/MikePopoloski/slang/commit/642ef985fae186818b8bfce4ba1d5ddfab8a4955
The build error should be fixed now, please rebase.
I think this could be done much more simply by having the ignore list in the Preprocessor instead, and looking at it separately in handleTopLevelMacro() if we fail to look up a macro usage. You could then check if the "macro usage" is instead one of your ignored directives and consume tokens if so. That way you don't need to modify StringTable, create new syntax kinds, etc. This doesn't add any overhead unless a macro usage is not found, which is either an error case and so not expected in valid code or it's a usage of this new feature.
rebased due to CI failure fix.
I've pushed a new version that works at the preprocessor level as suggested. It passes the same manual tests appearing in the top post.
The only remaining issue is how to handle the ignored directives. My intent was to save them as trivia, but I wasn't sure exactly how to do this, as the original code returned a nullptr
instead of a trivia in a token when the macro was not recognized.
I fixed locally all the trivial changes (comments, casts), but I still need to go over the Trivia and Token classes and member functions, to better understand them.
I'm still working on this.
What blocks me at the moment is the code to get a string_view
of the line being skipped (after the skipped directive).
I'm trying different approaches as you can see in the code below (full of debug prints):
if (options.ignoreDirectives.find(directive.valueText().substr(1)) !=
options.ignoreDirectives.end()) {
auto start_offset = peek().location().offset();
std::cout << "Skipped directive\n";
const char* start =currentToken.rawText().data();
std::cout << "start loc:" << start_offset << std::endl;
while (currentToken.kind != TokenKind::EndOfFile && peekSameLine()) {
std::cout << "consume:" << currentToken.toString() << std::endl;
consume();
}
const char* end = peek().rawText().data();
auto end_offset = currentToken.location().offset();
std::cout << "size: " << (end-start) << std::endl;
std::cout << "end loc:" << end_offset << std::endl;
return std::make_pair(nullptr, Trivia(TriviaKind::SkippedTokens, string_view(start, end - start)));
}
I can get the text offset in the buffer via peek().location().offset()
(peek()
is used as currentToken
is sometimes null), but I can't get the buffer start address. I can get it via DiagnosticClient::getSourceLine
but I need a pointer to an object of this class.
I also tried peek().rawText().data()
but it seems to give a pointer to an arbitrary pointer (temporary?), not a pointer into the source buffer, as subtracting the end pointer from the start pointer gives a random size, sometimes negative.
What is the simplest way to get the text between two Token
objects? (assuming they are from the same source buffer)
rawData() should return a pointer into the source buffer (not a temporary). Possibly you're not looking at the right buffer start pointer?
Regardless though you shouldn't need this; for SkippedToken trivia you can use the Trivia() constructor that takes a list of tokens skipped instead of trying to compute the raw text (which doesn't in general work anyway since the tokens that come from, say, a macro expansion will reside in a different buffer).
What is the simplest way to check if the Trivia
nodes are added correctly? I noticed the rewriter
tool doesn't take any command line arguments so I can't pass any --ignore-directive
flag there. The slang JSON dump don't list those either.
Other than the testing stuff though this looks good to me.
OK, I have this test:
TEST_CASE("Unknown but ignored directive") {
auto& text = R"(
`unknown_pragma xyz abc 123
)";
PreprocessorOptions ppOptions;
ppOptions.ignoreDirectives.emplace("unknown_pragma");
Bag options;
options.set(ppOptions);
auto result = preprocess(text, options);
CHECK(result == "\n");
REQUIRE(diagnostics.size() == 0);
auto tree = SyntaxTree::fromText(text, SyntaxTree::getDefaultSourceManager(), "", "", options);
CHECK(SyntaxPrinter::printFile(*tree) == text);
It is passing as long as the skipped preprocessor directive has no arguments. When it has arguments (like in this example), I get a SIGSEGV.
slang::parsing::Token::location(const slang::parsing::Token * this) (\home\udif\git\slang\source\parsing\Token.cpp:274)
slang::parsing::Trivia::getExplicitLocation(const slang::parsing::Trivia * this) (\home\udif\git\slang\source\parsing\Token.cpp:122)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, slang::parsing::Token token) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:70)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, const slang::syntax::SyntaxNode & node) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:99)
slang::syntax::SyntaxPrinter::print(slang::syntax::SyntaxPrinter * this, const slang::syntax::SyntaxTree & tree) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:105)
slang::syntax::SyntaxPrinter::printFile[abi:cxx11](slang::syntax::SyntaxTree const&)(const slang::syntax::SyntaxTree & tree) (\home\udif\git\slang\source\syntax\SyntaxPrinter.cpp:117)
CATCH2_INTERNAL_TEST_244() (\home\udif\git\slang\tests\unittests\PreprocessorTests.cpp:2221)
...
Apparently, it is looking for an "explicit location" of the trivia in info->location
.
Bottom line, your suggestion works. Before trying that, I spent an hour writing down why this shouldn't matter.
1 Hour earlier...
I made a small change to the code above, replacing ignore_tokens
with a different std::vector<Token>
built using a copy constructor, as well as breaking it into two lines so I can more easily examine the return value in the debugger.
else {
auto rval = std::make_pair(nullptr, Trivia(TriviaKind::SkippedTokens, std::vector<Token>(ignore_tokens)));
return rval;
}
I then checked rval.second.tokens.ptr
and rval.second.tokens.len
then compared it to actualArgs.second.tokens.ptr
and actualArgs.second.tokens.len
in the calling function, handleMacroUsage
:
const std::pair<MacroActualArgumentListSyntax*, Trivia>& actualArgs = handleTopLevelMacro(
directive);
The result was that the ptr & len in both cases was identical, and was heap allocated (stack addresses start at 0x8000000000000000 grow downwards, while for heap I see 0x555555... )
I then removed the copy-constructor I added above, and used ignore_tokens
directly as before.
what I see is that ignore_tokens
is built on the stack, but it's data array , &ignore_tokens[0]
is allocated on the heap (which is expected, given that it needs to be able to grow). Following this back one level up, actualArgs.second.tokens.ptr
is holding the same heap address as &ignore_tokens[0]
.
Back to present time...
So even in the original case, the Token*
pointer in Trivia
is pointing to a heap-allocated object, so why did replacing it with SmallVectorSized
and calling it's copy
method made a difference?
To be more clear about my previous comment and why this doesn't work -- yes, the vector allocates its memory on the heap, but when the vector itself goes out of scope its destructor is called and the memory is freed. The Trivia object takes a span<> pointing to the memory containing the skipped tokens -- once the memory is freed that pointer is no longer valid. Running a copy constructor doesn't help you because the copied object will be destroyed at the end of that line of code and again the pointed to memory will be invalid.
SmallVector works here because the copy() method copies the memory into the provided allocator object, which has an appropriate lifetime and will keep it around for as long as needed. This is achievable without SmallVector of course by just using the allocator object directly and doing the copy yourself but this way is more convenient.
OK, I missed that.
When is the memory allocated by the bump allocated freed? Looking around at the code, I assume that all memory allocated inside the preprocessor using the alloc
object, is bulk-freed at the end of preprocessing when called by driver::runPreprocessor
or at the end of parse tree creation if called by SyntaxTree::create
?
I've also tried to do the same with new
(which means the object would be lost since no one would free it), only to find out there is no way to do copy initialization as part of object allocation by new
.
Given that this is already fixed and committed, Is there anything else blocking this PR?
The BumpAllocator lives inside the SyntaxTree that owns all the memory for that tree. When it gets destroyed all memory related to that tree gets freed (including preprocessor stuff, tokens, syntax nodes, etc).
Some tools have non-standard compiler directives. In order to be able to filter those specific directives in legacy code, but not to miss undefined macros or spelling mistakes, a new
--ignore-directive <directive>
command line option was added.Here is an example:
Test code:
No flags:
New flag: