Open glenn-slayden opened 5 years ago
Thanks. Very nice analysis. I assume it's been that way forever, so not a high priority to fix, but we'll probably accept a PR for it.
ILAsm currently cannot be built by outsiders due to a dependency on a Microsoft-internal tool "VCBU Internal YACC," so this might make PR more difficult...
@glenn-slayden dotnet/runtime#4776 has some info on using btyacc
to prototype a fix. If the PR is otherwise correct, we'd be happy to regenerate the yacc output using the internal tool if we have to. Obviously, we should figure out how to fix this yacc dependency, but it's not high on the priority list, unfortunately.
Fixing this may be subtle. The current ILAsm behavior is that block-style comments /* ... */
generally trump everything (except single line comment // ...
). For the particular case here, /* #if... */
will always vacate the preprocessor directive (regardless of, i.e., sequential order-of-appearance between the two, meaning #if /* ... #endif */
leaves the preprocessor dangling). Though perhaps not entirely ideal, this greedy design is at least defensible.
But I digress. As described in the issue, what's unexpected is that string literals are not interpreted at all when /* ...
is searching for its closing */
, and that seems patently wrong.
The caution I wanted to add was that there's no requirement that any appearance of the double-quote character within a block-comment be delimiting some well-formed string literal. For example, the following currently works fine, a behavior which probably shouldn't be changed:
ldstr /* "incomplete string */ "foo bar" // ok
So for fixing this issue: Searching for the closing */
within a (formative) block comment should only exclude (skip over) the contents of string literals which can be deemed "well-formed" according to some appropriately weak notion.
Other examples:
/*
and */
are evaluated in disabled #ifdef
blocks (the 2nd #endif
is required in the following):
ldarg #ifdef EXCLUDE / #endif / garbage #endif the_arg // ok --> 'ldarg the_arg'
#if...
directives are never evaluated within /* ... */
(i.e., don't need any #endif
in the following):
ldarg / #ifdef EXCLUDE / the_arg // ok
The problem described above does not occur with single-line comments. All of the following are ok:
ldc.i4.0 // "foo /* foo" // ok
ldc.i4.0 // foo /* foo // ok
ldc.i4.0 // foo /* "foo // ok
#ifdef __EXCLUDE__
ldc.i4.0 // "foo /* foo" // ok (excluded)
ldc.i4.0 // foo /* foo // ok (excluded)
ldc.i4.0 // foo /* "foo // ok (excluded)
#endif
So single-line comment is the only way to hide /*
(no block comment is opened in the following):
ldarg the_arg // foo /* ok
But, block comment can hide //
(closing */
is effective in the following):
ldarg /* foo // */
the_arg // ok
// ...and not...
ldarg /* foo // */ // NO
*/ t // excess block-comment close
Hints of "rock-paper-scissors?" Points (4) and (5) show that single-line comments have the same precedence as block comments. Only between these two is the relative priority established by forward processing according to sequential order of appearance in the source file.
Obviously, we should figure out how to fix this yacc dependency, but it's not high on the priority list, unfortunately.
@RussKeldorph Actually, I'd say that IL(D)Asm are both long overdue for full modernization / re-implementation from scratch in C#. Is that a PR you'd accept?
@glenn-slayden I think so. We've certainly talked about it for years. As long as it came with sufficient tests to verify near-conformance (possibly even bug-for-bug compat in some cases) with existing tools, we'd love to get rid of this unnecessarily native baggage. It would make it more convenient to integrate these tools into the SDK, and it would probably be easier for repos like Roslyn that use ILAsm
as a validation tool.
FYI @jaredpar says https://github.com/mono/mono/tree/master/mcs/ilasm is pretty far from full conformance with ilasm corner cases, but it might be a place to start. If that became conformant, I assume it would become the "one true" ilasm and we could delete the source entirely from this repo.
/cc: @jkotas in case I'm too far into wishful thinking
After doing some preliminary investigations on this, it occurs to me that perhaps F# rather than C# might be a more elegant choice for implementing an (e.g.,) LALR parser such as (e.g.,) ILAsm, perhaps because the functional programming approach itself more closely models the declarative form of a BNF--or, more generally, any "formal"--language grammar?
As with the .NET ILAsm, the mono version that @jaredpar pointed to is fundamentally Yacc-based. I cleaned up that C#-based IL assembler source quite a bit for vs2017 and it builds, runs, and then builds PEs that work fine on .NET Framework 4.7.2, at least for the simple tests I've tried so far. The snapshot I linked consolidates all minimal dependencies; in particular it includes the complete/updated/working C sources and vcxproj
for the jay.exe
tool which generates the C# parser source file ILParser.jay.cs
from the declarative grammar definition file ILParser.jay
.
Although that code will be a useful reference, I'm not sure I will be able to fully use it as a starting point for what I'm working on, where I'm hoping to build and maintain a mutable Abtract Semantic Graph that's bidirectionally associated with the "live" IL source. That goal appears to need fundamentally different data structures and design, as compared to the traditional grammar approach which focuses on the (very different) one-time, one-way parsing scenario, and around which the Yacc tools and traditions have evolved.
Fixing this would require rewriting the assembler in a new language other that MS yacc/lex.
Using a different YACC tool, as mentioned from https://github.com/dotnet/runtime/issues/4776 , we might be able to address this more easily depending on how flexible the tool is.
ILAsm.exe incorrectly opens a C-style block comment
/* ... */
when the opening sequence/*
appears in a string literal that occurs within the disabled portion(s) of an#ifdef ... #else ... #endif
preprocessor block.The two code examples shown below are identical except the latter contains an
#ifdef
block illustrating the problem. The first example works correctly, but the second causes ILAsm.exe to fail.In the second example, since the symbol
__NOT_DEFINED__
is not defined, the first part of the#ifdef
block should be ignored. The excluded part of the#ifdef
block contains a string literal which happens to contain the block comment opening sequence/*
.app_ok.il
:result - ok :
app_fail.il
:result - failure :
Error:
*ILAsm.exe incorrectly considers the text "
/*
", which appears within the string literal, to be an active block comment opening sequence.*This only occurs within disabled preprocessor blocks. If the
/*
portion of the text is removed from the failure example, or if the (incorrectly-opened) block comment is "properly" closed via a*/
sequence within the disabled section (this can be either within the string literal or subsequently), then the error goes away and the ILAsm.exe command succeeds.Expected:
The happenstance contents of a string literal should not affect the success or failure of an ILAsm.exe invocation.
Notes:
Within preprocessor blocks that are enabled (active), string literals containing the comment sequence are handled correctly (not interpreted).
The same failure also occurs when the
/*
sequence appears outside of a string literal within a disabled preprocessor block, but this condition seems less obviously erroneous.