dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.46k stars 4.76k forks source link

ILAsm: Fail on string literals containing "... /* ..." within disabled #ifdef block #12418

Open glenn-slayden opened 5 years ago

glenn-slayden commented 5 years ago

ILAsm.exe incorrectly opens a C-style block comment /* ... */ when the opening sequence /* appears in a string literal that occurs within the disabled portion(s) of an #ifdef ... #else ... #endif preprocessor block.

The two code examples shown below are identical except the latter contains an #ifdef block illustrating the problem. The first example works correctly, but the second causes ILAsm.exe to fail.

In the second example, since the symbol __NOT_DEFINED__ is not defined, the first part of the #ifdef block should be ignored. The excluded part of the #ifdef block contains a string literal which happens to contain the block comment opening sequence /*.


app_ok.il :

.assembly extern mscorlib { .ver 4:0:0:0 auto }
.assembly app_ok { }
.module app_ok.exe
.class Program
{
    .method static void Main()
    {
        .entrypoint

        ldstr "foo /* bar"
        call void [mscorlib]System.Console::WriteLine(string)
        ret
    }
}

result - ok :

D:\>ilasm /nologo /quiet app_ok.il
D:\>app_ok.exe
foo /* bar
D:\>

app_fail.il :

.assembly extern mscorlib { .ver 4:0:0:0 auto }
.assembly app_fail { }
.module app_fail.exe
.class Program
{
    .method static void Main()
    {
        .entrypoint

#ifdef __NOT_DEFINED__
        ldstr "foo /* bar"
#else
        ldstr "etc. (doesn't matter)"
#endif
        call void [mscorlib]System.Console::WriteLine(string)
        ret
    }
}

result - failure :

D:\>ilasm /nologo /quiet app_fail.il
app_fail.il(23) : error : syntax error at token '' in:

***** FAILURE *****
D:\>

Error:
*ILAsm.exe incorrectly considers the text "/*", which appears within the string literal, to be an active block comment opening sequence.*

This only occurs within disabled preprocessor blocks. If the /* portion of the text is removed from the failure example, or if the (incorrectly-opened) block comment is "properly" closed via a */ sequence within the disabled section (this can be either within the string literal or subsequently), then the error goes away and the ILAsm.exe command succeeds.

Expected:
The happenstance contents of a string literal should not affect the success or failure of an ILAsm.exe invocation.

Notes:
Within preprocessor blocks that are enabled (active), string literals containing the comment sequence are handled correctly (not interpreted).

The same failure also occurs when the /* sequence appears outside of a string literal within a disabled preprocessor block, but this condition seems less obviously erroneous.

RussKeldorph commented 5 years ago

Thanks. Very nice analysis. I assume it's been that way forever, so not a high priority to fix, but we'll probably accept a PR for it.

glenn-slayden commented 5 years ago

ILAsm currently cannot be built by outsiders due to a dependency on a Microsoft-internal tool "VCBU Internal YACC," so this might make PR more difficult...

RussKeldorph commented 5 years ago

@glenn-slayden dotnet/runtime#4776 has some info on using btyacc to prototype a fix. If the PR is otherwise correct, we'd be happy to regenerate the yacc output using the internal tool if we have to. Obviously, we should figure out how to fix this yacc dependency, but it's not high on the priority list, unfortunately.

glenn-slayden commented 5 years ago

Fixing this may be subtle. The current ILAsm behavior is that block-style comments /* ... */ generally trump everything (except single line comment // ...). For the particular case here, /* #if... */ will always vacate the preprocessor directive (regardless of, i.e., sequential order-of-appearance between the two, meaning #if /* ... #endif */ leaves the preprocessor dangling). Though perhaps not entirely ideal, this greedy design is at least defensible.

But I digress. As described in the issue, what's unexpected is that string literals are not interpreted at all when /* ... is searching for its closing */, and that seems patently wrong.

The caution I wanted to add was that there's no requirement that any appearance of the double-quote character within a block-comment be delimiting some well-formed string literal. For example, the following currently works fine, a behavior which probably shouldn't be changed:

ldstr /* "incomplete string */ "foo bar"        // ok

So for fixing this issue: Searching for the closing */ within a (formative) block comment should only exclude (skip over) the contents of string literals which can be deemed "well-formed" according to some appropriately weak notion.

Other examples:

  1. /* and */ are evaluated in disabled #ifdef blocks (the 2nd #endif is required in the following):

    ldarg #ifdef EXCLUDE / #endif / garbage #endif the_arg // ok --> 'ldarg the_arg'

  2. #if... directives are never evaluated within /* ... */ (i.e., don't need any #endif in the following):

    ldarg / #ifdef EXCLUDE / the_arg // ok

  3. The problem described above does not occur with single-line comments. All of the following are ok:

    ldc.i4.0  // "foo /* foo"        // ok
    ldc.i4.0  // foo /* foo          // ok
    ldc.i4.0  // foo /* "foo         // ok
    
    #ifdef __EXCLUDE__
    ldc.i4.0  // "foo /* foo"        // ok (excluded)
    ldc.i4.0  // foo /* foo          // ok (excluded)
    ldc.i4.0  // foo /* "foo         // ok (excluded)
    #endif
  4. So single-line comment is the only way to hide /* (no block comment is opened in the following):

    ldarg the_arg // foo /* ok

  5. But, block comment can hide // (closing */ is effective in the following):

    ldarg /* foo // */
    the_arg                    // ok
    
    //  ...and not...
    
    ldarg /* foo // */         // NO
    */ t                       // excess block-comment close

Hints of "rock-paper-scissors?" Points (4) and (5) show that single-line comments have the same precedence as block comments. Only between these two is the relative priority established by forward processing according to sequential order of appearance in the source file.

glenn-slayden commented 5 years ago

Obviously, we should figure out how to fix this yacc dependency, but it's not high on the priority list, unfortunately.

@RussKeldorph Actually, I'd say that IL(D)Asm are both long overdue for full modernization / re-implementation from scratch in C#. Is that a PR you'd accept?

RussKeldorph commented 5 years ago

@glenn-slayden I think so. We've certainly talked about it for years. As long as it came with sufficient tests to verify near-conformance (possibly even bug-for-bug compat in some cases) with existing tools, we'd love to get rid of this unnecessarily native baggage. It would make it more convenient to integrate these tools into the SDK, and it would probably be easier for repos like Roslyn that use ILAsm as a validation tool.

FYI @jaredpar says https://github.com/mono/mono/tree/master/mcs/ilasm is pretty far from full conformance with ilasm corner cases, but it might be a place to start. If that became conformant, I assume it would become the "one true" ilasm and we could delete the source entirely from this repo.

/cc: @jkotas in case I'm too far into wishful thinking

glenn-slayden commented 5 years ago

After doing some preliminary investigations on this, it occurs to me that perhaps F# rather than C# might be a more elegant choice for implementing an (e.g.,) LALR parser such as (e.g.,) ILAsm, perhaps because the functional programming approach itself more closely models the declarative form of a BNF--or, more generally, any "formal"--language grammar?

As with the .NET ILAsm, the mono version that @jaredpar pointed to is fundamentally Yacc-based. I cleaned up that C#-based IL assembler source quite a bit for vs2017 and it builds, runs, and then builds PEs that work fine on .NET Framework 4.7.2, at least for the simple tests I've tried so far. The snapshot I linked consolidates all minimal dependencies; in particular it includes the complete/updated/working C sources and vcxproj for the jay.exe tool which generates the C# parser source file ILParser.jay.cs from the declarative grammar definition file ILParser.jay.

Although that code will be a useful reference, I'm not sure I will be able to fully use it as a starting point for what I'm working on, where I'm hoping to build and maintain a mutable Abtract Semantic Graph that's bidirectionally associated with the "live" IL source. That goal appears to need fundamentally different data structures and design, as compared to the traditional grammar approach which focuses on the (very different) one-time, one-way parsing scenario, and around which the Yacc tools and traditions have evolved.

briansull commented 4 years ago

Fixing this would require rewriting the assembler in a new language other that MS yacc/lex.

TIHan commented 1 year ago

Using a different YACC tool, as mentioned from https://github.com/dotnet/runtime/issues/4776 , we might be able to address this more easily depending on how flexible the tool is.