dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
18.96k stars 4.03k forks source link

(Proposal) Make Multiple PeekAheadChar( ) Comparisions simpler #954

Closed AdamSpeight2008 closed 9 years ago

AdamSpeight2008 commented 9 years ago

In the scanner there are multiple occasions where we compare the results of multipe calls to PeekAheadChar against the characters of a desired sequence of characters. In essences do the following character match this string. Example

If CanGetCharAtOffset(length + 8) AndAlso 
    PeekAheadChar(length + 2) = "C"c AndAlso
    PeekAheadChar(length + 3) = "D"c AndAlso 
    PeekAheadChar(length + 4) = "A"c AndAlso 
    PeekAheadChar(length + 5) = "T"c AndAlso 
    PeekAheadChar(length + 6) = "A"c AndAlso 
    PeekAheadChar(length + 7) = "["c Then 
        Return XmlMakeBeginCDataToken(precedingTrivia, scanTrailingTrivia) 
End If 

It would make the code much simpler if this was abstracted in to a function. I propose the functon name of NextAre as it scans good with current conditionals (of If / When)

If NextAre(8,2,"CDATA[") Then Return XmlMakeBeginCDataToken(precedingTrivia, scanTrailingTrivia)
Case "["c When NextAre(8,2,"CDATA[")  :  Return XmlMakeBeginCDataToken(precedingTrivia, scanTrailingTrivia)

Implementation

Function NextAre( at As Integer , text As String ) As Boolean
  Return NextAre( at,1,text)
End Function

Function NextAre( at As Integer, start As Integer, text As String ) As Boolean
  If text Is Nothing Then Return False
  Dim n = word.length
  If Not CanGetCharAtOffset( at+n ) Then Return False
  Dim startAt = at + start
  Dim i=0
  While i < n
    If PeekAheadChar( startAt+i ) <> text( i ) Then Return False
    i+=1
  End While
  Return True
End Function

Note: start + text.Length = at seems to hold true, which if that is the case then the parameters of the function can be reduced to two. ( start As Integer, text As String ) As Boolean So is there any existing use case (in roslyn) where this isn't the case.

AdamSpeight2008 commented 9 years ago

Implementation 2 This implementation uses recursion.

Function NextAre( at As Integer , text As String ) As Boolean
  Return NextAre(at,1,text)
End Function

Function NextAre( at As Integer, start As Integer, text As String ) As Boolean
  If ( text Is Nothing ) OrElse ( Not CanGetCharAt(at+word.Length) ) Then Return False
  Return _NextAre( at + start , 0, text )
End Function

Function _NextAre( i As Integer, j As Integer, text As String) As Boolean
  Return ( j = text.Length ) OrElse ( PeekAheadChar( i ) = Text( j ) AndAlso _NexAret( i+1, j + 1, text ) )
End Function
gafter commented 9 years ago

@AdamSpeight2008 What is the performance impact of your suggested change?

AdamSpeight2008 commented 9 years ago

@mikedn (re: #943) Large look ahead cases like the one you link to would probably be better served by something like PeekAheadString(3, "CDATA[") , a single method call instead of 7.

mikedn commented 9 years ago

Are you quoting me as an answer to the performance impact question? How so, I didn't make any definite statements about performance impact, I said "probably".

Besides, in my comment I specifically mentioned the large number of calls required by that code. Your proposed implementation doesn't do anything to address that, you still have 7 calls and on top of those you have additional code that wasn't there in the original implementation. As for the recursive version, no comment...

I'm getting the impression that you're trying to contribute something to this repository at any cost, including wasting people's time.

AdamSpeight2008 commented 9 years ago

@mikedn

If Array had a method that allowed you compare a sub-section of an array with another array.

AreNext( text : String ) : bool
{
  return Array.Compare( sourceArray , offset, text);
}

Then that 7+ calls could a lot smaller.


That recursive function you just dismiss of hand. It is tail-recursive so can easily be expanded (by the compiler) then simplified. Since it has relatively simple known bounds.

private _NextAre( x : Int, i : int, j : int, s : string ) : bool
{
  return (x==j) || Peek[ i ] == s[j] && _NextAre(x,i+1,j+1,s);
}

Expanded (parenthesis likely unbalanced.)

private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
  return (8==2) || ( Peek[2] == s[0] && 
       ( (8==3) || ( Peek[3] == s[1] && 
       ( (8==4) || ( Peek[4] == s[2] &&
       ( (8==5) || ( Peek[5] == s[3] &&
       ( (8==6) || ( Peek[6] == s[4] &&
       ( (8==7) || ( Peek[7] == s[5] &&
       ( (8==8) )
       )) )) )) )) ));               
}

Simplify

private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
  return false || ( Peek[2] == s[0] && 
       ( false || ( Peek[3] == s[1] && 
       ( false || ( Peek[4] == s[2] &&
       ( false || ( Peek[5] == s[3] &&
       ( false || ( Peek[6] == s[4] &&
       ( false || ( Peek[7] == s[5] &&
       ( true )
       )) )) )) )) )) ));               
}

Simplify more.

private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
  return  Peek[2] == s[0] && Peek[3] == s[1] && 
          Peek[4] == s[2] && Peek[5] == s[3] &&
          Peek[6] == s[4] && Peek[7] == s[5] ;               
}

Another potential simplification, in which the compiler has decide to create a specialised function.

<CompilerGenerated()>
private _NextAre_6_() : bool
{
  return  Peek[2] == "C"c && Peek[3] == "D"c && 
          Peek[4] == "A"c && Peek[5] == "T"c &&
          Peek[6] == "A"c && Peek[7] == "["c ;               
}

The compiler then could also then decide to inline the contents of the function.

You still get the benefits (plus more) but compiler does the donkey work and write the implementation.


I'm getting the impression that you're trying to contribute something to this repository at any cost, including wasting people's time.

Yes, I would like to contribute but I haven't had the benefits of working with the Roslyn source and design notes since inception.To me I see potential "improvements" whether those "improvements" are applicable to the design goals of the Roslyn project, is a different matter and for the team to decided. The Roslyn Team and I may have different goals.

Also I don't think it is your place to destroy someone spirit of exploration and discovery, simply because you think you know better. I'm going to break the build yes, but at least I'll learn stuff, and I may find an "improvement" that is applicable. I've already found a one. Case "a"c To "z"c should be be expanded out, so the compiler builds a select rather than if

pharring commented 9 years ago

@AdamSpeight2008 Thanks for the suggestion. We welcome suggestions from anyone whether experienced or novice.

I like the NextAre/PeekAheadString idea and I agree that it would simplify plenty of callsites in the VB scanner. Performance aside, it's easier to read, easier to comprehend and easier to check for correctness, so it scores well for code maintenance. Performance-wise, in most places it's not going to be on critical path since the scanner goes one char at a time and only does the 'peek ahead string' once a common prefix has been determined. However, there's probably a second-order benefit: By replacing the sequence of char-by-char comparisons with a single method call ('outlining'), the IL and, indirectly, the generated code has been simplified and shortened, making the critical path code shorter and that means fewer L1 cache misses. This will be hard to measure except in dedicated micro-benchmarks for the VB parser (we have one, but I doubt it's sensitive enough for this change.)

I don't like the recursive implementation. Roslyn, at present, never emits the tail recursion IL instruction, and I don't think the JIT will make the transform either. I don't think using recursion adds anything to readability either.

AdamSpeight2008 commented 9 years ago

@pharring Does the VB compiler have to be strictly 100% VB? If not we could use C#'s unsafe

friend bool SubstringMatch ( Char[] cs , String s, Int32 i )
{
  if (cs == null || s == null) throw new ArgumentNullException();
  if (i < 0 || i > cs.Length ) throw new ArgumentException("i");
  if (cs.Length - i != s.Length) return false;
  unsafe
  {
    fixed (char* ps_ = s, pcs_ = cs) {
    {
      char* pcs = pcs_ + i;
      char* end = pcs + s.Length;
      char* ps = ps_;
      while (pcs != end) 
        if (*pcs++ != *ps++) return false;    
    }
  }
  return true;
}

For performance it may require additional support from the underlying platform. Eg some native implementation.


On the tail-recursive functoin expansion, it's the compiler itself the does the expanding not the JIT. In C++ this would have been implemented via a template, but we don't have templates.

pharring commented 9 years ago

The VB compiler can, and does, call into helper methods in the core compiler assembly. Frankly, though, I wouldn't reach for unsafe for this routine unless I had performance data to show it was a bottleneck. This inner loop is likely to be just as fast for short strings:

    For Each ch As Char In s
        If ch <> cs(i) Then Return False
        i = i + 1
    Next

    Return True
AdamSpeight2008 commented 9 years ago

I was thinking about the following implementation.

Function AreNext( offset As Integer, chars As String ) As Boolean
  If chars Is Nothing Then Return False
  Dim n = chars.Length
  If Not Not CanGetCharAt( offset + n ) Then Return False
  Dim i = 0
  While (i<n) AndAlso ( Peek(offset+i)) == chars(i) )
    i += 1
  End While
  Return (i == n)
End Function

A possible concern is excessive object creation (eg the string) triggering a GC,

pharring commented 9 years ago

A few 'nits', then a suggestion (I hope you don't mind that this turned out to be long. I used it as an opportunity to 'teach')

  1. I'd probably replace the If chars is Nothing check with Debug.Assert(Not String.IsNullOrEmpty(chars)). This is a private routine and no caller should be passing Nothing or an empty string.
  2. There's an extra Not on line 4.
  3. Nit: You have 'double equals' on line 6. VB's equality comparison is a single equals.
  4. I presume you meant PeekAheadChar instead of Peek

For extra points: Observe: If you do early return from within the loop body then you can avoid checking whether the loop ran to completion at the end. i.e. Instead of doing the AndAlso terminatingExpression in the While clause, invert the terminating expression and put it in the loop body with a Return False. Then, it should be obvious that the loop becomes a For ... Next and you don't need the Return (i==n) at the end.

Function AreNext(offset As Integer, chars As String) As Boolean
  Debug.Assert(Not String.IsNullOrEmpty(chars))
  Dim n = chars.Length
  If Not CanGetCharAtOffset(offset + n) Then Return False
  For i = 0 To n - 1
      If PeekAheadChar(offset+i) <> chars(i) Then Return False
  Next
  Return True
End Function

Finally, observe the offset + i calculation in the loop body. We can do better by 'accumulating' offset as we go. We no longer need the local loop variable because we can use For Each. We can then also eliminate the local n since it's used exactly once. i.e.

Function AreNext(offset As Integer, chars As String) As Boolean
  Debug.Assert(Not String.IsNullOrEmpty(chars))
  If Not CanGetCharAtOffset(offset + chars.Length) Then Return False
  For Each ch in chars
      If PeekAheadChar(offset) <> ch Then Return False
      offset += 1
  Next
  Return True
End Function

That's about as far as we can go without digging into the implementations of CanGetCharAtOffset and PeekAheadChar and hoisting its implementation here which will open up further optimizations.

Now, performance engineers around the world may cry: "premature optimization blah blah...". Indeed, I didn't measure anything before refactoring the code. Shame on me. Experience tells me that reducing "compare and branch" operations will help execution time (branch mispredictions, pipeline stalls). Eliminating the locals, reduces cognitive load on the reader and frees up register slots for the JIT compiler (on the other hand, I will sometimes introduce locals for 'common subexpression elimination' i.e. if a property is accessed several times without chance of mutation). Perhaps the only 'tricky' thing is mutating one of the function's arguments - that replaced one addition with another and eliminated a loop variable which, no doubt, is actually just 'hidden' within the string's iterator. I claim that the refactoring didn't do anything "unnatural" to the code and it's just as easy to read and maintain as the original. For me, these kinds of refactorings are now just a habit (along with reducing allocations).

Don't worry about the string allocations. These are going to be compile time constant (string tokens in the assembly, created with the ldstr IL instruction) and there's a fast path for loading and interning them.

AdamSpeight2008 commented 9 years ago

@pharring I don't mind at all, the rationale beyond one's design can be a such a good learning tool. Those 'nits' or more like 'tics' I easily miss. When I'm just ruffing out the code I tend to use a mixture of programming languages. Renaming PeekAheadChar to just Peek does makes the code much easier to read.

This is implementation has an IL size of 55.

Function NextAre(offset As Integer, chars As String) As Boolean
  Debug.Assert(Not String.IsNullOrEmpty(chars))
  Dim  n = chars.Length
  If Not CanGetCharAtOffset(offset+n) Then Return False
  Dim i = -1
  Do 
    i+=1
  Loop While (i<n) AndAlso (PeekAheadChar(offset+i) = Chars(i))
  Return i=n
End Function

IL size 49

Function AreNext( offset As Integer, chars As String ) As Boolean
  Debug.Assert(Not String.IsNullOrEmpty(chars))
  Dim n = chars.Length
  Dim i = -1      
  If CanGetCharAtOffset( offset + n ) Then 
  Do
    i+=1
  Loop While i<n AndAlso  PeekAheadChar(offset+i) = chars(i)
  end if
  Return i = n
End Function
AdamSpeight2008 commented 9 years ago

Building and Unit Testing an implementation.

pharring commented 9 years ago

Sorry I didn't comment before you did the pull request, but I don't think minimal IL size is a productive endeavor (except in extreme cases). In order to do that, you introduced an additional comparison at the end and the peculiar -1 initialization and a Do..While. IMO, it's harder for a casual reader to check for correctness.

I'd recommend the For... Next version I posted above.