Closed AdamSpeight2008 closed 9 years ago
Implementation 2 This implementation uses recursion.
Function NextAre( at As Integer , text As String ) As Boolean
Return NextAre(at,1,text)
End Function
Function NextAre( at As Integer, start As Integer, text As String ) As Boolean
If ( text Is Nothing ) OrElse ( Not CanGetCharAt(at+word.Length) ) Then Return False
Return _NextAre( at + start , 0, text )
End Function
Function _NextAre( i As Integer, j As Integer, text As String) As Boolean
Return ( j = text.Length ) OrElse ( PeekAheadChar( i ) = Text( j ) AndAlso _NexAret( i+1, j + 1, text ) )
End Function
@AdamSpeight2008 What is the performance impact of your suggested change?
@mikedn (re: #943) Large look ahead cases like the one you link to would probably be better served by something like PeekAheadString(3, "CDATA[") , a single method call instead of 7.
Are you quoting me as an answer to the performance impact question? How so, I didn't make any definite statements about performance impact, I said "probably".
Besides, in my comment I specifically mentioned the large number of calls required by that code. Your proposed implementation doesn't do anything to address that, you still have 7 calls and on top of those you have additional code that wasn't there in the original implementation. As for the recursive version, no comment...
I'm getting the impression that you're trying to contribute something to this repository at any cost, including wasting people's time.
@mikedn
If Array
had a method that allowed you compare a sub-section of an array with another array.
AreNext( text : String ) : bool
{
return Array.Compare( sourceArray , offset, text);
}
Then that 7+ calls could a lot smaller.
That recursive function you just dismiss of hand. It is tail-recursive so can easily be expanded (by the compiler) then simplified. Since it has relatively simple known bounds.
private _NextAre( x : Int, i : int, j : int, s : string ) : bool
{
return (x==j) || Peek[ i ] == s[j] && _NextAre(x,i+1,j+1,s);
}
Expanded (parenthesis likely unbalanced.)
private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
return (8==2) || ( Peek[2] == s[0] &&
( (8==3) || ( Peek[3] == s[1] &&
( (8==4) || ( Peek[4] == s[2] &&
( (8==5) || ( Peek[5] == s[3] &&
( (8==6) || ( Peek[6] == s[4] &&
( (8==7) || ( Peek[7] == s[5] &&
( (8==8) )
)) )) )) )) ));
}
Simplify
private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
return false || ( Peek[2] == s[0] &&
( false || ( Peek[3] == s[1] &&
( false || ( Peek[4] == s[2] &&
( false || ( Peek[5] == s[3] &&
( false || ( Peek[6] == s[4] &&
( false || ( Peek[7] == s[5] &&
( true )
)) )) )) )) )) ));
}
Simplify more.
private _NextAre( x:=8, i:=2 , j:= 0, s:="CDATA[") : bool
{
return Peek[2] == s[0] && Peek[3] == s[1] &&
Peek[4] == s[2] && Peek[5] == s[3] &&
Peek[6] == s[4] && Peek[7] == s[5] ;
}
Another potential simplification, in which the compiler has decide to create a specialised function.
<CompilerGenerated()>
private _NextAre_6_() : bool
{
return Peek[2] == "C"c && Peek[3] == "D"c &&
Peek[4] == "A"c && Peek[5] == "T"c &&
Peek[6] == "A"c && Peek[7] == "["c ;
}
The compiler then could also then decide to inline the contents of the function.
You still get the benefits (plus more) but compiler does the donkey work and write the implementation.
I'm getting the impression that you're trying to contribute something to this repository at any cost, including wasting people's time.
Yes, I would like to contribute but I haven't had the benefits of working with the Roslyn source and design notes since inception.To me I see potential "improvements" whether those "improvements" are applicable to the design goals of the Roslyn project, is a different matter and for the team to decided. The Roslyn Team and I may have different goals.
Also I don't think it is your place to destroy someone spirit of exploration and discovery, simply because you think you know better. I'm going to break the build yes, but at least I'll learn stuff, and I may find an "improvement" that is applicable. I've already found a one.
Case "a"c To "z"c
should be be expanded out, so the compiler builds a select
rather than if
@AdamSpeight2008 Thanks for the suggestion. We welcome suggestions from anyone whether experienced or novice.
I like the NextAre/PeekAheadString idea and I agree that it would simplify plenty of callsites in the VB scanner. Performance aside, it's easier to read, easier to comprehend and easier to check for correctness, so it scores well for code maintenance. Performance-wise, in most places it's not going to be on critical path since the scanner goes one char at a time and only does the 'peek ahead string' once a common prefix has been determined. However, there's probably a second-order benefit: By replacing the sequence of char-by-char comparisons with a single method call ('outlining'), the IL and, indirectly, the generated code has been simplified and shortened, making the critical path code shorter and that means fewer L1 cache misses. This will be hard to measure except in dedicated micro-benchmarks for the VB parser (we have one, but I doubt it's sensitive enough for this change.)
I don't like the recursive implementation. Roslyn, at present, never emits the tail recursion IL instruction, and I don't think the JIT will make the transform either. I don't think using recursion adds anything to readability either.
@pharring Does the VB compiler have to be strictly 100% VB? If not we could use C#'s unsafe
friend bool SubstringMatch ( Char[] cs , String s, Int32 i )
{
if (cs == null || s == null) throw new ArgumentNullException();
if (i < 0 || i > cs.Length ) throw new ArgumentException("i");
if (cs.Length - i != s.Length) return false;
unsafe
{
fixed (char* ps_ = s, pcs_ = cs) {
{
char* pcs = pcs_ + i;
char* end = pcs + s.Length;
char* ps = ps_;
while (pcs != end)
if (*pcs++ != *ps++) return false;
}
}
return true;
}
For performance it may require additional support from the underlying platform. Eg some native implementation.
On the tail-recursive functoin expansion, it's the compiler itself the does the expanding not the JIT. In C++ this would have been implemented via a template, but we don't have templates.
The VB compiler can, and does, call into helper methods in the core compiler assembly. Frankly, though, I wouldn't reach for unsafe for this routine unless I had performance data to show it was a bottleneck. This inner loop is likely to be just as fast for short strings:
For Each ch As Char In s
If ch <> cs(i) Then Return False
i = i + 1
Next
Return True
I was thinking about the following implementation.
Function AreNext( offset As Integer, chars As String ) As Boolean
If chars Is Nothing Then Return False
Dim n = chars.Length
If Not Not CanGetCharAt( offset + n ) Then Return False
Dim i = 0
While (i<n) AndAlso ( Peek(offset+i)) == chars(i) )
i += 1
End While
Return (i == n)
End Function
A possible concern is excessive object creation (eg the string) triggering a GC,
A few 'nits', then a suggestion (I hope you don't mind that this turned out to be long. I used it as an opportunity to 'teach')
If chars is Nothing
check with Debug.Assert(Not String.IsNullOrEmpty(chars))
. This is a private routine and no caller should be passing Nothing or an empty string.Not
on line 4.PeekAheadChar
instead of Peek
For extra points: Observe:
If you do early return from within the loop body then you can avoid checking whether the loop ran to completion at the end.
i.e. Instead of doing the AndAlso terminatingExpression
in the While
clause, invert the terminating expression and put it in the loop body with a Return False
.
Then, it should be obvious that the loop becomes a For ... Next
and you don't need the Return (i==n)
at the end.
Function AreNext(offset As Integer, chars As String) As Boolean
Debug.Assert(Not String.IsNullOrEmpty(chars))
Dim n = chars.Length
If Not CanGetCharAtOffset(offset + n) Then Return False
For i = 0 To n - 1
If PeekAheadChar(offset+i) <> chars(i) Then Return False
Next
Return True
End Function
Finally, observe the offset + i
calculation in the loop body. We can do better by 'accumulating' offset as we go. We no longer need the local loop variable because we can use For Each
. We can then also eliminate the local n
since it's used exactly once. i.e.
Function AreNext(offset As Integer, chars As String) As Boolean
Debug.Assert(Not String.IsNullOrEmpty(chars))
If Not CanGetCharAtOffset(offset + chars.Length) Then Return False
For Each ch in chars
If PeekAheadChar(offset) <> ch Then Return False
offset += 1
Next
Return True
End Function
That's about as far as we can go without digging into the implementations of CanGetCharAtOffset
and PeekAheadChar
and hoisting its implementation here which will open up further optimizations.
Now, performance engineers around the world may cry: "premature optimization blah blah...". Indeed, I didn't measure anything before refactoring the code. Shame on me. Experience tells me that reducing "compare and branch" operations will help execution time (branch mispredictions, pipeline stalls). Eliminating the locals, reduces cognitive load on the reader and frees up register slots for the JIT compiler (on the other hand, I will sometimes introduce locals for 'common subexpression elimination' i.e. if a property is accessed several times without chance of mutation). Perhaps the only 'tricky' thing is mutating one of the function's arguments - that replaced one addition with another and eliminated a loop variable which, no doubt, is actually just 'hidden' within the string's iterator. I claim that the refactoring didn't do anything "unnatural" to the code and it's just as easy to read and maintain as the original. For me, these kinds of refactorings are now just a habit (along with reducing allocations).
Don't worry about the string allocations. These are going to be compile time constant (string tokens in the assembly, created with the ldstr
IL instruction) and there's a fast path for loading and interning them.
@pharring
I don't mind at all, the rationale beyond one's design can be a such a good learning tool.
Those 'nits' or more like 'tics' I easily miss. When I'm just ruffing out the code I tend to use a mixture of programming languages.
Renaming PeekAheadChar
to just Peek
does makes the code much easier to read.
This is implementation has an IL size of 55.
Function NextAre(offset As Integer, chars As String) As Boolean
Debug.Assert(Not String.IsNullOrEmpty(chars))
Dim n = chars.Length
If Not CanGetCharAtOffset(offset+n) Then Return False
Dim i = -1
Do
i+=1
Loop While (i<n) AndAlso (PeekAheadChar(offset+i) = Chars(i))
Return i=n
End Function
IL size 49
Function AreNext( offset As Integer, chars As String ) As Boolean
Debug.Assert(Not String.IsNullOrEmpty(chars))
Dim n = chars.Length
Dim i = -1
If CanGetCharAtOffset( offset + n ) Then
Do
i+=1
Loop While i<n AndAlso PeekAheadChar(offset+i) = chars(i)
end if
Return i = n
End Function
Building and Unit Testing an implementation.
Sorry I didn't comment before you did the pull request, but I don't think minimal IL size is a productive endeavor (except in extreme cases). In order to do that, you introduced an additional comparison at the end and the peculiar -1 initialization and a Do..While. IMO, it's harder for a casual reader to check for correctness.
I'd recommend the For... Next version I posted above.
In the scanner there are multiple occasions where we compare the results of multipe calls to
PeekAheadChar
against the characters of a desired sequence of characters. In essences do the following character match this string. ExampleIt would make the code much simpler if this was abstracted in to a function. I propose the functon name of NextAre as it scans good with current conditionals (of If / When)
Implementation
Note:
start + text.Length = at
seems to hold true, which if that is the case then the parameters of the function can be reduced to two.( start As Integer, text As String ) As Boolean
So is there any existing use case (in roslyn) where this isn't the case.