Simplify newline search

buybackoff / 1brc

1BRC in .NET among fastest on Linux

https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/

MIT License

437 stars 43 forks source link

Simplify newline search #1

Closed neon-sunset closed 8 months ago

neon-sunset commented 8 months ago

The newline scanner seems to include line endings and its code looks very similar to how CoreLib's line iterator is implemented. Because the only line endings that are encountered are either CRLF or LF, there is no need to scan for just CRs with .IndexOfAny which are not valid line endings on Windows or Unix systems.

This is a very quick change and I have not looked into other code in this solution, just something that stood out to me because I worked on a similar problem some time ago here: https://github.com/U8String/U8String/blob/main/Sources/Primitives/U8Enumerators.cs#L505-L536

buybackoff commented 8 months ago

This is not used on the hot path. And it will not work on Windows. See the comment in the Readme.

Java code produces \r\n so there is an implicit assumption about OS-native line endings.

The hot path just treats all chars <=13 as line endings...

neon-sunset commented 8 months ago

Current implementation matches either CR or LF and if CR is followed by LF it effectively advances by 1. The CR matching step is doing extra work the caller does not seem to care about - scanning for just LF will match on both CRLFs and LFs, and I don't see the loop accounting for the "holes" between segments (if it does - let me know!).

Either way, if this isn't on a hot path then it likely doesn't matter ¯_(ツ)_/¯

buybackoff commented 8 months ago

Either way, if this isn't on a hot path then it likely doesn't matter

Yeah, it did matter and I rewrote that part. I will (very likely) rewrite chunking logic and remove that.

But you are right that for boundaries we only need \n.

neon-sunset commented 8 months ago

Got it, thank you for your work. Hopefully it will bring more attention to our underappreciated language😄