Closed gfoidl closed 3 days ago
Does the ASP.NET code measurably regress if the newly-added ToUtf16 is used plus a call to IndexOf((byte)'\0') to validate there wasn't a null?
private static unsafe void GetHeaderName(ReadOnlySpan<byte> source, Span<char> buffer)
{
    OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out _);
    if (status != OperationStatus.Done || source.IndexOf((byte)'\0') >= 0)
    {
        KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
    }
}
@GrabYourPitchforks had some fairly strong opinions about special-casing '\0'.
In local micro-benchmarks, yes, mainly due to the two-pass (O(2n)) nature: the input is scanned once for the conversion and once more for the \0 check. And of course it depends on the input (length and position of the \0).
It would be more interesting to see real-usage benchmarks, like how that would impact Techempower, etc. But unfortunately I don't know how to run such benchmarks (at the moment).
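For what it's worth, the extra pass could in principle be folded into a single validation pre-scan that rejects non-ASCII bytes and \0 together, leaving the conversion guaranteed to succeed. A rough sketch (the helper name is made up; assumes .NET 8's MemoryExtensions.IndexOfAnyExceptInRange):

```csharp
using System;

class AsciiRangeCheck
{
    // Minimal sketch (not the Kestrel code): one pass that rejects both the
    // null byte and non-ASCII bytes, since the valid bytes are 0x01..0x7F.
    internal static bool IsAsciiNoNull(ReadOnlySpan<byte> source)
        => source.IndexOfAnyExceptInRange((byte)0x01, (byte)0x7F) < 0;

    static void Main()
    {
        Console.WriteLine(IsAsciiNoNull("Content-Type"u8));          // True
        Console.WriteLine(IsAsciiNoNull(new byte[] { 0x41, 0x00 })); // False (contains '\0')
        Console.WriteLine(IsAsciiNoNull(new byte[] { 0x41, 0x80 })); // False (non-ASCII)
    }
}
```

Whether this beats two separate vectorized passes would still need measuring, but it avoids the O(2n) shape entirely.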
fairly strong opinions about special-casing \0

I'm looking forward to reading them.
Tagging subscribers to this area: @dotnet/area-system-buffers See info in area-owners.md if you want to be subscribed.
| Author | gfoidl |
|---|---|
| Assignees | - |
| Labels | `api-suggestion`, `area-System.Buffers`, `untriaged` |
| Milestone | - |
Does the ASP.NET code measurably regress if the newly-added ToUtf16 is used plus a call to IndexOf((byte)'\0') to validate there wasn't a null?
private static unsafe void GetHeaderName(ReadOnlySpan<byte> source, Span<char> buffer)
{
    OperationStatus status = Ascii.ToUtf16(source, buffer, out _, out _);
    if (status != OperationStatus.Done || source.IndexOf((byte)'\0') >= 0)
    {
        KestrelBadHttpRequestException.Throw(RequestRejectionReason.InvalidCharactersInHeaderName);
    }
}
@GrabYourPitchforks had some fairly strong opinions about special-casing '\0'.
99% of the time there won't be any nulls, but headers average 800 bytes to 2 kB with cookies; so scanning the headers an additional time to check for nulls can be significant.
headers average 800 bytes to 2kB with cookies
Thanks for these numbers! Out of interest: how / where did you get these from?
Googled average header size.
As an anecdote: going to the Google homepage while logged in, my headers are 2158 bytes and the page makes 27 requests to that domain (and 26 to other domains); so in total 58 kB of headers for that page and one domain.
Wouldn't it be better to treat \0 as special at the highest level, instead of at the lowest level?

For example, I think that when parsing HTTP/1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16.
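This delimiter-first approach can be sketched with the existing IndexOfAny overloads (the names here are illustrative, not Kestrel's actual parser code):

```csharp
using System;

class DelimiterScan
{
    // Illustrative sketch of the suggestion above: locate the line delimiter and
    // detect '\0' in the same scan, so the bytes later handed to Ascii.ToUtf16
    // are already known to be null-free.
    internal static int FindDelimiter(ReadOnlySpan<byte> input, out bool sawNull)
    {
        int idx = input.IndexOfAny((byte)'\r', (byte)'\n', (byte)'\0');
        sawNull = idx >= 0 && input[idx] == 0;
        return idx;
    }

    static void Main()
    {
        int i = FindDelimiter("Host: example.com\r\n"u8, out bool sawNull);
        Console.WriteLine($"{i} {sawNull}"); // "17 False": '\r' found, no null

        i = FindDelimiter(new byte[] { (byte)'a', 0x00, (byte)'b' }, out sawNull);
        Console.WriteLine($"{i} {sawNull}"); // "1 True": '\0' found first
    }
}
```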
Assigning to @GrabYourPitchforks for now until the necessary input can be given.
I am strongly against this proposal. ASCII is defined as characters in the range 0x00 .. 0x7F, inclusive. Sometimes a protocol will exclude certain characters (0x00, or the entire control character range 0x00 .. 0x1F and 0x7F), but at that point you're making something tied to a particular protocol rather than something that is a general-purpose ASCII API. It's similar to the reason we don't support WTF-8 within any of our UTF-8 APIs: certain protocols may utilize it, but it doesn't belong in a general-purpose UTF-8 processing API.
Since this is protocol-specific for aspnet, I recommend the code remain in that project.
It would be more interesting to see real-usage benchmarks, like how that would impact Techempower, etc.
@sebastienros Is there a way to measure real-world impact for this API? I've laid out above my arguments against doing this - namely, that protocol-level concerns don't belong in a general-purpose API. But if there is strongly compelling evidence that this is a real perf bottleneck and the runtime layer is the only layer that can provide this functionality properly, that should be weighed in favor of this API, even against my concerns.
I will need to check what can be impacted in ASP.NET and see what benchmarks would exercise this code path. If someone knows which scenarios are useful here then I can start it.
Do we know why ASP.NET special-cases \0 here? What happens if we just stop doing that? If ASP.NET needs to do that, is it likely that others will similarly need to special-case certain values?
I'd really like to be able to encapsulate this in a core library provided helper, for ASP.NET to use and for others to use. Vectorizing such a thing is very non-trivial. Is there a shape of an API we could come up with that would enable efficiently doing this, e.g. a default overload that is for [0, 127] but another overload that lets you opt-out one or more values or ranges, or some such thing?
Note that one of the primary uses of IndexOfAnyValues is for use in protocols, where protocols need to search for or exempt certain things. Could/should we incorporate that somehow?
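As a sketch of that direction (hedged: IndexOfAnyValues shipped in .NET 8 under the name SearchValues, and the "valid set" below is illustrative, not an API that was ever added), a protocol could precompute its own allowed-byte set once and scan in a single vectorized pass:

```csharp
using System;
using System.Buffers;

class ProtocolByteSet
{
    // Illustrative allowed set: all of ASCII except the null byte (0x01..0x7F).
    // SearchValues precomputes an optimized representation once, up front.
    private static readonly SearchValues<byte> s_valid = SearchValues.Create(MakeValidBytes());

    private static byte[] MakeValidBytes()
    {
        var bytes = new byte[0x7F];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = (byte)(i + 1); // 0x01 through 0x7F
        return bytes;
    }

    internal static bool IsValid(ReadOnlySpan<byte> span)
        => span.IndexOfAnyExcept(s_valid) < 0;

    static void Main()
    {
        Console.WriteLine(IsValid("Accept-Encoding"u8));            // True
        Console.WriteLine(IsValid(new byte[] { (byte)'A', 0x00 })); // False
    }
}
```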
Do we know why ASP.NET special-cases \0 here?
Spec-wise it shouldn't; however, if it's a front-end server that passes requests to another server, and that server uses null-terminated strings, then the request can change in the second layer, accessing URLs which weren't mapped to the internet, which could be a security risk (along the lines of https://en.wikipedia.org/wiki/HTTP_request_smuggling, though different).
Note that one of the primary uses of IndexOfAnyValues is for use in protocols, where protocols need to search for or exempt certain things. Could/should we incorporate that somehow?
As @svick says, it just needs to check for 3 rather than 2, then throw if it's \0; there is already an API for it.
I hadn't seen @svick's comment:
For example, I think that when parsing HTTP 1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16.
Is that viable? Can all of the places that call into this shared routine be updated trivially to ensure the data passed in doesn't contain a \0? If so, let's do that, add the Ascii.ToUtf16 that's for the whole [0, 127] range, update ASP.NET to use that, and call it a good day.
@BrennanConroy, do you have a suggestion for how we could make forward progress on this?
For example, I think that when parsing HTTP 1 headers, you could look for \r, \n or \0 as the first step, instead of just \r and \n, and deal with \0 at that point. That would then mean you could safely use the current version of Ascii.ToUtf16 to convert the header bytes to UTF-16.
Is that viable? Can all of the places that call into this shared routine be updated trivially to ensure the data passed in doesn't contain a \0? If so, let's do that, add the Ascii.ToUtf16 that's for the whole [0, 127] range, update ASP.NET to use that, and call it a good day.
Looks like it would be "easy" to do this. If we updated https://github.com/dotnet/aspnetcore/blob/f62f12357c49c4f1cca502e8f4cf57353f0b320f/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#LL64C37-L64C37 to instead be
private static ReadOnlySpan<byte> Delimiters => new byte[] { ByteLF, 0 };

if (reader.TryReadToAny(out ReadOnlySpan<byte> requestLine, Delimiters, advancePastDelimiter: true))
{
    if (requestLine.Length == 0 || (reader.TryPeek(out var next) && next == 0))
    {
        RejectRequestLine(requestLine);
    }
    ParseRequestLine(handler, requestLine);
    return true;
}
This is where we start parsing the HTTP/1 request, so we're already going through the entire request line looking for \n; this just updates to TryReadToAny to search for \0 at the same time, which I hope is optimized for a single pass.
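A self-contained sketch of that single-pass behavior (simplified: real request-line parsing also validates '\r' placement and rejects the line on \0 rather than just splitting on it):

```csharp
using System;
using System.Buffers;
using System.Text;

class ReadToAnySketch
{
    // TryReadToAny scans for any of the delimiter bytes in one pass, so adding
    // '\0' to the delimiter set costs no extra traversal of the input.
    internal static bool TryReadLine(ref SequenceReader<byte> reader, out string? line)
    {
        ReadOnlySpan<byte> delimiters = stackalloc byte[] { (byte)'\n', 0x00 };
        if (reader.TryReadToAny(out ReadOnlySpan<byte> bytes, delimiters, advancePastDelimiter: true))
        {
            line = Encoding.ASCII.GetString(bytes).TrimEnd('\r');
            return true;
        }
        line = null;
        return false;
    }

    static void Main()
    {
        var data = new ReadOnlySequence<byte>("GET / HTTP/1.1\r\nHost: x\r\n"u8.ToArray());
        var reader = new SequenceReader<byte>(data);
        while (TryReadLine(ref reader, out string? line))
            Console.WriteLine(line); // "GET / HTTP/1.1", then "Host: x"
    }
}
```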
The concerns are:

- The \0 check is removed from the GetAsciiStringNonNullCharacters method, which means new callers could accidentally allow nulls
- Errors when a \0 is found are lacking detail (although a lot faster)
- There are other callers of GetAsciiStringNonNullCharacters, although it looked like they were supposed to already be doing null-character checking before calling this method

Looks like it would be "easy" to do this
Thanks, I sketched it out in https://github.com/dotnet/aspnetcore/compare/main...stephentoub:aspnetcore:asciitoutf16 (not exactly what you suggested, but similar).
A bunch of tests failed, and I haven't gone through to see which would be expected (the tests have internals access) and which would be real problems.
Any interest in picking it up and seeing how far we can run with it?
Yeah, I'll pick it up and see what the team thinks
It looks like Ascii.ToUtf16 is still slower than the custom code in aspnetcore.
https://github.com/dotnet/aspnetcore/issues/45962#issuecomment-1402255853
https://github.com/dotnet/runtime/issues/80245 is open, which might be indirectly tracking part of the work to improve the performance. And there is a recent PR that might improve performance: https://github.com/dotnet/runtime/pull/85266.
It looks like Ascii.ToUtf16 is still slower than the custom code in aspnetcore. https://github.com/dotnet/aspnetcore/issues/45962#issuecomment-1402255853
I'm not aware of any fundamental reason that should be the case. We should fix anything in the core routine that might be contributing those few additional cycles. I'd hope it's not just the difference between returning a bool and returning an OperationStatus.
cc: @adamsitnik, @GrabYourPitchforks
Grabbed the assembly of Ascii.ToUtf16 and Kestrel's TryGetAsciiString. One very obvious difference is that the core processing is not inlined in the Ascii.ToUtf16 case. That's the first thing I would try when comparing perf, but it always takes me a couple of hours to figure out how to get a custom runtime again, so I haven't tried yet.

But if someone wants to take a look at the assembly in the meantime and see if there is anything obviously worse in the Ascii.ToUtf16 case, please do!
Below are the results I'm getting on my machine. This seems very much within the range of noise.
@BrennanConroy Are you seeing different results than below?
BenchmarkDotNet v0.13.8, Windows 11 (10.0.22621.2283/22H2/2022Update/SunValley2) (Hyper-V)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.100-rc.1.23455.8
[Host] : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
Job=.NET 8.0 Runtime=.NET 8.0
| Method | StringLength | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|---|
| Ascii_ToUtf16 | 4 | 4.763 ns | 0.0336 ns | 0.0314 ns | 1.00 |
| StringUtilities_TryGetAscii | 4 | 4.123 ns | 0.0335 ns | 0.0313 ns | 0.87 |
| Ascii_ToUtf16 | 8 | 4.968 ns | 0.0201 ns | 0.0168 ns | 1.00 |
| StringUtilities_TryGetAscii | 8 | 4.714 ns | 0.0307 ns | 0.0287 ns | 0.95 |
| Ascii_ToUtf16 | 16 | 5.159 ns | 0.0405 ns | 0.0359 ns | 1.00 |
| StringUtilities_TryGetAscii | 16 | 3.625 ns | 0.0387 ns | 0.0362 ns | 0.70 |
| Ascii_ToUtf16 | 24 | 6.823 ns | 0.0681 ns | 0.0637 ns | 1.00 |
| StringUtilities_TryGetAscii | 24 | 5.212 ns | 0.0336 ns | 0.0298 ns | 0.76 |
| Ascii_ToUtf16 | 128 | 8.177 ns | 0.0519 ns | 0.0485 ns | 1.00 |
| StringUtilities_TryGetAscii | 128 | 9.791 ns | 0.0186 ns | 0.0155 ns | 1.20 |
| Ascii_ToUtf16 | 256 | 10.232 ns | 0.0517 ns | 0.0458 ns | 1.00 |
| StringUtilities_TryGetAscii | 256 | 11.014 ns | 0.0491 ns | 0.0459 ns | 1.08 |
| Ascii_ToUtf16 | 1024 | 28.757 ns | 0.1900 ns | 0.1777 ns | 1.00 |
| StringUtilities_TryGetAscii | 1024 | 33.437 ns | 0.2068 ns | 0.1727 ns | 1.16 |
@BrennanConroy, can you comment on the above?
Closing given https://github.com/dotnet/aspnetcore/pull/56578
Background and motivation
For ASP.NET Core's StringUtilities, the ASCII values in the range (0x00, 0x80) are considered valid, whilst Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid. In order to base StringUtilities on the Ascii APIs and avoid custom vectorized code in ASP.NET Core internals, \0 should be allowed to be treated as invalid. See https://github.com/dotnet/aspnetcore/issues/45962 for further info.

API Proposal
The new ASCII APIs will first be added in .NET 8, so an optional argument could be added without a breaking change.
API Usage
Alternative Designs
No response
Risks
The value for treatNullAsInvalid will be given as a constant, so the JIT should be able to dead-code eliminate any code needed for the "default case" (the whole ASCII range incl. \0), so no perf regression should be expected.

Besides treating \0 as a special value which is optionally treated as invalid, I don't expect any other value to be considered special enough for optional exclusion.
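To make the shape concrete, here is a hypothetical sketch of the proposed overload. It never shipped; the parameter name treatNullAsInvalid comes from this proposal, and the actual .NET 8 API is Ascii.ToUtf16(source, destination, out charsWritten). The pre-scan below is deliberately naive (the real implementation would fold the check into the conversion loop); the point is only the API shape and the constant-folding argument:

```csharp
using System;
using System.Buffers;
using System.Text;

class ProposalSketch
{
    // Hypothetical shape of the proposed overload. When treatNullAsInvalid is
    // passed as a compile-time constant and the call is inlined, the JIT can
    // eliminate the extra branch in the default case, which is the
    // "no perf regression expected" argument above.
    internal static OperationStatus ToUtf16(
        ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten,
        bool treatNullAsInvalid = false)
    {
        if (treatNullAsInvalid && source.IndexOf((byte)0x00) >= 0)
        {
            charsWritten = 0;
            return OperationStatus.InvalidData;
        }
        return Ascii.ToUtf16(source, destination, out charsWritten);
    }

    static void Main()
    {
        Span<char> buffer = new char[16];
        Console.WriteLine(ToUtf16("abc"u8, buffer, out _, treatNullAsInvalid: true));                        // Done
        Console.WriteLine(ToUtf16(new byte[] { (byte)'a', 0x00 }, buffer, out _, treatNullAsInvalid: true)); // InvalidData
    }
}
```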