dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

The fastest way to html encode string to Utf8 string #35004

Closed adamsitnik closed 1 year ago

adamsitnik commented 4 years ago

While profiling the Fortunes TechEmpower benchmark I found that we spend around 7% of the total CPU time on HTML-encoding the strings returned from the database and converting them to UTF-8.

[profiler screenshots]

Here is the code that does that as of today:

https://github.com/aspnet/Benchmarks/blob/b7d05a5b17dd37354b62b2ecc3bcb942eaab4354/src/BenchmarksApps/Kestrel/PlatformBenchmarks/BenchmarkApplication.Fortunes.cs#L51

https://github.com/aspnet/Benchmarks/blob/b7d05a5b17dd37354b62b2ecc3bcb942eaab4354/src/BenchmarksApps/Kestrel/PlatformBenchmarks/BufferExtensions.cs#L23-L30

My initial thought was: why do we allocate a new string (encoder.Encode), then check the number of bytes (Encoding.UTF8.GetByteCount) and convert to UTF-8 (Encoding.UTF8.GetBytes), when we could take advantage of the available memory buffer and do it in place?

So I changed the implementation to work in place by calling Encoding.UTF8.GetBytes first and then encoder.EncodeUtf8, but performance regressed (from 290k RPS to 280k RPS).

I am attaching the code of a microbenchmark that uses input copied from the TechEmpower benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System;
using System.Text;
using System.Text.Encodings.Web;
using System.Text.Unicode;

namespace Repro
{
    class Program
    {
        static void Main(string[] args) => BenchmarkRunner.Run<Benchmarks>(DefaultConfig.Instance.AddJob(Job.ShortRun));
    }

    public class Benchmarks
    {
        private readonly HtmlEncoder _encoder = CreateHtmlEncoder();
        private readonly string[] _inputs = new string[]
        {
            "<script>alert(\"This should not be displayed in a browser alert box.\");</script>",
            "A bad random number generator: 1, 1, 1, 1, 1, 4.33e+67, 1, 1, 1",
            "A computer program does what you tell it to do, not what you want it to do.",
            "A computer scientist is someone who fixes things that aren't broken.",
            "A list is only as strong as its weakest link. — Donald Knuth",
            "Additional fortune added at request time.",
            "After enough decimal places, nobody gives a damn.",
            "Any program that runs right is obsolete.",
            "Computers make very fast, very accurate mistakes.",
            "Emacs is a nice operating system, but I prefer UNIX. — Tom Christaensen",
            "Feature: A bug with seniority.",
            "fortune: No such file or directory",
            "フレームワークのベンチマーク"
        };
        private readonly byte[] _bytes = new byte[1024];

        [Benchmark(Baseline = true)]
        public int Current()
        {
            int sum = 0;
            HtmlEncoder encoder = _encoder;
            byte[] bytes = _bytes;
            foreach (string input in _inputs)
            {
                string encoded = encoder.Encode(input);
                int byteCount = Encoding.UTF8.GetByteCount(encoded);
                sum += Encoding.UTF8.GetBytes(encoded.AsSpan(), new Span<byte>(bytes, 0, byteCount));
            }
            return sum;
        }

        [Benchmark]
        public int InPlace()
        {
            int sum = 0;
            HtmlEncoder encoder = _encoder;
            byte[] bytes = _bytes;
            foreach (string input in _inputs)
            {
                Span<byte> secondHalf = new Span<byte>(bytes, bytes.Length / 2, bytes.Length / 2);
                int bytesCount = Encoding.UTF8.GetBytes(input.AsSpan(), secondHalf);
                encoder.EncodeUtf8(secondHalf.Slice(0, bytesCount), bytes, out _, out int bytesWritten, true);
                sum += bytesWritten;
            }
            return sum;
        }

        private static HtmlEncoder CreateHtmlEncoder()
        {
            var settings = new TextEncoderSettings(UnicodeRanges.BasicLatin, UnicodeRanges.Katakana, UnicodeRanges.Hiragana);
            settings.AllowCharacter('\u2014');  // allow EM DASH through
            return HtmlEncoder.Create(settings);
        }
    }
}
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.720 (1909/November2018Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.2.20119.13
  [Host]   : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT
  ShortRun : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT
| Method  |     Mean |     Error |    StdDev | Ratio |
|-------- |---------:|----------:|----------:|------:|
| Current | 1.773 us | 0.0804 us | 0.0044 us |  1.00 |
| InPlace | 2.152 us | 0.0778 us | 0.0043 us |  1.21 |

@GrabYourPitchforks do you have any ideas on how the current implementation could be improved?

/cc @roji @benaadams

ghost commented 4 years ago

Tagging subscribers to this area: @tarekgh. Notify danmosemsft if you want to be subscribed.

benaadams commented 4 years ago

It isn't vectorized, but the Mono version has a simpler UTF-8 HTML encoding approach that might be worth looking at: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/CSharp/aspnetcore-mono/PlatformBenchmarks/Utilities/BufferExtensionsText.cs#L14-L243

roji commented 4 years ago

Note that this is using the pre-multiplexed version of Npgsql, so the actual time spent may be more than 7%.

Another note is that strings coming back from PostgreSQL are (in almost all cases) already in UTF-8, so this is a classic case of us decoding/re-encoding UTF-8 for nothing...

GrabYourPitchforks commented 4 years ago

If you're building up a response from multiple constituent strings, the fastest way to do this would be to leave everything as chars (not bytes) while you're performing all of the intermediate work. Then once the final response text is built, perform a single UTF8 conversion from chars to bytes, then send those bytes across the wire.
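
A minimal sketch of that shape (not from the thread), assuming a reusable per-request char buffer and UTF-8 output buffer; the helper name and the omission of OperationStatus/size checks are illustrative, not part of any framework API:

using System;
using System.Text;
using System.Text.Encodings.Web;

static class SingleConversionSketch
{
    // HTML-encode each fragment into a shared char buffer, then transcode the
    // whole response body from UTF-16 to UTF-8 exactly once at the end.
    public static int EncodeAll(HtmlEncoder encoder, string[] fragments,
                                char[] charBuffer, byte[] utf8Buffer)
    {
        int charsUsed = 0;
        foreach (string fragment in fragments)
        {
            // Assumes charBuffer is large enough; a real implementation would
            // check the returned OperationStatus and grow or flush the buffer.
            encoder.Encode(fragment.AsSpan(), charBuffer.AsSpan(charsUsed),
                           out _, out int charsWritten);
            charsUsed += charsWritten;
        }

        // The single chars -> bytes conversion, right before writing to the wire.
        return Encoding.UTF8.GetBytes(charBuffer.AsSpan(0, charsUsed), utf8Buffer);
    }
}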

GrabYourPitchforks commented 4 years ago

@benaadams the code you linked to is good for experimentation but isn't really something we can ship. It takes many shortcuts that aren't appropriate for a production web server.

GrabYourPitchforks commented 4 years ago

Let me expand on my above answers a little bit. If we're building up a response payload from UTF-8 components, the proposed Utf8String class (see https://github.com/dotnet/corefxlab/issues/2350) could also help with this.

Generally speaking, when transcoding data (UTF-8 to UTF-16 or vice versa), the transcoding APIs will give the best performance when they're called over larger chunks of data. That is: calling the APIs 100 times over a 64-byte buffer will be slower than calling the API once with a 6,400-byte buffer. So the goal should be to avoid intermediate transcodings as much as possible, preferring one final transcoding at the very end of the process.
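
As an illustration of that point, here is a minimal sketch (not from the thread) contrasting many small transcoding calls with a single large one; the method names and the 64-char chunk size are just for the example:

using System;
using System.Text;

static class ChunkingSketch
{
    // Chunked version: pays the per-call overhead (argument validation, vectorized
    // loop ramp-up) once per 64-char slice. Note that cutting a UTF-16 buffer at
    // arbitrary char boundaries can also split a surrogate pair, which is the
    // correctness issue discussed further down the thread.
    public static int ManySmallCalls(ReadOnlySpan<char> text, Span<byte> utf8, int chunkSize = 64)
    {
        int totalBytes = 0;
        for (int offset = 0; offset < text.Length; offset += chunkSize)
        {
            int length = Math.Min(chunkSize, text.Length - offset);
            totalBytes += Encoding.UTF8.GetBytes(text.Slice(offset, length), utf8.Slice(totalBytes));
        }
        return totalBytes;
    }

    // Single-call version: one scan over the whole buffer.
    public static int OneLargeCall(ReadOnlySpan<char> text, Span<byte> utf8)
        => Encoding.UTF8.GetBytes(text, utf8);
}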

Additionally, most text processing frameworks are more efficient when operating over UTF-16 data than UTF-8 data. So manipulating buffers of chars (as UTF-16) will generally give better performance compared to manipulating buffers of bytes (as UTF-8). I'm generalizing quite a bit here, but this is a good rule of thumb to follow.

Taken together, this means that one would reasonably expect the following options to perform as listed, in descending order from fastest to slowest:

  1. Operate fully in UTF-16 (as chars) while building up the response body, then perform a single UTF-16 to UTF-8 conversion right before the response body is sent to the wire.

  2. Operate fully in UTF-8 while building up the response body.

  3. Perform intermediate transcodings of UTF-8 to UTF-16 (or vice versa) while building up the response body, optionally performing a final transcoding at the very end if necessary.

benaadams commented 4 years ago

@benaadams the code you linked to is good for experimentation but isn't really something we can ship. It takes many shortcuts that aren't appropriate for a production web server.

The 2 main shortcuts are:

  1. as we know we are encoding to UTF-8, it only encodes 5 chars (<, >, &, ', ") and control chars less than space;

  2. it encodes directly from a string to encoded UTF-8 HTML bytes written to the provided span.

The way the framework libs work, by contrast, is string -> HTML-encoded string -> UTF-8 bytes, so the data ends up being scanned/processed many times to get the output.

GrabYourPitchforks commented 4 years ago

As we know we are encoding to UTF-8, only encoding 5 chars (<, >, &, ', ") and control chars less than space

We can't ship that as the default behavior as it would violate SDL requirements. Microsoft policy is that we need to use allow-listing encoders rather than deny-listing encoders. (There's nothing against including a deny-listing encoder inbox as far as I can tell, but it can't be enabled by default.)

I do have a separate work item to expand the default encoders to allow some non-ASCII characters by default. But they'd still be subject to "allow list"-style code paths.

Encode directly from a string to encoded UTF-8 HTML bytes to the provided span

The WriteUtf8Encoding method takes incorrect shortcuts that could fail on certain inputs, such as inputs that contain characters from non-BMP planes.

Specifically, the Encoding classes can only operate over complete text buffers. If you have the string "abcdef" and send it through an Encoding instance, you might get a different result than if you send "abc" and "def" separately and concatenate their results. The only framework APIs that allow operating on partial text buffers are the Encoder / Decoder classes or System.Text.Unicode.Utf8.
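
For instance, here is a hedged sketch using one of those partial-buffer-safe APIs (System.Text.Unicode.Utf8.FromUtf16) to transcode straight into an IBufferWriter<byte> such as Kestrel's PipeWriter; the helper name and size hint are assumptions made for the example:

using System;
using System.Buffers;
using System.Text.Unicode;

static class PartialBufferSketch
{
    // Transcode a UTF-16 payload into fixed-size output chunks without splitting
    // surrogate pairs: Utf8.FromUtf16 reports how many chars it actually consumed,
    // so the loop can resume exactly where the previous chunk stopped.
    public static void WriteUtf8(ReadOnlySpan<char> source, IBufferWriter<byte> output)
    {
        while (true)
        {
            // sizeHint guarantees a reasonably sized destination, so progress is always made.
            Span<byte> destination = output.GetSpan(sizeHint: 256);

            OperationStatus status = Utf8.FromUtf16(
                source, destination,
                out int charsRead, out int bytesWritten,
                replaceInvalidSequences: false,
                isFinalBlock: true);

            output.Advance(bytesWritten);
            source = source.Slice(charsRead);

            if (status == OperationStatus.Done)
                return;
            if (status != OperationStatus.DestinationTooSmall)
                throw new InvalidOperationException($"Transcoding failed: {status}");
            // DestinationTooSmall: loop and request another chunk from the writer.
        }
    }
}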

benaadams commented 4 years ago

The WriteUtf8Encoding method takes incorrect shortcuts that could fail on certain inputs, such ...

Oh sure, it's just a copy of Encoder.Convert but without access to the fallback buffers. I don't deny it's problematic.

My main point is that the way the framework encodes string -> HTML string -> UTF-8 bytes has performance pitfalls.

The HTML encoder needs to be overly conservative because it doesn't know what is valid in the final text encoding (e.g. if it were encoding to ASCII or Latin1, it would need to convert almost all of the wider planes of Unicode to HTML character references).

Adding extra planes for the HTML encoder to allow through means you need to construct one yourself, and as soon as the first char that needs encoding is hit, every char after that is tested with a virtual call. At the end of that you then materialise a second string, which then needs to be UTF-8 encoded.

This ends up making many passes over the original UTF-16 data (ignoring that it starts as a UTF-8 string from the database), rather than streaming it out as HTML-encoded UTF-8 in a single(ish) pass.
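
To make the multi-pass point concrete, here is that framework path annotated pass by pass; it is the same sequence of calls as the Current benchmark earlier in the thread, and the wrapper method is only for illustration:

using System;
using System.Text;
using System.Text.Encodings.Web;

static class MultiPassSketch
{
    public static int EncodeFortune(HtmlEncoder encoder, string input, Span<byte> output)
    {
        // Pass 1: scan the UTF-16 input and materialise a second, HTML-encoded string.
        string encoded = encoder.Encode(input);

        // Pass 2: scan the encoded string to count its UTF-8 bytes.
        int byteCount = Encoding.UTF8.GetByteCount(encoded);

        // Pass 3: scan it again to actually write the UTF-8 bytes.
        return Encoding.UTF8.GetBytes(encoded.AsSpan(), output.Slice(0, byteCount));
    }
}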

Of those multiple passes, the HTML encoding is a little over twice as expensive as the UTF-8 encoding:

[profiler screenshot]

jeffhandley commented 1 year ago

Looking back at this issue, there's nothing actionable standing out. Closing.