SceneGate / Yarhl

Framework for the implementation of format converters like game assets or media files
https://scenegate.github.io/Yarhl/
MIT License

:sparkles: Implement simple base class for custom encoding implementations #181

Closed pleonex closed 1 year ago

pleonex commented 2 years ago

Description

Add a simple abstract class for implementing custom encodings. Implementers only need to provide the encode and decode methods, which are based on the high-performance Span types, and the class provides helper methods to report invalid chars and bytes. The byte and char count methods reuse the encode and decode methods, since the two implementations are typically very similar and performance is not affected. This class is already used in Metatron to implement the Persona encoding.
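The idea of deriving the count methods from the conversion methods can be sketched without Yarhl. The snippet below is a minimal illustration, not the code added by this PR: `SketchSpanEncoding` and `AsciiSketchEncoding` are hypothetical names, and a plain `countOnly` flag stands in for the `SpanStream` buffer that the real class uses. Only `System.Text.Encoding` and its abstract members are real API here.

```csharp
using System;
using System.Text;

// Usage: the inherited string helpers call the count methods first, which in
// turn run the same Encode/Decode code in count-only mode.
var encoding = new AsciiSketchEncoding();
byte[] data = encoding.GetBytes("abc");
Console.WriteLine(BitConverter.ToString(data)); // 61-62-63
Console.WriteLine(encoding.GetString(data));    // abc

// Hypothetical, simplified stand-in for the base class described above.
abstract class SketchSpanEncoding : Encoding
{
    // Subclasses implement only these two methods. When countOnly is true they
    // must not touch 'output'; they always return the number of items produced.
    protected abstract int Encode(ReadOnlySpan<char> chars, Span<byte> output, bool countOnly);

    protected abstract int Decode(ReadOnlySpan<byte> bytes, Span<char> output, bool countOnly);

    public override int GetByteCount(char[] chars, int index, int count) =>
        Encode(chars.AsSpan(index, count), Span<byte>.Empty, countOnly: true);

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) =>
        Encode(chars.AsSpan(charIndex, charCount), bytes.AsSpan(byteIndex), countOnly: false);

    public override int GetCharCount(byte[] bytes, int index, int count) =>
        Decode(bytes.AsSpan(index, count), Span<char>.Empty, countOnly: true);

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) =>
        Decode(bytes.AsSpan(byteIndex, byteCount), chars.AsSpan(charIndex), countOnly: false);
}

// Toy 7-bit ASCII encoding built on the sketch (illustration only).
sealed class AsciiSketchEncoding : SketchSpanEncoding
{
    public override int GetMaxByteCount(int charCount) => charCount;

    public override int GetMaxCharCount(int byteCount) => byteCount;

    protected override int Encode(ReadOnlySpan<char> chars, Span<byte> output, bool countOnly)
    {
        for (int i = 0; i < chars.Length; i++) {
            if (chars[i] > 0x7F) {
                throw new EncoderFallbackException($"Invalid char at index {i}");
            }

            if (!countOnly) {
                output[i] = (byte)chars[i];
            }
        }

        return chars.Length;
    }

    protected override int Decode(ReadOnlySpan<byte> bytes, Span<char> output, bool countOnly)
    {
        for (int i = 0; i < bytes.Length; i++) {
            if (bytes[i] > 0x7F) {
                throw new DecoderFallbackException($"Invalid byte at index {i}");
            }

            if (!countOnly) {
                output[i] = (char)bytes[i];
            }
        }

        return bytes.Length;
    }
}
```

With this shape a subclass writes the conversion logic once, and the counting pass costs one extra walk over the input without allocating an intermediate buffer.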

Also refactor the performance test app so we can run all the tests individually.

Performance comparison

A custom implementation built directly on Encoding and one built on this class perform practically the same: in the benchmark below the new class is 6% to 12% slower. Thanks to the advanced use of Span, which this class simplifies, allocations are about 2.5x to 3x lower and match those of the standard .NET implementation. The benchmark compares a custom Shift-JIS implementation with the one provided by .NET. Both custom versions are roughly 5x to 8x slower than the .NET encoding, because the latter applies low-level optimizations with binary dictionaries. Our implementation represents a typical use case: a chain of conditions that follows the spec. The timings are acceptable for inputs of up to 5 MB of text (about 0.5 s), memory usage is low, and the implementation is very simple.


BenchmarkDotNet=v0.13.1, OS=fedora 34
Intel Core i7-4720HQ CPU 2.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.102
  [Host]     : .NET 6.0.2 (6.0.222.11401), X64 RyuJIT
  DefaultJob : .NET 6.0.2 (6.0.222.11401), X64 RyuJIT
| Method | TextLength | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Sjis | 280 | 3.428 μs | 0.0336 μs | 0.0314 μs | 0.19 | 0.00 | 0.3662 | - | - | 1 KB |
| SjisCustomEncoding | 280 | 18.076 μs | 0.1248 μs | 0.1167 μs | 1.00 | 0.00 | 0.9460 | - | - | 3 KB |
| SjisCustomYarhlEncoding | 280 | 19.227 μs | 0.1089 μs | 0.1019 μs | 1.06 | 0.01 | 0.3967 | - | - | 1 KB |
| Sjis | 5242880 | 59,626.926 μs | 430.4532 μs | 402.6462 μs | 0.14 | 0.00 | 222.2222 | 222.2222 | 222.2222 | 20,225 KB |
| SjisCustomEncoding | 5242880 | 413,645.574 μs | 3,893.8989 μs | 3,642.3555 μs | 1.00 | 0.00 | - | - | - | 50,950 KB |
| SjisCustomYarhlEncoding | 5242880 | 462,457.768 μs | 8,352.3387 μs | 7,812.7828 μs | 1.12 | 0.02 | - | - | - | 20,227 KB |

Example

private sealed class CustomSjisYarhlEncoding : SimpleSpanEncoding
{
    private readonly Dictionary<int, int> codeToUnicode;
    private readonly Dictionary<int, int> unicodeToCode;

    public CustomSjisYarhlEncoding(Dictionary<int, int> codeToUnicode, Dictionary<int, int> unicodeToCode)
        : base(0, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)
    {
        this.codeToUnicode = codeToUnicode;
        this.unicodeToCode = unicodeToCode;
    }

    public override string EncodingName => "sjis-yarhl";

    public override int GetMaxByteCount(int charCount) => charCount * 2;

    public override int GetMaxCharCount(int byteCount) => byteCount;

    protected override void Decode(ReadOnlySpan<byte> bytes, SpanStream<char> buffer)
    {
        byte lead = 0x00;
        int count = bytes.Length;
        for (int i = 0; i < count; i++) {
            int codePoint = -1;
            byte current = bytes[i];

            if (lead != 0x00) {
                int offset = (current < 0x7F) ? 0x40 : 0x41;
                int leadOffset = (lead < 0xA0) ? 0x81 : 0xC1;

                bool inRange1 = current is >= 0x40 and <= 0x7E;
                bool inRange2 = current is >= 0x80 and <= 0xFC;
                if (!inRange1 && !inRange2) {
                    DecodeUnknownBytes(buffer, i, current);
                }

                int pointer = ((lead - leadOffset) * 188) + current - offset;
                if (pointer is >= 8836 and <= 10715) {
                    codePoint = 0xE000 - 8836 + pointer;
                } else {
                    if (!codeToUnicode.TryGetValue(pointer, out codePoint)) {
                        DecodeUnknownBytes(buffer, i, current);
                    }
                }

                lead = 0x00;
            } else if (current == 0x5C) {
                codePoint = 0x00A5; // yen
            } else if (current == 0x7E) {
                codePoint = 0x203E; // overline
            } else if (current < 0x80) {
                codePoint = current;
            } else if (current is >= 0xA1 and <= 0xDF) {
                codePoint = 0xFF61 - 0xA1 + current;
            } else if (current is (>= 0x81 and <= 0x9F) or (>= 0xE0 and <= 0xFC)) {
                lead = current;
            } else {
                throw new FormatException();
            }

            if (codePoint != -1) {
                buffer.Write((char)codePoint);
            }
        }

        // A lead byte left without its trail byte at the end of the input is invalid.
        if (lead != 0x00) {
            DecodeUnknownBytes(buffer, count - 2, lead);
        }
    }

    protected override void Encode(ReadOnlySpan<char> chars, SpanStream<byte> buffer, bool isFallbackText = false)
    {
        int count = chars.Length;
        for (int i = 0; i < count; i++) {
            ushort codePoint = chars[i];

            if (codePoint == 0x00A5) {
                buffer.Write(0x5C);
            } else if (codePoint == 0x203E) {
                buffer.Write(0x7E);
            } else if (codePoint < 0x80) {
                buffer.Write((byte)codePoint);
            } else if (codePoint is >= 0xFF61 and <= 0xFF9F) {
                buffer.Write((byte)(codePoint - 0xFF61 + 0xA1));
            } else {
                if (codePoint == 0x2212) {
                    codePoint = 0xFF0D;
                }

                if (!unicodeToCode.TryGetValue(codePoint, out int code)) {
                    EncodeUnknownChar(buffer, codePoint, i, isFallbackText);
                }

                int lead = code / 188;
                int leadOffset = (lead < 0x1F) ? 0x81 : 0xC1;
                int trail = code % 188;
                int offset = (trail < 0x3F) ? 0x40 : 0x41;
                buffer.Write((byte)(lead + leadOffset));
                buffer.Write((byte)(trail + offset));
            }
        }
    }
}
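The pointer arithmetic shared by `Decode` and `Encode` above can be checked in isolation. The following sketch extracts only those formulas into a hypothetical helper class, `SjisPointer` (not part of Yarhl), and round-trips one double-byte sequence. The formulas mirror the Shift_JIS decoder and encoder steps of the WHATWG Encoding Standard; the lookup dictionaries are left out, so this shows the byte-to-pointer mapping only, not the final code point.

```csharp
using System;

// Round trip for the double-byte sequence 0x88 0x9F.
int pointer = SjisPointer.FromBytes(0x88, 0x9F);
(byte lead, byte trail) = SjisPointer.ToBytes(pointer);
Console.WriteLine($"{pointer} -> {lead:X2} {trail:X2}"); // 1410 -> 88 9F

// Hypothetical helper holding the formulas from the example above.
static class SjisPointer
{
    // Same computation as the lead/trail branch of Decode.
    public static int FromBytes(byte lead, byte trail)
    {
        int offset = (trail < 0x7F) ? 0x40 : 0x41;
        int leadOffset = (lead < 0xA0) ? 0x81 : 0xC1;
        return ((lead - leadOffset) * 188) + trail - offset;
    }

    // Same computation as the double-byte branch of Encode.
    public static (byte Lead, byte Trail) ToBytes(int pointer)
    {
        int lead = pointer / 188;
        int leadOffset = (lead < 0x1F) ? 0x81 : 0xC1;
        int trail = pointer % 188;
        int offset = (trail < 0x3F) ? 0x40 : 0x41;
        return ((byte)(lead + leadOffset), (byte)(trail + offset));
    }
}
```

Each lead byte covers 188 trail positions, which is why the pointer is split with `/ 188` and `% 188`; the two offset pairs skip the gaps at 0x7F in the trail range and between 0x9F and 0xE0 in the lead range.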