Invenietis / CK-Text

Basic string and text utilities (StringMatcher, IEnumerable<string>.Concatenate, etc.)
MIT License
1 stars 2 forks source link

Out of range when apply a Substring call in GetText method #7

Open dh-jbouhet opened 4 years ago

dh-jbouhet commented 4 years ago

Context When we read a json file with a 256 bytes buffer size, we get on some files an exception when parsing the file.

Exception System.ArgumentOutOfRangeException HResult=0x80131502 Message=Index and length must refer to a location within the string. Source=System.Private.CoreLib StackTrace: at System.String.Substring(Int32 startIndex, Int32 length)

Problem explanation

When enter in the GetText method, if you have these parameters: index => 7042 length => 24

With this Buffer state: _bufferPosition => 6810 _bufferSize => 256

This line:

resultString += _encoding.GetString( _buffer ).Substring( (int)(index - _bufferPosition), length );

will throw an exception because we are trying to do a .Substring(232, 24); // 232 + 24 = 256 on a string of 255 bytes length.

Suggestion for the fix

Change this line:

if( !IsBuffered( index ) || index + length > _bufferPosition + _bufferSize ) MoveBuffer( index );

and add an equal operator:

if( !IsBuffered( index ) || index + length >= _bufferPosition + _bufferSize ) MoveBuffer( index );

olivier-spinelli commented 4 years ago

We setup a unit test and fix this. I was about to release the whole stack so this should be available (v9.0.1) this evening or tomorrow morning.

olivier-spinelli commented 4 years ago

Not reproducible:

will throw an exception because we are trying to do a .Substring(232, 24); // 232 + 24 = 256 on a string of 255 bytes length.

image

Could you elaborate?

olivier-spinelli commented 4 years ago

You can look at the repro I made in the commit above. I also removed NUnitLite in favor of VSTest adapter: it's easier to use, you're welcome to clone the repo and try to reproduce this issue. I keep this issue open for the moment.

dh-jbouhet commented 4 years ago

Ok thanks.

Here the exception from the debugger:

image

If I decompose:

"141222267657207,"attributedUnitsOrdered14dSameSKU":0,"attributedConversions14dSameSKU":0,"clicks":0,"asin":"B085CJH6YT","attributedSales14d":0,"campaignName":"ASIN Ausrichtung Zahnbürstenaufsatz","attributedUnitsOrdered30dSameSKU":0,"attributedUnitsOrdere"

If you do that on this string:

.Substring( (int)(index - _bufferPosition), length ) => .Substring( 232, 24)

You get an exception as I reproduced here:

https://dotnetfiddle.net/YYsizt

image

I will try to reproduce it on Monday with your unit tests.

bcrosnier commented 4 years ago

Be careful with that umlaut character (ü) in ASIN Ausrichtung Zahnbürstenaufsatz.

As a reminder, ü is 1 character, and 2 bytes in UTF-8.

@olivier-spinelli VirtualString.GetText( long index, int length ) claims that length is the length of the string, which is ambiguous - all operations use it as an array byte length and not simply a character index (since _buffer is a byte[], not a char[]).

It's not just the number of characters to read - it's the number of bytes to read from the underlying textStream.

EDIT: This will lead to potentially broken results if the buffer ends in the middle of a multibyte character, as demonstrated here:

public void virtual_string_does_not_like_multibyte_characters_1()
{
    var encoding = new UTF8Encoding( false );

    // i = 0 to 25: Single-byte
    // i = 26 and 27: Multi-byte
    // i = 28 to 53: Single-byte
    string testStr = @"ABCDEFGHIJKLMNOPQRSTUVWXYZüABCDEFGHIJKLMNOPQRSTUVWXYZ";
    byte[] utf8bytes = encoding.GetBytes( testStr );

    using( MemoryStream ms = new MemoryStream() )
    {
        ms.Write( utf8bytes );
        ms.Position = 0;

        var v = new VirtualString( ms, 0, 27, encoding );
        {
            // Bug (?): Reading 57 bytes does not read 57 characters, depending on encoding and content
            v.GetText( 0, 27 ).Should().Be( "ABCDEFGHIJKLMNOPQRSTUVWXYZü" ); // But is ABCDEFGHIJKLMNOPQRSTUVWXYZ?
        }
    }
}

But if the buffer does not end on a multibyte character and contains multibyte chartacters, VirtualString will break if trying to read the entire array:

[Test]
public void virtual_string_does_not_like_multibyte_characters_2()
{
    var encoding = new UTF8Encoding( false );

    // i = 0 to 25: Single-byte
    // i = 26 and 27: Multi-byte
    // i = 28 to 53: Single-byte
    string testStr = @"ABCDEFGHIJKLMNOPQRSTUVWXYZüABCDEFGHIJKLMNOPQRSTUVWXYZ";
    byte[] utf8bytes = encoding.GetBytes( testStr );

    using( MemoryStream ms = new MemoryStream() )
    {
        ms.Write( utf8bytes );
        ms.Position = 0;

        var v = new VirtualString( ms, 0, 256, encoding );
        {
            v.Invoking( _ => _.GetText( 0, utf8bytes.Length  ) ).Should().NotThrow(); // ArgumentOutOfRangeException
        }
    }
}
olivier-spinelli commented 4 years ago

Perfect Ben, thanks a lot. This clearly shows that the idea itself of faking a binary stream as being a huuuuge string is totally buggy as soon as multi-byte characters appear in the stream. The solution would be to maintain a set of byte offsets at which a mbcs has been found, and, before moving to a char-offset X, ensuring that the corresponding byte-offset Y (Y>=X) has been reached linearly from the beginning of the stream. An array of (long Y, long AccumulatedX) tuples could be used to memorize all that we need to implement the function long GetByteOffsetFromCharOffset( long charOffset )....

This is doable.... however since System.Text.Json now works at byte level, natively in UTF-8, and that this is clearly the future, I think we should have a better time to work on a brand new matcher, that would be based on the new System.Text.Json API rather than handling this....

dh-jbouhet commented 4 years ago

Ok, indeed, the problem seems to be more complex than I thought.

Here an low-level example to undertand the trap:

    public static void Main()
    {
        var simpleString = @"AA";
        Console.WriteLine(GetCharCount(simpleString)); // Display: 2
        var stringWithSpecialCharacter = @"üA";
        Console.WriteLine(GetCharCount(stringWithSpecialCharacter)); // Display: 1
    }
    public static int GetCharCount(string str)
    {
        var encoding = new UTF8Encoding( false );
        byte[] utf8bytes = encoding.GetBytes( str );
        using var ms = new MemoryStream();
        ms.Write( utf8bytes );
        ms.Position = 0;
        var buffer = new Byte[2];
        ms.Read(buffer, 0, buffer.Length);
        var cnt = encoding.GetCharCount(buffer);
        return cnt;
    }

https://dotnetfiddle.net/0ZSY9U

An other fix will be to count the "real" number of bytes get by the encoding method. When move the cursor, do not do that anymore:

void MoveBuffer( long index )
        {
            _textStream.Position = index;
            _textStream.Read( _buffer, 0, (int)Math.Min( _bufferSize, _length ) );
            _bufferPosition = index;
        }

But get a count of byte read in the buffer and increment the cursor from that data instead:

void MoveBuffer( long index )
        {
            // Get the real number of char read from _encoding.GetCharCount( _buffer );
            // Increment until we have buffer position == index
        }

And do the same when we call GetText, do not rely on index + length but do a GetCharCount before and adjust the returned string according to that. This is clearly a empirical approach.

Meantime, I will increase the buffer size and hope to not have this problem too often.

Thanks for your help.