Open dh-jbouhet opened 4 years ago
We setup a unit test and fix this. I was about to release the whole stack so this should be available (v9.0.1) this evening or tomorrow morning.
Not reproducible:
will throw an exception because we are trying to do a .Substring(232, 24); // 232 + 24 = 256 on a string of 255 bytes length.
Could you elaborate?
You can look at the repro I made in the commit above. I also removed NUnitLite in favor of VSTest adapter: it's easier to use, you're welcome to clone the repo and try to reproduce this issue. I keep this issue open for the moment.
Ok thanks.
Here the exception from the debugger:
If I decompose:
_encoding.GetString( _buffer )
returns: "141222267657207,"attributedUnitsOrdered14dSameSKU":0,"attributedConversions14dSameSKU":0,"clicks":0,"asin":"B085CJH6YT","attributedSales14d":0,"campaignName":"ASIN Ausrichtung Zahnbürstenaufsatz","attributedUnitsOrdered30dSameSKU":0,"attributedUnitsOrdere"
If you do that on this string:
.Substring( (int)(index - _bufferPosition), length )
=> .Substring( 232, 24)
You get an exception as I reproduced here:
https://dotnetfiddle.net/YYsizt
I will try to reproduce it on Monday with your unit tests.
Be careful with that umlaut character (ü
) in ASIN Ausrichtung Zahnbürstenaufsatz
.
As a reminder, ü
is 1 character, and 2 bytes in UTF-8.
@olivier-spinelli VirtualString.GetText( long index, int length )
claims that length
is the length of the string, which is ambiguous - all operations use it as an array byte length and not simply a character index (since _buffer
is a byte[]
, not a char[]
).
It's not just the number of characters to read - it's the number of bytes to read from the underlying textStream
.
EDIT: This will lead to potentially broken results if the buffer ends in the middle of a multibyte character, as demonstrated here:
public void virtual_string_does_not_like_multibyte_characters_1()
{
var encoding = new UTF8Encoding( false );
// i = 0 to 25: Single-byte
// i = 26 and 27: Multi-byte
// i = 28 to 53: Single-byte
string testStr = @"ABCDEFGHIJKLMNOPQRSTUVWXYZüABCDEFGHIJKLMNOPQRSTUVWXYZ";
byte[] utf8bytes = encoding.GetBytes( testStr );
using( MemoryStream ms = new MemoryStream() )
{
ms.Write( utf8bytes );
ms.Position = 0;
var v = new VirtualString( ms, 0, 27, encoding );
{
// Bug (?): Reading 57 bytes does not read 57 characters, depending on encoding and content
v.GetText( 0, 27 ).Should().Be( "ABCDEFGHIJKLMNOPQRSTUVWXYZü" ); // But is ABCDEFGHIJKLMNOPQRSTUVWXYZ?
}
}
}
But if the buffer does not end on a multibyte character and contains multibyte chartacters, VirtualString will break if trying to read the entire array:
[Test]
public void virtual_string_does_not_like_multibyte_characters_2()
{
var encoding = new UTF8Encoding( false );
// i = 0 to 25: Single-byte
// i = 26 and 27: Multi-byte
// i = 28 to 53: Single-byte
string testStr = @"ABCDEFGHIJKLMNOPQRSTUVWXYZüABCDEFGHIJKLMNOPQRSTUVWXYZ";
byte[] utf8bytes = encoding.GetBytes( testStr );
using( MemoryStream ms = new MemoryStream() )
{
ms.Write( utf8bytes );
ms.Position = 0;
var v = new VirtualString( ms, 0, 256, encoding );
{
v.Invoking( _ => _.GetText( 0, utf8bytes.Length ) ).Should().NotThrow(); // ArgumentOutOfRangeException
}
}
}
Perfect Ben, thanks a lot. This clearly shows that the idea itself of faking a binary stream as being a huuuuge string is totally buggy as soon as multi-byte characters appear in the stream. The solution would be to maintain a set of byte offsets at which a mbcs has been found, and, before moving to a char-offset X, ensuring that the corresponding byte-offset Y (Y>=X) has been reached linearly from the beginning of the stream. An array of (long Y, long AccumulatedX) tuples could be used to memorize all that we need to implement the function long GetByteOffsetFromCharOffset( long charOffset )....
This is doable.... however since System.Text.Json now works at byte level, natively in UTF-8, and that this is clearly the future, I think we should have a better time to work on a brand new matcher, that would be based on the new System.Text.Json API rather than handling this....
Ok, indeed, the problem seems to be more complex than I thought.
Here an low-level example to undertand the trap:
public static void Main()
{
var simpleString = @"AA";
Console.WriteLine(GetCharCount(simpleString)); // Display: 2
var stringWithSpecialCharacter = @"üA";
Console.WriteLine(GetCharCount(stringWithSpecialCharacter)); // Display: 1
}
public static int GetCharCount(string str)
{
var encoding = new UTF8Encoding( false );
byte[] utf8bytes = encoding.GetBytes( str );
using var ms = new MemoryStream();
ms.Write( utf8bytes );
ms.Position = 0;
var buffer = new Byte[2];
ms.Read(buffer, 0, buffer.Length);
var cnt = encoding.GetCharCount(buffer);
return cnt;
}
https://dotnetfiddle.net/0ZSY9U
An other fix will be to count the "real" number of bytes get by the encoding method. When move the cursor, do not do that anymore:
void MoveBuffer( long index )
{
_textStream.Position = index;
_textStream.Read( _buffer, 0, (int)Math.Min( _bufferSize, _length ) );
_bufferPosition = index;
}
But get a count of byte read in the buffer and increment the cursor from that data instead:
void MoveBuffer( long index )
{
// Get the real number of char read from _encoding.GetCharCount( _buffer );
// Increment until we have buffer position == index
}
And do the same when we call GetText, do not rely on index + length but do a GetCharCount before and adjust the returned string according to that. This is clearly a empirical approach.
Meantime, I will increase the buffer size and hope to not have this problem too often.
Thanks for your help.
Context When we read a json file with a 256 bytes buffer size, we get on some files an exception when parsing the file.
Exception System.ArgumentOutOfRangeException HResult=0x80131502 Message=Index and length must refer to a location within the string. Source=System.Private.CoreLib StackTrace: at System.String.Substring(Int32 startIndex, Int32 length)
Problem explanation
When enter in the GetText method, if you have these parameters: index => 7042 length => 24
With this Buffer state: _bufferPosition => 6810 _bufferSize => 256
This line:
resultString += _encoding.GetString( _buffer ).Substring( (int)(index - _bufferPosition), length );
will throw an exception because we are trying to do a
.Substring(232, 24); // 232 + 24 = 256
on a string of 255 bytes length.Suggestion for the fix
Change this line:
if( !IsBuffered( index ) || index + length > _bufferPosition + _bufferSize ) MoveBuffer( index );
and add an equal operator:
if( !IsBuffered( index ) || index + length >= _bufferPosition + _bufferSize ) MoveBuffer( index );