Open GoogleCodeExporter opened 8 years ago
Original comment by noah.hart@gmail.com
on 29 Oct 2012 at 7:41
It seems that in file utf_c.cs in function
static u32 sqlite3Utf8Read(
string zIn, /* First byte of UTF-8 character */
ref string pzNext /* Write first byte past UTF-8 char here */
)
the line:
while ( zIndex != zIn.Length && ( zIn[zIndex] & 0xc0 ) == 0x80 )
needs to be changed to:
while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )
Being new to the coding architecture of SQLite / c#-SQLite, it took me some two
days to find this. Please help me testing whether this now fits for all UTF-8
characters.
Best regards
LLD
Original comment by imapim...@gmail.com
on 23 Dec 2012 at 10:46
Thanks LLD for your effort!
I've tested your fix and it worked for the query:
select * from Stock where Name like '%翡翠%
But I prepared another database file, and tried this query but failed:
"select * from Stock where Name like \'%玛瑙%\'"
REPRO STEPS:
1. Download the test code
https://docs.google.com/open?id=0BwKD9xrjSqz0cUNsSEpBSjNxeWM . I've changed the
code to search %玛瑙%. And a new database file is in the zip file.
2. Run TestCsharpSqliteInterface
3. Check the follow function's result.
private static string TestQuery(Sqlite3.sqlite3 database, string sql)
Expected:
The query should return 2 records.
Actually:
No records returned.
Other Information:
There are two records in the database.
sqlite> select * from Stock;
964|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|0|0|1353415803|0|0|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI; 2012/11/20
12:42:50 2012/11/20 12:50:03
973|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|1|0|1353418029|0|1|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI; 2012/11/20
12:42:51 2012/11/20 13:27:09
Original comment by sunshak...@gmail.com
on 24 Dec 2012 at 2:56
Hm, I admit I only did few testing. Actually, this function is pretty odd. When
I comment out most of its code it seems to cover all LIKE operations:
int zIndex = 0;
u32 c = zIn[zIndex++];
//if ( c >= 0xc0 )
//{
// //if ( c > 0xff ) c = 0;
// //else
// {
// //c = sqlite3Utf8Trans1[c - 0xc0]; 1100 0000 1000 0000
// while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )
// {
// c = (u32)( ( c << 6 ) + ( 0x3f & zIn[zIndex++] ) );
// }
// if ( c < 0x80
// || ( c & 0xFFFFF800 ) == 0xD800
// || ( c & 0xFFFFFFFE ) == 0xFFFE )
// {
// c = 0xFFFD;
// }
// }
//}
pzNext = zIn.Substring( zIndex );
return c;
Of course, comparison is now done in UNICODE, not UTF-8, but for the LIKE
operator this seems to work.
Sunshak, can you try this out?
I will further look into this during the next days and hopefully will be able
to come up with a more solid solution.
Original comment by imapim...@gmail.com
on 24 Dec 2012 at 4:14
Hi LLD,
Looks nice. I've tested a couple of queries, and did not find any issue yet. I
will make a deep test later.
Thank you for your help! :-)
Original comment by sunshak...@gmail.com
on 24 Dec 2012 at 8:34
I've tried as many keywords as possible, and no issue found.
The fix looks reliable.
I think the fix could be get reviewed by the project owner.
Original comment by sunshak...@gmail.com
on 24 Dec 2012 at 12:47
Have you tested with the full testsuite or just this one edge case?
Original comment by noah.hart@gmail.com
on 24 Dec 2012 at 1:21
I've tested it manually with keywords random came into my mind, and as many as
possible, including this one edge case.
Original comment by sunshak...@gmail.com
on 24 Dec 2012 at 1:32
The function in comment #4 correctly handles all UNICODE characters in the
range of UNICODE code points U+0000 to U+FFFF (which covers the most frequently
used characters). However, I think it won't correctly handle characters in the
code point range U+10000 to U+10FFFF. This conclusion is not (yet) based on
tests but on the following literature:
The Microsoft docs say:
"Indexes [of Strings]
An index is the position of a Char object, not a Unicode character, in a
String. An index is a zero-based, nonnegative number starting from the first
position in the string, which is index position zero. Consecutive index values
might not correspond to consecutive Unicode characters because a Unicode
character might be encoded as more than one Char object. To work with each
Unicode character instead of each Char object, use the
System.Globalization.StringInfo class.
(http://msdn.microsoft.com/en-us/library/system.string(v=vs.80).aspx)"
Since the function gets an index value (u32 c = zIn[zIndex++];), this will
always be a 16 bit value that may not equal a character's UNICODE value in case
the character is a surrogate or combining character.
The Microsoft docs for the StringInfo class
(http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo(v=vs.80
).aspx) include a sample that I think serves as a good starting point to make
function sqlite3Utf8Read capable of handling characters beyond code point
U+FFFF correctly.
BUT then again: The original SQLite c code comments say that "this routine
never allows a UTF16 surrogate value to be encoded [...]" and instead returns
0xfffd in such cases. The question is whether C#-SQLite should correctly
implement the UNICODE standard here or rather act the same way SQLite does.
I will conduct further tests on this issue and work out a fix if necessary.
Unfortunately, I won't be able to do this before something like mid of January.
Best regards
LLD
Original comment by imapim...@gmail.com
on 26 Dec 2012 at 2:57
I really appreciate your work. Hope you will be back after the new year
vacation.
Happy new year! :-)
Xueqing Sun
Original comment by sunshak...@gmail.com
on 27 Dec 2012 at 2:08
Original issue reported on code.google.com by
sunshak...@gmail.com
on 24 Oct 2012 at 12:55