Failed to querying with like operator for a record contains Chinese chars

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Download the source code which help you to reproduce the issue. Download 
link: https://docs.google.com/open?id=0BwKD9xrjSqz0Vzd6LUV5aGhkeFU

2. Open the solution CsharpSqlite_LikeOperatorTest.sln.

3. In the Visual Studio, run  
CsharpSqlite_LikeOperatorTest\Debug\TestSqliteCInterface.exe. This is an app 
which uses C interface of sqlite3. It query the database with the statement: 
select * from Stock where Name like '%翡翠%'. And It will successfully get 1 
row result.

Note: please don't run directly in the explorer's folder since the app relys on 
Visual Studio's environment path settings.

4.Also in the Visual Studio, run 
CsharpSqlite_LikeOperatorTest\TestCsharpSqliteInterface\bin\Debug\TestCsharpSqli
teInterface.exe. This is an app which uses Csharp-Sqlite. It query the database 
uses the same sql query as step 3:
 select * from Stock where Name like '%翡翠%'.
But no records will be returned.

What is the expected output? What do you see instead?
In step 4, a record should be returned:
35|110000020|满绿翡翠钻石黄金戒指|oeifjasdjfadsjfakdjfalkjaljdg|12000.
0|1331834280|10.0|只|1|0|1331834280|1|0|1|

What version of the product are you using? On what operating system?
SQLite version 3.7.10 2012-01-16 13:28:40
csharp-sqlite_3_7_7_1_71.zip

Please provide any additional information below.
The database Ptolemaic.db have a table to be used in the reproducing code, is 
Stock. In the table there is only one record:
35|110000020|满绿翡翠钻石黄金戒指|oeifjasdjfadsjfakdjfalkjaljdg|12000.
0|1331834280|10.0|只|1|0|1331834280|1|0|1|
If you query the table with:
 select * from Stock where Name like '%满绿翡翠钻石黄金戒指%'
CSharp-sqlite will return the only record. But if you query with
 select * from Stock where Name like '%翡翠%'
CSharp-sqlite will return nothing.
In both cases, Sqlite3 C interface will always return the only record correctly.

So many thanks!

Original issue reported on code.google.com by sunshak...@gmail.com on 24 Oct 2012 at 12:55

GoogleCodeExporter commented 9 years ago

Original comment by noah.hart@gmail.com on 29 Oct 2012 at 7:41

Changed state: Reviewed

GoogleCodeExporter commented 9 years ago

It seems that in file utf_c.cs in function

static u32 sqlite3Utf8Read(
    string zIn,          /* First byte of UTF-8 character */
    ref string pzNext   /* Write first byte past UTF-8 char here */
    )

the line:
          while ( zIndex != zIn.Length && ( zIn[zIndex] & 0xc0 ) == 0x80 )

needs to be changed to: 
          while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )

Being new to the coding architecture of SQLite / c#-SQLite, it took me some two 
days to find this. Please help me testing whether this now fits for all UTF-8 
characters.

Best regards

LLD

Original comment by imapim...@gmail.com on 23 Dec 2012 at 10:46

GoogleCodeExporter commented 9 years ago

Thanks LLD for your effort!
I've tested your fix and it worked for the query:
select * from Stock where Name like '%翡翠%

But I prepared another database file, and tried this query but failed:
"select * from Stock where Name like \'%玛瑙%\'"

REPRO STEPS:
1. Download the test code 
https://docs.google.com/open?id=0BwKD9xrjSqz0cUNsSEpBSjNxeWM . I've changed the 
code to search %玛瑙%. And a new database file is in the zip file.
2. Run TestCsharpSqliteInterface
3. Check the follow function's result.
private static string TestQuery(Sqlite3.sqlite3 database, string sql)

Expected:
The query should return 2 records.

Actually:
No records returned.

Other Information:

There are two records in the database.
sqlite> select * from Stock;
964|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|0|0|1353415803|0|0|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI;   2012/11/20 
12:42:50 2012/11/20 12:50:03
973|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|1|0|1353418029|0|1|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI;   2012/11/20 
12:42:51 2012/11/20 13:27:09

Original comment by sunshak...@gmail.com on 24 Dec 2012 at 2:56

GoogleCodeExporter commented 9 years ago

Hm, I admit I only did few testing. Actually, this function is pretty odd. When 
I comment out most of its code it seems to cover all LIKE operations:

      int zIndex = 0;
      u32 c = zIn[zIndex++];

      //if ( c >= 0xc0 )
      //{
      //  //if ( c > 0xff ) c = 0;
      //  //else
      //  {
      //    //c = sqlite3Utf8Trans1[c - 0xc0];            1100 0000  1000 0000
      //    while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )
      //    {
      //      c = (u32)( ( c << 6 ) + ( 0x3f & zIn[zIndex++] ) );
      //    }
      //    if ( c < 0x80
      //    || ( c & 0xFFFFF800 ) == 0xD800
      //    || ( c & 0xFFFFFFFE ) == 0xFFFE )
      //    {
      //      c = 0xFFFD;
      //    }
      //  }
      //}
      pzNext = zIn.Substring( zIndex );
      return c;

Of course, comparison is now done in UNICODE, not UTF-8, but for the LIKE 
operator this seems to work.
Sunshak, can you try this out?

I will further look into this during the next days and hopefully will be able 
to come up with a more solid solution.

Original comment by imapim...@gmail.com on 24 Dec 2012 at 4:14

GoogleCodeExporter commented 9 years ago

Hi LLD,

Looks nice. I've tested a couple of queries, and did not find any issue yet. I 
will make a deep test later.

Thank you for your help!  :-)

Original comment by sunshak...@gmail.com on 24 Dec 2012 at 8:34

GoogleCodeExporter commented 9 years ago

I've tried as many keywords as possible, and no issue found.
The fix looks reliable.
I think the fix could be get reviewed by the project owner.

Original comment by sunshak...@gmail.com on 24 Dec 2012 at 12:47

GoogleCodeExporter commented 9 years ago

Have you tested with the full testsuite or just this one edge case?

Original comment by noah.hart@gmail.com on 24 Dec 2012 at 1:21

GoogleCodeExporter commented 9 years ago

I've tested it manually with keywords random came into my mind, and as many as 
possible, including this one edge case.

Original comment by sunshak...@gmail.com on 24 Dec 2012 at 1:32

GoogleCodeExporter commented 9 years ago

The function in comment #4 correctly handles all UNICODE characters in the 
range of UNICODE code points U+0000 to U+FFFF (which covers the most frequently 
used characters). However, I think it won't correctly handle characters in the 
code point range U+10000 to U+10FFFF. This conclusion is not (yet) based on 
tests but on the following literature:

The Microsoft docs say: 

"Indexes [of Strings]
An index is the position of a Char object, not a Unicode character, in a 
String. An index is a zero-based, nonnegative number starting from the first 
position in the string, which is index position zero. Consecutive index values 
might not correspond to consecutive Unicode characters because a Unicode 
character might be encoded as more than one Char object. To work with each 
Unicode character instead of each Char object, use the 
System.Globalization.StringInfo class. 
(http://msdn.microsoft.com/en-us/library/system.string(v=vs.80).aspx)"

Since the function gets an index value (u32 c = zIn[zIndex++];), this will 
always be a 16 bit value that may not equal a character's UNICODE value in case 
the character is a surrogate or combining character.

The Microsoft docs for the StringInfo class 
(http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo(v=vs.80
).aspx) include a sample that I think serves as a good starting point to make 
function sqlite3Utf8Read capable of handling characters beyond code point 
U+FFFF correctly.

BUT then again: The original SQLite c code comments say that "this routine 
never allows a UTF16 surrogate value to be encoded [...]" and instead returns 
0xfffd in such cases. The question is whether C#-SQLite should correctly 
implement the UNICODE standard here or rather act the same way SQLite does. 

I will conduct further tests on this issue and work out a fix if necessary. 
Unfortunately, I won't be able to do this before something like mid of January.

Best regards

LLD

Original comment by imapim...@gmail.com on 26 Dec 2012 at 2:57

GoogleCodeExporter commented 9 years ago

I really appreciate your work. Hope you will be back after the new year 
vacation.

Happy new year!  :-)

Xueqing Sun

Original comment by sunshak...@gmail.com on 27 Dec 2012 at 2:08

GitOffice / csharp-sqlite

Failed to querying with like operator for a record contains Chinese chars #170