Retrieving data from a ShiftJIS Interbase instance

MichaelCharles commented 1 year ago

Interbase version

SQL> SHOW VERSION;
ISQL Version: WI-V14.4.0.804
InterBase/x64/Windows (access method), version "WI-V14.0.0.97"
InterBase/x64/Windows (remote server), version "WI-V14.0.0.97/tcp (OKW157-VT01)/P15"
InterBase/x64/Windows (remote interface), version "WI-V14.4.0.804/tcp (Skylight)/P15"
on disk structure version 18.0

.NET Core 6 ASP.NET Project

Entity Framework/Interbase related packages:

    <PackageReference Include="InterBaseSql.Data.InterBaseClient" Version="7.13.6" />
    <PackageReference Include="InterBaseSql.EntityFrameworkCore.InterBase" Version="7.13.6" />
    <PackageReference Include="Microsoft.EntityFrameworkCore" Version="6.0.11" />
    <PackageReference Include="Microsoft.EntityFrameworkCore.Design" Version="6.0.11" />
    <PackageReference Include="Microsoft.EntityFrameworkCore.SqlServer" Version="6.0.11" />
    <PackageReference Include="Microsoft.EntityFrameworkCore.Tools" Version="6.0.11">

I have Interbase 2022 installed on my local machine including the ODBC driver.

My Windows machine's encoding is set to ShiftJIS (as it is the default Japanese encoding).

The remote Interbase instance does not have an encoding set, which to my understanding means that it should default to the host machine's encoding. The host machine of that remote instance is also set to ShiftJIS.

If I connect to the instance using isql, I can properly retrieve Japanese encoded text.

Since I am having trouble using Entity Framework's ORM functions as described here, I've set up a different endpoint to execute SQL directly which looks like this,

[HttpGet("test")]
        public async Task<string> Test()
        {
            var result = "";
            using (var command = db.Database.GetDbConnection().CreateCommand())
            {
                command.CommandText = "SELECT KANK_NAME1_KANJI FROM KANK";

                db.Database.OpenConnection();

                using (var reader = await command.ExecuteReaderAsync())
                {
                    if (reader.Read())
                    {
                        result = "";
                        result += reader.GetString(0);
                    }
                }
            }
            return result;
        }

If I set the connection string to @"User=USERNAME;Password=password;Database=c:\DATABASE.GDB;DataSource=999.999.999.999 and execute it, I get this result,

�q�j�a���������������

It seems like I'm having some kind of problem with the encoding. I thought that I should perhaps try to specify the encoding inside the connection string, however if I set the connection string to @"User=USERNAME;Password=password;Database=c:\DATABASE.GDB;DataSource=999.999.999.999;Charset=SJIS_0208 and execute it, I get this result

InterBaseSql.Data.InterBaseClient.IBException (0x80004005): arithmetic exception, numeric overflow, or string truncation
Cannot transliterate character between character sets
 ---> arithmetic exception, numeric overflow, or string truncation
Cannot transliterate character between character sets
   at InterBaseSql.Data.InterBaseClient.IBCommand.Fetch()
   at InterBaseSql.Data.InterBaseClient.IBDataReader.Read()
   at PDFCreationMicroservice.Controllers.ContactsController.Test() in C:\Users\micha\source\repos\PDFCreationMicroservice\PDFCreationMicroservice\Controllers\ContactsController.cs:line 35
   at lambda_method5(Closure , Object )
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableObjectResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Awaited|12_0(ControllerActionInvoker invoker, ValueTask`1 actionResultValueTask)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
   at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware.Invoke(HttpContext context)

HEADERS
=======
Accept: text/plain
Host: localhost:7248
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
:method: GET
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,ja;q=0.8,ja-JP;q=0.7
Referer: https://localhost:7248/swagger/index.html
sec-ch-ua: "Google Chrome";v="111", "Not(A:Brand";v="8", "Chromium";v="111"
DNT: 1
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-site: same-origin
sec-fetch-mode: cors
sec-fetch-dest: empty

I tried setting breakpoints and debugging this, and it seems like this is coming from the reader.Read() call. It seems like perhaps the reader is unable to handle ShiftJIS?

jeffovercash commented 1 year ago

The � is expected if the data stored has ordinal values > than what ANSI supports which SJIS can have. While it comes over the wire in bytes as SJIS characters, the client side doesn't know how to convert those bytes to SJIS since SJIS is not in the connection string. So it converts the bytes over the wire using the ANSI encoding to UTF16 for Windows display. Any code points outside the ANSI range gets the � in UTF16.

I do not see any issues with how SJIS is set up when used as the character set in the connection string just reading the code. The error message usually means that the data in the column is too big to be placed in the buffer set aside for the SJIS-converted equivalent. This is difficult to track down often on variable multibyte code points like SJIS is. So when data gets stored as if it is ANSI you can fool IB into storing more bytes than is allowed by definition. This was often used when there was a much smaller index total column size limit in older versions but it has been gone for over a decade.

So a VarChar(10) character set SJIS column could hold up to 20 bytes of data (10 * the maximum size of a character in this case 2). Since the ANSI characters are all 1 byte if you make an ANSI (no character set) connection to IB you could actually store the string '12345678901234567890' into that column because IB only tests for the passed bytes fitting in the max allotted space not if the passed characters match the definition. But when you try to pull it back using SJIS as the character set on the connection string those 20 bytes get converted to SJIS on the server side and it overflowed the buffer of 20 bytes when converted.

While I have a lot of UTF8 unit tests and mostly ANSI, there are none dedicated to SJIS alone. I'll see about trying to write some this weekend and verify things are working right, but what I described above is often what is going on with that error message.

One thing you can try (since you have the data and I don't and this is a data issue somehow) is to make an ANSI connection then create an SJIS Encoding object and convert the bytes of the field to an SJIS string and look at that. That should eliminate those � and you can count the number of characters in the string to look for > the definition.

In general, my recommendation when using a variable multibyte character set like SJIS in a DB is always to set the character set, don't let things go in as ANSI. While just letting it go as ansi works on the machines that have as their default char set locally to Japanese, you run into issues when using a non-Japanese machine and now setting the SJIS char set.

MichaelCharles commented 1 year ago

We have a legacy Delphi application which connected to Interbase (dialect 1), and I'm now writing a newer application in C# which is meant to connect to that same database.

I can confirm that the setup for the legacy application is/was

Server and clients are all Japanese Windows machines using ShiftJIS
Interbase itself is not set to a specific character set
Connect string did not explictly set the character set

For my own personal development environment, I initially had my machine set to Windows/English with the beta UTF8 feature turned on. In an effort to try to increase comptability, I changed my machined to Japanese with the UTF8 feature turned off. It doesn't seem like that has had any effect on what Entity Framework is doing.

MichaelCharles commented 1 year ago

Thanks for that thorough, helpful, explanation by the way. That does seem to be consistent with the legacy application's configuration, and likely is the cause of my current problem.

MichaelCharles commented 1 year ago

One thing you can try (since you have the data and I don't and this is a data issue somehow) is to make an ANSI connection then create an SJIS Encoding object and convert the bytes of the field to an SJIS string and look at that. That should eliminate those � and you can count the number of characters in the string to look for > the definition.

I've been working on this for over a day trying various things, and I haven't made any progress on it.

It seems like the string I'm getting back from the first place is already corrupted in such a way that it is unusable.

I'm attaching a file sjis.txt which is the correct ShiftJIS version of the column I'm trying to retrieve.

sjis.txt

The next file, received.txt is what I'm getting back from the DB/Entity Framework.

received.txt

If I open sjis.txt in something like VSCode, and then reopen it as though it is ShiftJIS, the resulting text is ＲＫＢ毎日放送株式会社.

If I open received.txt and do the same thing, the resulting text is �ｿｽq�ｿｽj�ｿｽa�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ.

Analyizing the files on a byte level, the received string is about 3 times longer than it ought to be.

Screenshot 2023-04-06 103833 Screenshot 2023-04-06 103841

Interestingly, you can see the 71, 6A, and 61 in both string which map to q, j, and a in ANSI.

If tried using the Encoding object to take the received text and make it something readable by doing various variations on this:

byte[] bytesJIS = System.Text.Encoding.Default.GetBytes(stringJIS);

var converted = System.Text.Encoding.GetEncoding("Shift_JIS").GetString(bytesJIS);

But doing so has only ever resulting in the �ｿｽq�ｿｽj�ｿｽa�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ string.

MichaelCharles commented 1 year ago

So, I've been trying the InterBaseSql.EntityFrameworkCore.InterBase library in various environments/setups, and here is what I've found.

Language/Locale/Unicode Setting	.NET Framework	.NET Core
EN/JP/UTF16	✅	❌
EN/JP/UTF8	❌	❌
EN/US/UTF16	⚠️	❌
EN/US/UTF8	❌	❌
JA/JP/UTF16	✅	❌
JA/JP/UTF8	❌	❌

✅ - means it displayed correctly without me doing anything extra ⚠️ - means it displayed correclty after I manually used the Encoding class to read it as ShiftJIS ❌ - means it displayed incorrectly, and was I unable to use Encoding to read it as ShiftJIS (it was in the unusable state described in the previous comment)

So it seems like there's two potential problems: Something to do with using .NET Core, and something to do with using the beta Unicode UTF-8 locale setting on Windows.

jeffovercash commented 1 year ago

Can you attach either the table structure or just a small DB with data in it and a sample app for me? Using just the one line of data found in the sjis.txt file I am not able to reproduce the problem you are seeing on the .NET framework with EN/US/UTF16 (my default setup). It displays fine with no changes using the sample simple grid demo that is included in the installer for the driver. Having at least your sample DB side at minimum eliminates guesswork on my side to how to setup this problem.

Your results for the .NET core is worrisome because if the .NET core pulls the data in bytes and not as strings losing the encoding of those bytes along the way.

MichaelCharles commented 1 year ago

The table in question has contact information (names, addresses, phone numbers) for a large variety of companies. I'm currently talking to coworkers about whether or not I can pass that over, if it needs to be masked, etc.

I'll try to get you something as soon as possible.

jeffovercash commented 1 year ago

I don't need that table itself if you can give me something that demonstrates it so I can work from there. I understand problems with private information.

MichaelCharles commented 1 year ago

https://github.com/mcaubrey/masked-kaikei-backup/blob/main/link

You should be able to reproduce the problem using the backup I made and running the following

select KANK_NAME1_KANJI from KANK ROWS 1

It should come back as 沖縄はすごいよね if it is working correctly, and at least in my case if I try it from .NET Core it comes back as ・ｽ・ｽ・ｽ・ｽﾍゑｿｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ.

Also, I failed to mention this earlier, but I systematically tested both InterBaseSql.EntityFrameworkCore.InterBase and InterBaseSql.Data.InterBaseClient in the way described in the above comment and got this result.

jeffovercash commented 1 year ago

Thanks, I have doctor's appointments all day tomorrow but will look at this on Friday or the weekend. This should help a lot.

MichaelCharles commented 1 year ago

No worries.

I think for now we're going to move forward with a different approach, so there's no rush to get this figured out on our part. We certainly might want to come back to this (this being creating a new .NET project which connects to our old InterBase instance) in the future though, and I will certainly cooperate with trying to figure out what is going on here.

Please take care of your health 👍

jeffovercash commented 1 year ago

Ok, I have not ignored this, but just haven't had a lot to report on it.

Note - I used lc_ctype and character set interchangeably. Character set gets mapped to the lc_ctype in the connection data so more how I think of it, so where you see lc_ctype that is the equivalent of setting the character set in the connection string.

Here is the problem and maybe a solution. The Fb code we based everything on assumed that if there was no code page (a 0 sql_sub_type on the XSQLVar) that what was stored was UTF8. In my experience that is a wrong assumption. Most will have ANSI data that is associated with a corresponding locale (your situation). UTF8 and ASCII (the first 127) overlap perfectly so no issues when it comes to things like system table data cause they are all ASCII-limited (technically).

But when working with ANSI data the code page matters. The key is how the NONE (0 sql_sub_type) is created in the CharSet class.

        public Charset(int id, string name, int bytesPerCharacter, string systemName)
        {
            _id = id;
            _name = name;
            _bytesPerCharacter = bytesPerCharacter;
            _systemName = systemName;
            _isNone = false;
            _isOctets = false;
            switch (_systemName)
            {
                case None:
                    _encoding = Encoding.Default;
                    _isNone = true;
                    break;
                case Octets:
                    _encoding = new BinaryEncoding();
                    _isOctets = true;
                    break;
                default:
                    _encoding = Encoding.GetEncoding(_systemName);
                    break;
            }
        }

What if I gave you a way to indicate you want a different default encoding? That is unset and will continue to be UTF8, but if set will decode your ANSI data in the encoding you know it is. This solution would end up being an all-or-nothing at the reading of data level. So could potentially allow you to change between executions of reads.

A deeper dive of this code over the weeks has shown that the lc_ctype (character set) in the connection string will not impact anything other than if you can set it to utf8, because setting the character set at the connection level does not change the sql_sub_type so that will still result in using the default encoding which is always UTF8 in .NET. I am hoping to update how it decodes data to take into consideration of the connection's lc_ctype, but for now, that is not an immediate priority. UTF8 as a character set only works because the encoding for none is assumed to be UTF8. That assumption is hard-coded, so using others will not really work in the case where you have no character set defined for the column, but have non-ASCII data in that column.

Part of the trouble though is that IB can throw athematic overflow errors on conversions to a different character set. In these cases we know we want Latin1 (iso-8859-1) or SJIS etc, but the server will choke if we force the server to do it through the lc_ctype (this is the case with the other person running into this problem).

So if there is an option to not set the character set, retrieve the data as is, but define what we want the Encdoing to be for string data that is 'NONE' character set defined on the server side would that help you?

If we go that route I am thinking of both a new connectionstring property and a property on the IBConnection itself. That gives the flexibility of doing it once in the connection string as a "set and forget" type of situation while still giving the flexibility of switching it up as needed without needing to reconnect.

Sorry for the wall of text, hopefully, I wasn't too confusing.

jeffovercash commented 1 year ago

Ok update on this. I have changed the default sql_sub_type from assuming 0 is UTF8, to assuming 0 is your locale ANSI encoding. Reminder sql_sub_type 0 means no encoding (character set) is associated with the data. UTF8 works as long as your data is all ASCII range (0-127) above that it doesn't know how to encode from the locale ANSI to UTF8 (hence your weird characters). I am now loading the additional encodings from System.text...CodePages are needed for SJIS/ Latin1 and other potential ANSI encodings. and made sure it is supporting netcore 3.1 (note that is no longer in support by MS and at some point will be dropped from the driver)

I think this is going to resolve most of your conditions, but I am afraid I might need to give you an additional encoding override at the DBField level because you have mixed locales EN/JP JA/JP EN/US. If that is needed I know what I'll do, but if not I don't want muddying the waters more than necessary.

If you want message me and I'll give you instructions on how to pull the latest from my ProGet server and you can verify how it is working for you. I have another who was having this issue with Latin1 characters showing as that diamond ? and their issues are all resolved, but I'd like to also make sure your issue is resolved before we release anything.

Embarcadero / IB.NETDataProvider

Retrieving data from a ShiftJIS Interbase instance #18