TaoK / PoorMansTSqlFormatter

A small free .Net and JS library (with demo UI, command-line bulk formatter, SSMS/VS add-in, notepad++ plugin, winmerge plugin, and demo webpage) for reformatting and coloring T-SQL code to the user's preferences.
http://www.architectshack.com/PoorMansTSqlFormatter.ashx
GNU Affero General Public License v3.0

A-tilde (ã) and a-umlaut (ä) characters incorrectly handled in Notepad++ plugin #160

Open TaoK opened 7 years ago

TaoK commented 7 years ago

Felipe Gualberto points out that Portuguese tilde-carrying characters are being mangled by the Poor Man's T-SQL Formatter plugin in Notepad++:

select * from Mão where Name = 'Felipe'

becomes

SELECT *
FROM M[xC3][xA3]o
WHERE NAME = 'Felipe'

Here the square brackets indicate Notepad++'s special "unrecognized binary sequence" white-on-black formatting.
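
For reference, the two bracketed bytes are exactly the UTF-8 encoding of ã (U+00E3). A minimal Python sketch, purely illustrative and not taken from the plugin (which is .NET), shows this:

```python
# Illustrative only: inspect the UTF-8 bytes behind the table name "Mão".
table_name = "M\u00e3o"  # "Mão", with ã written as a precomposed code point
utf8_bytes = table_name.encode("utf-8")

# One byte for "M", two bytes for "ã" (U+00E3), one byte for "o".
print(utf8_bytes.hex(" ").upper())  # prints: 4D C3 A3 6F
```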

Interestingly, other international characters seem to work fine, as in these Arabic and Chinese examples:

SELECT *
FROM العَرَبِيَّة
WHERE NAME = 'Felipe'

SELECT *
FROM 漢字
WHERE NAME = 'Felipe'

There is definitely nothing special about ã in any Poor Man's T-SQL Formatter code, so this suggests there may be an issue in Notepad++ or Scintilla causing this behavior...?

FelipeCostaGualberto commented 7 years ago

Thanks, Tao. I can confirm this on every pt-BR system I have, with default settings and a default installation of the plugin.

TaoK commented 5 years ago

It looks like #217 has a possibly-more-complete description of probably-the-same-issue

TaoK commented 5 years ago

I'm removing the "duplicate" label, because I can reproduce this and I can't reproduce the issue reported in #217, but I do get something useful from that other issue: The issue only occurs if the document is not set to "Encode in ANSI". As far as I can tell, all other encodings produce the issue reported, but "ANSI" does not...

FelipeCostaGualberto commented 5 years ago

> The issue only occurs if the document is not set to "Encode in ANSI".

That was really helpful, thanks Tao!

TaoK commented 5 years ago

I've made some progress in understanding what's been happening here, although it seems like a major mess.

It looks like the C++ to .Net interop machinery in use here, when Scintilla ends up feeding a buffer into a .Net StringBuilder, doesn't put Unicode characters into the resulting StringBuilder, but rather bytes.

Most of the time no one notices, because the formatter only "reacts" to simple ANSI characters (which are the same whether read as a single byte or as a UTF-8 sequence) and treats everything else rather simply/naively. Most importantly, when these nonsense strings are fed back into Scintilla, it interprets them as byte sequences again, and everything "washes out".
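
The "washing out" described above can be sketched in Python. This is an illustration only, not the plugin's code: it assumes the interop layer turns each byte into the character with the same code point (modeled here with Latin-1), which is a guess at the behavior, and both helper names are hypothetical. It shows why an untouched byte-per-character string survives the round trip; it does not reproduce the ã corruption itself.

```python
def bytes_as_chars(buffer: bytes) -> str:
    """Model the suspected interop behavior: one char per byte (assumption)."""
    return buffer.decode("latin-1")


def chars_as_bytes(text: str) -> bytes:
    """Model Scintilla reading the returned string back as raw bytes."""
    return text.encode("latin-1")


original = "SELECT * FROM \u6f22\u5b57"  # table name is 漢字
scintilla_buffer = original.encode("utf-8")

# The "nonsense string": every UTF-8 byte has become its own character.
mangled = bytes_as_chars(scintilla_buffer)

# Fed back unchanged, the same bytes come out and decode to the original text.
restored = chars_as_bytes(mangled).decode("utf-8")

print(len(original), len(mangled))  # prints: 16 20
print(restored == original)         # prints: True
```

If anything in between changes or re-encodes those stand-in characters, the bytes no longer line up; whether that is what happens to ã is not established in this thread.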

This mess happens in formatSqlCommand() in PoorMansTSqlFormatterNppPlugin/Main.cs, and I'm working on it. Been distracted by some Visual Studio issues over the last couple of days, but coming back to it now.

(To be clear: I don't know whether it's Scintilla misbehaving here, or the NPP .Net plugin bridge, or just something stupid that I personally am doing in this code.)

mvbentes commented 3 years ago

I came to report that this was happening to me, but found this open issue. Adding my me too!

Thank you for all the effort put into this great tool!