Maximus5 / ConEmu

Customizable Windows terminal with tabs, splits, quake-style, hotkeys and more
https://conemu.github.io/
BSD 3-Clause "New" or "Revised" License

WriteConsoleW used with ConEmu duplicates Chinese characters output #945

Open Nelson-numerical-software opened 7 years ago

Nelson-numerical-software commented 7 years ago

Versions

ConEmu build: 161023 x64 stable
OS version: Windows 10 x64 (1607), Microsoft Windows [version 10.0.14959]
Shell: cmd

Problem description

WriteConsoleW duplicates Chinese characters.

Steps to reproduce

Actual results

Output: Traditional Chinese 漢漢字字

Expected results

Original string: Traditional Chinese 漢字

Additional files

Build this code with VS 2015 C++:

#include <windows.h>
#include <string>

int main()
{
    std::wstring msg = L"Traditional Chinese 漢字";
    HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    // WriteConsoleW expects the length in characters as a DWORD.
    WriteConsoleW(consoleHandle, msg.c_str(), static_cast<DWORD>(msg.size()), NULL, NULL);
    return 0;
}

Maximus5 commented 7 years ago

1) Why do you single out WriteConsoleW? Have you checked the result in the RealConsole (Ctrl-Win-Alt-Space)?

2) Please run ConEmuC -checkunicode from ConEmu's prompt and show the result here.

Nelson-numerical-software commented 7 years ago

1] It seems to also be a bug in Windows 10 Insider builds 14959 and 14965. With Windows 10 stable version 1607 and the same ConEmu version (161023), it works.

2] Please note the duplicated characters 中中文文:

ConEmuC -checkunicode
ConEmu 161022 x86
OS Version: 10.0.14965 (2:)
SM_IMMENABLED=1, SM_DBCSENABLED=0, ACP=1252, OEMCP=850
ConHWND=0x00090634, Class="ConsoleWindowClass"
Console font info: 0, {3x5}, 54, 400, "Lucida Console"
Handles: In=x8 (Mode=x1F7) Out=xC (x3) Err=x10 (x3)
Buffer={131,1000} Window={0,0}-{130,35} MaxSize={131,166}
Cursor: Pos={0,9} Size=25% Visible
ConsoleCP=850, ConsoleOutputCP=850
CP850: Max=1 Def=x3F,x00 UDef=x3F
  Lead=x00,x00,x00,x00,x00,x00,x00,x00,x00,x00,x00,x00
  Name="850 (OEM - latin multilingue I)"

123456789也也不不是是可可运运行行的的程程序序112233445566778899 Normal Reverse x7 x4007 Normal:x7 Reverse:x4007

Check AÀÀΑΑ╬╬豈豈AAꊠꊠ黠黠だだ➀ጀะڰЯ09 Text: AÀÀΑΑ╬╬豈豈AAꊠꊠ黠黠だだ➀ጀะڰЯ09 Read: A:x7 ÀÀ:x107 ΑΑ:x207 ╬╬:x107 豈豈:x207 AA:x107 ꊠꊠ:x207 黠黠:x107 だだ:x207 ➀:x107 ጀ:x207 ะ:x107 ڰ:x207 Я:x107 0:x207 9:x107 Blck: A:x7 ÀÀ:x107 ÀÀ:x207 ΑΑ:x107 ΑΑ:x207 ╬╬:x107 ╬╬:x207 豈豈:x107 豈豈:x207 AA:x107 AA:x207 ꊠꊠ:x107 ꊠꊠ:x207 黠黠:x107 黠黠:x207 だだ:x107 Info: 0,1,1,16,1,1,24,1

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦╦══ ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗║ 中中文文 ║中中 文文║╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ ╩╩════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ ══╝ Unicode check succeeded

Maximus5 commented 7 years ago

@miniksa Can you take a look at this? Reported already several times here.

miniksa commented 7 years ago

@Maximus5 I've filed it as MSFT:9751066 internally and assigned to myself. I'm currently in a deep thought on something else, so I'll probably get to it early next week. Thanks for the report.

miniksa commented 7 years ago

I see the issue. There appear to be duplicates coming out of ReadConsoleOutputW/A. I'm not sure what happened there. I'll have to keep investigating, but it looks like it will need a fix on our side once I figure it out.

Maximus5 commented 7 years ago

Perhaps this comes from changes in attribute processing. I noticed some time ago (not sure where exactly) that new Windows builds process the high byte of console attributes "in a proper and better way"... One of the weirdest things in conhost is the COMMON_LVB_LEADING_BYTE/COMMON_LVB_TRAILING_BYTE processing. It works differently on DBCS (Chinese/Japanese/...) Windows distros than on "European" distros. On DBCS systems, when certain CJK codepages are selected, each double-width glyph takes two (or more?) CHAR_INFOs (cells). That never happened on European distros, even if CJK support was installed and these codepages were selected in the console. I can't reproduce this issue on my test Win 10 boxes yet.
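To make the leading/trailing-cell idea concrete, here is a minimal, portable sketch of how a reader of the screen buffer could collapse such cell pairs back into single glyphs. The `Cell` struct is a hypothetical stand-in for CHAR_INFO, and `collapseCells` is an illustrative helper, not ConEmu's or conhost's actual code. The two flag values are the ones from the Windows SDK headers, repeated so the sketch builds without `<windows.h>`; attribute words such as x107 and x207 in the -checkunicode output above look consistent with these flags combined with colour 7.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Flag values as defined in the Windows SDK; repeated here so the sketch
// compiles on any platform.
constexpr std::uint16_t kLeadingByte  = 0x0100; // COMMON_LVB_LEADING_BYTE
constexpr std::uint16_t kTrailingByte = 0x0200; // COMMON_LVB_TRAILING_BYTE

// Hypothetical stand-in for CHAR_INFO: one screen cell.
struct Cell {
    char32_t ch;         // glyph stored in the cell
    std::uint16_t attr;  // colour bits plus COMMON_LVB_* flags
};

// Collapse leading/trailing pairs into single glyphs. A cell is dropped only
// when it is flagged as a trailing half AND it repeats the glyph of a
// leading-half cell immediately before it. Everything else is kept.
std::u32string collapseCells(const std::vector<Cell>& row) {
    std::u32string text;
    bool prevWasLead = false;
    char32_t prevCh = 0;
    for (const Cell& c : row) {
        const bool isTrail = (c.attr & kTrailingByte) != 0;
        if (!(isTrail && prevWasLead && c.ch == prevCh)) {
            text.push_back(c.ch);
        }
        prevWasLead = (c.attr & kLeadingByte) != 0;
        prevCh = c.ch;
    }
    return text;
}
```

With a row read back as A, 漢 (leading), 漢 (trailing), 字 (leading), 字 (trailing), the helper yields "A漢字". It only helps when the flags are reported consistently, which is exactly what is in question in this thread.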

miniksa commented 7 years ago

FYI, I haven't forgotten about this investigation. We've just suddenly got slammed with e-mails and bugs from all sources and so getting to investigating this may take me significantly longer than I originally predicted. I will be back when I get a chance.

miniksa commented 7 years ago

FYI, the fix for this should have just landed with Insider Build 15014 today.

ncihnegn commented 7 years ago

Just tested Build 15014. Not fixed yet.

miniksa commented 7 years ago

Hmmm. Not sure what's up. I'll dig into character handling stuff today.

Maximus5 commented 7 years ago

@miniksa I finally managed to install an Insider build.

First, the expected behavior on the "stable" Win10 build. All glyphs are written and displayed properly, there are no doubled CJK glyphs, and the data fits on screen.

[Screenshot: 2017-01-25_11-08-09]

Now build 15014.

[Screenshot: 2017-01-25_11-12-26]

I'm still checking the results; here are my first notes.

  1. Even though SM_DBCSENABLED is 0, COMMON_LVB_LEADING_BYTE and COMMON_LVB_TRAILING_BYTE are set. Is that intended on a non-DBCS-enabled OS? They were not used previously; only CJK versions of Windows (up to Win 10 14393) used them.
  2. Worse, even conhost itself treats CJK glyphs inconsistently.
    • In some places it shows them (as squares, yep) as if they were double-cell width, in others as single-cell width.
    • When ConEmu writes 80 characters (the console width) on non-CJK Windows, the data is expected to be written without wrapping. That's no longer true. Even in conhost's window we can see that only 77 characters (I counted them) were written under the frame (the line with three CJK glyphs).
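The 77-versus-80 count is what one would expect if the three CJK glyphs each consume two cells. A small sketch of that arithmetic follows; `cellWidth` and `charsThatFit` are hypothetical helpers, and the width rule is a deliberately rough assumption (a few common CJK blocks count as double-cell), not the table conhost actually uses.

```cpp
#include <cstddef>
#include <string>

// Rough, assumed width rule for illustration only: a few common CJK blocks
// take two cells, everything else takes one.
inline int cellWidth(char32_t ch) {
    const bool wide =
        (ch >= 0x3040 && ch <= 0x30FF) ||   // Hiragana, Katakana
        (ch >= 0x4E00 && ch <= 0x9FFF) ||   // CJK Unified Ideographs
        (ch >= 0xAC00 && ch <= 0xD7A3);     // Hangul Syllables
    return wide ? 2 : 1;
}

// Number of leading characters of `text` that fit into `columns` cells
// without wrapping.
std::size_t charsThatFit(const std::u32string& text, int columns) {
    int used = 0;
    std::size_t count = 0;
    for (char32_t ch : text) {
        const int w = cellWidth(ch);
        if (used + w > columns) {
            break;  // the next glyph would spill past the right edge
        }
        used += w;
        ++count;
    }
    return count;
}
```

For an 80-character line that contains three glyphs from those ranges, `charsThatFit(line, 80)` returns 77, which matches the count reported above. If every glyph is treated as single-cell, all 80 characters fit.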

Finally, there are drawing bugs during selection in conhost's window. I selected cells one by one with the mouse. Cells have unexpected widths during selection, and, strangely, the line below the selection is broken while selecting.

[Screenshot: win10-selection]

Maximus5 commented 7 years ago

@miniksa The API is inconsistent... WriteConsoleOutputAttribute, WriteConsoleOutputCharacter, ReadConsoleOutputCharacter, ReadConsoleOutputAttribute, ReadConsoleOutput... Some functions treat CJK glyphs as normal single-cell glyphs (WriteConsoleOutputCharacter, ReadConsoleOutputCharacter). Some functions return COMMON_LVB_LEADING_BYTE/COMMON_LVB_TRAILING_BYTE and therefore double cells (ReadConsoleOutputAttribute, ReadConsoleOutput). Some functions have undefined behavior (after WriteConsoleOutputAttribute followed by WriteConsoleOutputCharacter, glyphs are "written" after the cells that were filled with attributes). This is all on a non-CJK Insider Win 10.

miniksa commented 7 years ago

Yeah, I was finding bad behavior like this yesterday as well. Part of the deal is that it behaves differently with Raster Fonts vs. TrueType fonts as well. I'll probably be spending the rest of the week trying to fix this up and make it consistent. I don't know what SM_DBCSENABLED is/does. Console's DBCS check has always been based on the active code page (whether it equals 932, 936, 949, or 950), not that system metric.
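As a rough illustration of the code-page-based check described above (a sketch only, not the console's real implementation), the test reduces to comparing the active code page against the four values named. `isCjkCodePage` is a hypothetical name.

```cpp
// The four code pages named above: 932 (Japanese), 936 (Simplified Chinese),
// 949 (Korean), 950 (Traditional Chinese).
constexpr bool isCjkCodePage(unsigned codePage) {
    return codePage == 932 || codePage == 936 ||
           codePage == 949 || codePage == 950;
}
```

On Windows, a caller would typically pass the value returned by GetConsoleOutputCP(). For the reporter's console (ConsoleOutputCP=850) the check is false, so under this rule no double-cell handling would be expected there.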

I'll try to keep you posted as I figure this out. Sorry about that. A few of us have been working on trying to fit UTF-8 support into the console (not done yet) and it appears to have messed up quite a few DBCS routes.

Maximus5 commented 7 years ago

I used to check GetSystemMetrics(SM_DBCSENABLED), which was 1 only on Windows installations built for China, Japan, and Korea (CJK). If SM_DBCSENABLED returned 0, CJK glyphs used only one cell in conhost, regardless of the codepage. That was true before. Now it is broken or changed. What is the correct behavior?

miniksa commented 7 years ago

I'll have to get back to you on that. Everything you are telling me about SM_DBCSENABLED is 100% new information to me. I don't really know if that particular metric used to be a part of the console code in XP/Vista/7/8. I can look. I also don't know what in the system turns that metric on or off.

From what I know about the console from Win 8.1 to today, the console always did its conversions and width calculations based on code page. It's just that prior to recently, it used to prohibit changing into a CJK codepage unless your system's non-Unicode region was set to a CJK language (Control Panel-->Region-->Administrative-->Language for non-Unicode programs). I've been trying to remove that restriction to allow anyone to swap into any codepage no matter their "non-Unicode region" because in today's editions of Windows (as opposed to the CJK-specific ones of the 1990s), you can add just about any language pack and IME and font to any language edition of Windows, so the "non-Unicode" region doesn't really matter like it used to several decades ago.

My plan is:

miniksa commented 7 years ago

So I've got through 1, 2, and 3 in MSFT: 10187355 which is checked in and will start shipping up to Insiders builds. Probably be there in a few weeks. I've basically restored the console's behavior to the same as what it was for the legacy console. If it works against the console with the legacy box checked, it will work again against the updated one once the Insider build updates.

For part 4, I'm still working on it. I basically need to write up the way that the v1/legacy console did it and publish that.

rprichard commented 7 years ago

@miniksa @Maximus5 FWIW, this VSCode/winpty issue seems related: https://github.com/Microsoft/vscode/issues/19665. ConEmu is broken in exactly the same way (screenshot in this comment, https://github.com/Microsoft/vscode/issues/19665#issuecomment-287248500). I wrote a test case demonstrating the new (broken?) behavior as of Win10 v15048.

bao-qian commented 7 years ago

Hi, I had no such problem on the previous Windows build (15063.413) with Simplified Chinese. I only noticed the issue after the latest stable Windows build, 15063.447, rolled out. The alpha build works almost fine with the new console:

[Screenshot]

The stable and preview builds work fine with the legacy console:

[Screenshot]

faiz-lisp commented 4 years ago

I tried Chinese on the UTF-8 version of newLISP. https://github.com/kosh04/newlisp/blob/develop/nl-utf8.c It works well.

https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80 (I hope it helps.)