Maximus5 / ConEmu

Customizable Windows terminal with tabs, splits, quake-style, hotkeys and more
https://conemu.github.io/
BSD 3-Clause "New" or "Revised" License

4 bytes unicode char are not handled properly. #2128

Open deadalnix opened 4 years ago

deadalnix commented 4 years ago

Versions

ConEmu build: 200604 x64
OS version: Windows 10 19041 x64
Used shell version (Far Manager, git-bash, cmd, powershell, cygwin, whatever): cmd

Problem description

When using Unicode characters that are 4 bytes in size, ConEmu seems to corrupt the output in some way. Not only is the character not displayed properly (which I don't really care about, tbh), but copying anything that contains such a character results in corrupted data in the clipboard.

Interestingly, opening the "real" terminal via ctrl+win+alt+space shows that it is also unable to display the character, but copying from it puts the right data in the clipboard.

This seems to indicate that, in ConEmu's case, the problem isn't actually with displaying the character, but that something deeper is going on.

Steps to reproduce

Some bash utilities are handy to demonstrate the problem, so I will use WSL's bash, but the problem exists for the regular command line, PowerShell, or anything else.

$ bash
$ printf 🏃 | hexdump
0000000 9ff0 838f
0000004

Now, if we copy the command to the clipboard and paste it anywhere, we can see that the Unicode character was corrupted like this: printf � | hexdump

While the character is not displayed properly, this works as expected on the "real" console.

Pasting back into the shell, we get the following output:

$ printf � | hexdump
0000000 bfef 00bd
0000003

This happens with any 4-byte character; shorter characters seem to work just fine.

For instance:

printf ☠ | hexdump
0000000 98e2 00a0
0000003

Copying and pasting works properly, and the character is also displayed properly.

Actual results

4-byte Unicode characters are corrupted before being displayed, which prevents proper display as well as copying and pasting.

Expected results

4-byte Unicode characters should not be corrupted, and copy/paste should just work(tm).

Bonus points if they can be displayed properly, but this is an entirely different problem and may very well just work as soon as the right bytes are pushed onto the on-screen buffer.

Additional files

To make sure nothing interferes, I made a fresh ConEmu install and left the config at its defaults.

Maximus5 commented 4 years ago

Copy/paste from an HTML page is not reliable. Have you tried the same with PowerShell, running standalone, without ConEmu?

The first thing I noticed is that your third example (☠) contains 00a0 (line feed), while the first and second have no line feed character. So I'm not at all sure where the problem is.

I need some more precise and reproducible tests.

deadalnix commented 4 years ago

The native console does not display the chars properly, but it does copy/paste properly.

Considering we are communicating over the web - and yes, that is unreliable - what do you suggest I use to provide you with reproducible steps?

PS: thanks for looking into this.

Maximus5 commented 4 years ago

I did some investigation, and to me it looks like a bug in the Windows console (conhost). A simple test is attached; run it from the console via pwsh.exe -command print-unicode.ps1.

print-unicode.ps1.zip

And the output looks like

image

If I try to paste the glyph "🏃" into the native console prompt, there is even more mess

image

Windows 10 1909 (10.0.18363.836)

Maximus5 commented 4 years ago

I consider this a Windows bug which ConEmu can't mitigate by itself.

Sample C++ test; only the WriteConsoleOutput function works properly

#include <windows.h>
#include <cwchar> // wcslen

int main()
{
    const auto hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    // \xD83C\xDFC3 is the UTF-16 surrogate pair for U+1F3C3 (🏃)
    const wchar_t writeConsole[] = L"WriteConsole: --\xD83C\xDFC3--\n";
    const wchar_t writeConsoleChars[] = L"WriteConsoleCharacters: --\xD83C\xDFC3--";
    const wchar_t writeConsoleBuffer[] = L"WriteConsoleBuffer: --\xD83C\xDFC3--";
    CONSOLE_SCREEN_BUFFER_INFO si = {};
    DWORD written;

    // 1. Stream the text at the current cursor position
    WriteConsoleW(hOut, writeConsole, static_cast<DWORD>(wcslen(writeConsole)), &written, nullptr);

    // 2. Write characters directly into the screen buffer at the cursor position
    GetConsoleScreenBufferInfo(hOut, &si);
    WriteConsoleOutputCharacterW(hOut, writeConsoleChars, static_cast<DWORD>(wcslen(writeConsoleChars)), si.dwCursorPosition, &written);
    ++si.dwCursorPosition.Y;
    SetConsoleCursorPosition(hOut, si.dwCursorPosition);

    // 3. Write a row of CHAR_INFO cells, one UTF-16 code unit per cell
    CHAR_INFO bufferData[80] = {};
    for (size_t i = 0; writeConsoleBuffer[i]; ++i)
    {
        bufferData[i].Char.UnicodeChar = writeConsoleBuffer[i];
        bufferData[i].Attributes = 7; // light gray on black
    }
    // Explicit casts: braced initialization does not allow narrowing to SHORT
    const COORD bufSize = {static_cast<SHORT>(wcslen(writeConsoleBuffer)), 1};
    const COORD bufCoord = {};
    SMALL_RECT writeCoors = {si.dwCursorPosition.X, si.dwCursorPosition.Y, static_cast<SHORT>(si.dwCursorPosition.X + bufSize.X - 1), si.dwCursorPosition.Y};
    WriteConsoleOutputW(hOut, bufferData, bufSize, bufCoord, &writeCoors);
    ++si.dwCursorPosition.Y;
    SetConsoleCursorPosition(hOut, si.dwCursorPosition);

    return 0;
}

output

WriteConsole: --�--
WriteConsoleCharacters: --�--
WriteConsoleBuffer: --🏃--

PS. In theory, the problem could be mitigated after switching to the PTY API.

Maximus5 commented 4 years ago

@miniksa, @zadjii-msft could you please check the problem from your side?

deadalnix commented 4 years ago

Thanks for the investigation!

I had to modify my workflow on my end to work around that problem. It's not ideal, but it's livable. Is there a way to switch to the PTY API on my end? Or is it something that would require significant refactoring on ConEmu's end?

PS: while the native console displays garbage too on my end, I can copy from the console and paste somewhere else, and it pastes the right stuff. Not sure by what magic this happens, but it would be a great usability plus for me if that worked.