StevanWhite opened this issue 7 years ago
I tried it also with the beta version ConsoleZ.x64.1.18.2.17256, with similar results.
I selected the text in the window and copy-pasted it into Notepad++. Attached is the resulting file. The unknown-character and dotted-circle glyphs are evident.
Hey. Is the text being chopped up into buffers before being fed to the underlying rendering layer? That would have an effect like what we're seeing. If the text has to be buffered, it must be broken at white space!
ConsoleZ displays characters from the console buffer. Generally, a strange character comes either from the font or from what is in the console buffer.
Here, it seems that cmd.exe and the 'type' command have a curious behaviour. With PowerShell and 'Get-Content udhr_hin.txt', the result seems correct. Can you confirm?
Hi, Christophe,
OK, it may be some environment problem. I still don't know what.
I can view the file with PowerShell, but I have to specify the encoding: Get-Content -Encoding UTF8 udhr_hin.txt; otherwise it produces 8-bit garbage. This means something is different between your setup and mine.
With it, the results look much better, but still not perfect.
And it's strange: the display has a glitch, only in the very last line of the file! However, when I select the console contents and copy-paste it into the text editor, the glitch is gone. See attached -- there are dotted circles on the right-hand side.
What could this be? (Bear in mind, I have never had any luck displaying complex scripts in Windows consoles, so I have no -- good -- experience to draw on.)
A colleague in India is telling me that on her system, it looks good. I'm looking into that.
My colleague gets perfect display from the DOS 'type' command. I have started from the beginning, experimenting with 'chcp'. It doesn't matter in PowerShell -- only the -Encoding switch makes a difference.
We are both using Windows 10, but mine is a German system, and hers is English, as near as I can tell.
Can you recommend a command that would list relevant environmental settings?
The mail or something removed the image of my console, showing the current state on my machine. I said there are dotted circles on the right -- I meant the other right -- the left side.
I hope you have some idea how we can debug our settings -- I think it would be a good FAQ topic.
dotted circles appear when you cut a word (line return)
My colleague gets perfect display from the DOS 'type' command.
So you have more knowledge than I do to resolve this issue. Does your colleague use ConsoleZ or not?
Check the Windows Console font used by your colleague: use View/Console Window to show the Windows console, then right-click on the Windows console caption --> Properties --> Font tab.
What I can see by googling:
"you cannot type Hindi without Unicode"
"Hindi has no ANSI codepage"
"there is no monospace font"
Hi Christophe,
I am working on this, and I have more to report, but a couple of things aren't clear to me. I'll write about that separately -- here I'll just answer your remarks and questions.
" dotted circles appear when you cut a word (line return) "
Not exactly. They happen if you cut a word before a mark character which must be applied to the character that precedes it. And this is the root of the remaining problem: some software layer is incorrectly breaking up the text before the font rendering layer gets it. What is unclear to me is which software layer is responsible for the problem.
" Your colleague use ConsoleZ or not? "
We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly. When she uses ConsoleZ, she gets the same results that I have seen.
" What I can see by googling: "
" you cannot type Hindi without Unicode " Unicode was invented as a means of supporting all writing systems. The Devanagari writing system used by Hindi is one of those.
" Hindi has no ANSI codepage " Not by itself. The correct codepage to use for Hindi and other Indic languages is 65001, UTF-8.
" there is no monospace font " As you can see in my images, we do have a monospace font that supports Devanagari. I can give you access to it, if you like.
But note: the main problem that we see in ConsoleZ is independent of the font. This is a software bug, either in ConsoleZ or in Windows software.
Here is a shorter text file, just the last section of that UDHR file: udhr30.txt
Here is how it looks using the Windows bash console on the same system (same font). There are no dotted circles here.
But notice the text output routines are automatically wrapping this text, so that no word is broken at the line endings.
What layer is responsible for this?
That same text in the same ConsoleZ window, using PowerShell and bash.
Note that the results are identical -- they both break the text at the wrong point. This means that the PowerShell 'get-content' is not to blame for this problem.
We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly. When she uses ConsoleZ, she gets the same results that I have seen.
Clarification
When I run cmd in ConsoleZ and 'type' the text, some characters are displayed in error.
When I run bash from WSL in ConsoleZ and 'cat' the text, it displays OK.
So the difference could also be between 'type' and 'cat'.
I will check whether the error happens in exactly the same place, and check the Unicode code points of those words.
I tried copying the words that get the error and looking at their code points in https://r12a.github.io/apps/conversion/. I find no difference, so probably the copy-and-paste method is dropping the offending character.
I tried looking at same text file in windows console directly in windows 10 and then also looked at the same via consolez in windows 10. Here are the results with images attached.
Windows console - Courier New - Indic text shown as boxes
Windows console - with Steve's monospace font - input text rendered in Devanagari, but the positioning of combining marks is incorrect
Windows console run via ConsoleZ - same Indic text - rendered correctly, but certain characters show up as ?, diamonds, etc.
PowerShell run via ConsoleZ - Indic text displays correctly
WSL bash under Windows - run via ConsoleZ - Indic text displays correctly
In the above post, where I have said the Indic text is rendered correctly, I have not included the 'dotted circle' issue - that can probably be avoided by adding line breaks at word boundaries rather than in the middle of a word.
Hi, Shree, A few explanations of the effects you showed. There are several things going on, and I am just now coming to understand them myself.
1) The Courier New font has no support for Devanagari. The boxes you see are normal behavior when the glyph is not found.
2) In Windows terminals, the default encoding may not be UTF-8. When the encoding is set to some other value, wrong characters will appear. Some of your examples show Western letters being displayed instead of Unicode Devanagari -- the chcp 65001 command fixes that, setting the encoding to UTF-8.
3) I suspect that the old DOS commands such as "type" do bad things to multi-byte (e.g. Unicode) text. This is where you see the diamond-question (Unicode "replacement", U+FFFD) character.
4) The dotted circle appears when a mark character has failed to apply properly to a preceding character. This will happen in normal Hindi text if words are somehow broken at the wrong point, either by some coding oversight (incorrect buffering) or by a poor algorithm for wrapping text.
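As an aside on point 2: as I understand it, chcp 65001 just switches the console's output code page, something a console program can also do for itself through the Win32 API. A minimal sketch (purely illustrative -- my own code, not ConsoleZ's):

```cpp
// Sketch: the programmatic equivalent of "chcp 65001".
// It changes only how narrow (char-based) output is interpreted;
// it does not change the font or the shaping of Devanagari.
#include <windows.h>
#include <stdio.h>

int main() {
    UINT before = GetConsoleOutputCP();          // e.g. 850 or 437 on a Western setup
    printf("console output code page: %u\n", before);

    if (SetConsoleOutputCP(CP_UTF8)) {           // CP_UTF8 == 65001
        printf("now: %u (UTF-8)\n", GetConsoleOutputCP());
    }
    return 0;
}
```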
Christophe, I want to say, you are almost there, as far as display of Indic text is concerned.
There are two things for you to consider for your product.
1) Encoding
The default behavior of Windows consoles regarding encoding is widely considered to be a bug. For instance: https://stackoverflow.com/questions/22349139/utf-8-output-from-powershell
Our systems are already set for UTF-8 locale, we have set chcp 65001, and yet still in PowerShell, I had to explicitly set the encoding of Get-Content in order to obtain the right output.
Once the encoding is set, as with the default system locale, it should just work. There should be no need to re-state the encoding, as we have done. If you can figure out the right thing to do, you will have a terminal that is far superior to anything else on Windows.
2) Line wrapping of Indic text
For the purpose of reading text, it is very bad to show the normally-hidden characters used to compose Indic words, or to mis-place vowel marks. Both of these are happening when you split a word at the edge of the screen.
There are a couple of options.
i) break lines only at white-space.
This is an easy solution.
For display of normal text, it is probably the best-looking option. It wreaks havoc in applications that don't expect this behavior (e.g. text editors such as vim).
I would suggest an application option for white-space line wrapping. That would settle the problem for most users. (A rough sketch of this kind of wrapping follows below.)
ii) in a monospace environment, break after the last rendered character.
I haven't seen this done. If it's possible at all, this would be the preferred solution for some purposes.
There may be special font-rendering API calls that facilitate this. If not, there may be an algorithm for finding the best place to break a word. It would take some thought -- although I can imagine a couple of ways this could be done.
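To make option i) a little more concrete, here is the kind of wrapping logic I have in mind -- a rough sketch only, in C++, not tied to any ConsoleZ internals. It breaks at ASCII spaces and counts wchar_t units as columns, which is not right for combining marks or wide cells, but it shows how small the core idea is:

```cpp
// Naive whitespace wrapping: split a logical line into rows no wider than
// `columns`, preferring to break at the last space before the limit.
// A real implementation would have to measure grapheme clusters / cells,
// not individual wchar_t values, for Indic scripts.
#include <string>
#include <vector>

std::vector<std::wstring> wrapAtWhitespace(const std::wstring& line, size_t columns) {
    std::vector<std::wstring> rows;
    size_t start = 0;
    while (line.size() - start > columns) {
        size_t window_end = start + columns;
        size_t brk = line.rfind(L' ', window_end);   // last space inside the window
        if (brk == std::wstring::npos || brk <= start) {
            // No usable space: hard break, exactly what consoles do today.
            rows.push_back(line.substr(start, columns));
            start = window_end;
        } else {
            rows.push_back(line.substr(start, brk - start));
            start = brk + 1;                          // skip the space we broke on
        }
    }
    rows.push_back(line.substr(start));
    return rows;
}
```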
Cheers!
Our systems are already set for UTF-8 locale, we have set chcp 65001, and yet still in PowerShell, I had to explicitly set the encoding of Get-Content in order to obtain the right output.
Your file has no BOM, so there is no indication concerning the encoding. By default, Get-Content will choose System.Text.Encoding.Default (aka the operating system's current ANSI code page).
Once the encoding is set, as with the default system locale, it should just work.
You should contact Microsoft to debate this. PowerShell is based on .NET and is Unicode only. chcp configures the code page for ANSI applications. Why does .NET use the operating system's ANSI code page instead of the current console code page as the default encoding? I don't know.
ConsoleZ reads from the Win32 console buffer. Console applications (shells or not) write into the Win32 console buffer.
The Win32 console buffer is an array (columns/lines) of characters (with attributes such as color): there is no indication of line breaking. Writers are responsible for breaking lines and for the encoding.
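For reference, here is roughly what a reader of that buffer sees through the Win32 API -- a minimal sketch, illustrative only and not ConsoleZ code. Each cell is a CHAR_INFO holding one UTF-16 code unit plus color attributes, and nothing else; there are no words or line breaks in this data.

```cpp
// Sketch: read a rectangle of the Win32 console screen buffer.
// Each cell is a CHAR_INFO: one UTF-16 code unit plus color attributes.
#include <windows.h>
#include <vector>

std::vector<CHAR_INFO> readTopRows(HANDLE console, SHORT columns, SHORT rows) {
    std::vector<CHAR_INFO> cells(static_cast<size_t>(columns) * rows);
    COORD bufferSize = { columns, rows };
    COORD bufferOrigin = { 0, 0 };
    SMALL_RECT region = { 0, 0, static_cast<SHORT>(columns - 1),
                          static_cast<SHORT>(rows - 1) };
    // The call fills `cells` with whatever the writer put into the grid.
    if (!ReadConsoleOutputW(console, cells.data(), bufferSize, bufferOrigin, &region)) {
        cells.clear();
    }
    return cells;
}
```

(The console handle would typically come from GetStdHandle(STD_OUTPUT_HANDLE) or from opening CONOUT$.)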
Hi Christophe,
I understand your objections. But let's distinguish between implementation details and how the thing should work.
We are aware of the BOM issue. That is a non-standard Microsoft-only convention. The indication of the encoding ought to be taken from a default, which should be somewhere in the environment. This is failing to happen. Whose fault it is -- is immaterial to the user. The question is, can you do anything to improve the situation? That is still not clear to me.
The encoding functionality is now broken -- it is not just my opinion. It causes a lot of trouble for console users generally, and it impacts your users too. How it should work is: the default encoding should be set when the console app is launched, and cmdlets, or programs, launched within it should inherit the encoding. Maybe it is really impossible to do anything about it at the console level. If so, it would be helpful to document that somewhere, together with recommended work-arounds. (Documentation alone would set you apart from most of the competition!)
The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.
Tell me: what software is responsible for breaking the line at the window edge? The "writer"? The font rendering layer? Or ConsoleZ? That software layer is the only one that can be responsible for breaking the text at a reasonable point.
That is a non-standard Microsoft-only convention.
Microsoft did not invent the BOM; the Unicode standard does. I don't understand how a setting in the environment could guarantee that all the files you read are encoded in UTF-8.
The question is, can you do anything to improve the situation?
As I explained: nothing. ConsoleZ doesn't read the file. ConsoleZ reads the Unicode characters from the Win32 console buffer. These characters are written by shell commands or console applications.
If you want an overview of how the Windows console works, you can consult the MSDN Win32 console functions documentation.
The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.
No, this is not a detail. This is how ConsoleZ works, and it is the only way to interact with the Windows console.
Tell me: what software is responsible for breaking the line at the window edge?
A console has a fixed number of columns. When you have filled a line, the next character is written at the beginning of the next line. This is the same rule for every language.
As you can see in my images, we do have a monospace font that supports Devanagari. I can give you access to it, if you like.
Yes, I would like to have the name or a link to a monospace font that supports Devanagari.
Hi again, Christophe,
I'm afraid we're misunderstanding one another on several points.
That is a non-standard Microsoft-only convention.
Microsoft did not invent the BOM; the Unicode standard does.
I didn't say Microsoft invented the BOM. I meant, only Microsoft products use the initial BOM to indicate encoding. It is not a practice recommended by the standards. https://en.wikipedia.org/wiki/Byte_order_mark
I don't understand how a setting in the environment could guarantee that all the files you read are encoded in UTF-8.
It does not. But we aren't talking about guarantees. We were talking about the "default" encoding.
The effect, for example in unix-y systems, is that plain text with no other indication of encoding is treated as if it were of the default encoding. For example, most unix-y systems these days use UTF-8 as the default encoding. If I open most text files on my system with a text editor, they assume that encoding, and display the text well.
Of course, sometimes I get a file that isn't UTF-8, and I have to figure out the encoding, and tell the editor explicitly what encoding to use. But that is beyond the scope of a console -- rather, what you call a "writer" would have to be told to use a non-default encoding.
The question is, can you do anything to improve the situation?
As I explained: nothing. ConsoleZ doesn't read the file. ConsoleZ reads the Unicode characters from the Win32 console buffer. These characters are written by shell commands or console applications.
I never said or thought that ConsoleZ reads the file. I'm afraid you've missed my point.
The idea is to somehow instruct the shell commands/applications to use the default encoding, before they write. This has nothing to do with the buffer -- by that time, it's too late. In unix-y systems, the default encoding is set in the environment, and it works right. I understand that Windows is much more complicated in this regard -- and maybe it is simply broken, from what I read. But that doesn't mean there is no way to get around it.
I only ask that you try to understand the problem, and keep your mind open to any solution that you come across.
If you want an overview of how the Windows console works, you can consult the MSDN Win32 console functions documentation.
Well maybe this is the root of our misunderstanding. All the Microsoft consoles are broken regarding encoding. That does not mean that all console programs that run in Windows must be broken.
Depending on how much of the Windows APIs your software uses, you might inherit the same problems that the Microsoft programs have. That is a matter of the programmer's choice.
The stuff about Win32 buffer is an implementation detail, which is important for you -- not so much for your users.
No, this is not a detail. This is how ConsoleZ works, and it is the only way to interact with the Windows console.
I'm afraid I have failed to communicate what I mean by "implementation detail". I'll try again.
Your user wants to use your product to look at an Indic-language file. Once the issue of encoding is settled, they do see some Indic text, but there's junk in it. They know how it should look, but that's not what they see.
There are two questions:
1) is it at all possible to display the text so it can be properly read? 2) do you want to go to the trouble?
When you talk about "Win32 buffer" and "Windows Console", you're talking about the particular APIs and services on the system that you employ to display the text. Of course, as a programmer, you could choose to use those APIs and services, or you could write the whole thing from scratch. These are programming decisions, and that's what I mean by "implementation detail".
You write as though you identify "ConsoleZ" with this particular mechanism for displaying text. Well, it is your product, and you can define it as you like. But this does not mean it is impossible to display the text some other way.
If you insist that ConsoleZ is a program that displays text on the screen using certain system APIs and services, you may indeed be stuck. On the other hand, if you shoot for the goal of displaying text as the user would like it, you may have to abandon some ways of doing things. Some people would love a challenge like that, others would prefer to stick with what they have. It's a choice.
You might also double your number of users. (Which brings up another question: do you want more users?)
Tell me: what software is responsible for breaking the line at the window edge?
A console has a fixed number of columns. When you have filled a line, the next character is written at the beginning of the next line. This is the same rule for every language.
I suggested that your product could be given a mode, in which it does something smarter than that, for example, to optionally break the incoming text at the last white space, and shove the remaining text on the next line. (There are other possibilities, but as I said, this is the simplest.)
This isn't a matter of possibility -- of course a programmer can find a way to achieve that effect -- it's a matter of your choice. If you want to do it, and if you are a programmer, you can do it.
As you can see in my images, we do have a monospace font that supports Devanagari. I can give you access to it, if you like.
Yes, I would like to have the name or a link to a monospace font that supports Devanagari.
I will send you a link.
Thanks again!
I meant, only Microsoft products use the initial BOM to indicate encoding.
false
For example, most unix-y systems these days use UTF-8 as the default encoding.
This is not a reason to presume that all files are UTF-8. Even under unix-y systems, the console's encoding settings indicate only what an application should expect when reading input data and what it should write to stdout/stderr. That's all.
Windows is a Unicode OS (strings are encoded in UCS-2, a subset of UTF-16 where each character takes 2 bytes in memory). When a text file is read by an application, the file's content must be converted to UCS-2.
Unix-y systems mostly use the C char encoding. A text file can be read like a binary file. Most applications will process data without worrying about the file's encoding.
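To make that conversion step concrete, here is a minimal sketch of what a Windows "writer" has to do with a UTF-8 file -- illustrative only, not code from ConsoleZ or PowerShell: convert the bytes to UTF-16 with MultiByteToWideChar and hand them to WriteConsoleW. WriteConsoleW takes UTF-16 directly, so for this path the console code page no longer matters.

```cpp
// Sketch: write UTF-8 bytes (e.g. read from a file) to the console as UTF-16.
#include <windows.h>
#include <string>
#include <vector>

bool writeUtf8ToConsole(const std::string& utf8) {
    int needed = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                     static_cast<int>(utf8.size()), nullptr, 0);
    if (needed <= 0) return false;

    std::vector<wchar_t> utf16(static_cast<size_t>(needed));
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
                        utf16.data(), needed);

    DWORD written = 0;
    return WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), utf16.data(),
                         static_cast<DWORD>(utf16.size()), &written, nullptr) != 0;
}
```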
Well maybe this is the root of our misunderstanding. All the Microsoft consoles are broken regarding encoding. That does not mean that all console programs that run in Windows must be broken.
I think you presume Unix-y systems are best and you criticize Windows without understanding how it works. I am only explaining facts. There are many reasons to criticize Windows and Microsoft. But there are reasons to criticize Unix-y systems too :wink:.
There are two questions:
1) is it at all possible to display the text so it can be properly read? 2) do you want to go to the trouble?
1) If you use a good font and Windows Unicode applications, it will be properly readable. But by console design (a fixed array of columns/rows of characters), line returns are unavoidable. To avoid breaking words, the application which writes the output would have to know what a word is, verify that there is enough space to write the word, and go to the next line if not... The dotted circle is drawn automatically by the glyph rendering API (i.e. the system text-drawing API). I am not sure I understand your problem. In a console, all non-ideographic languages get word breaking. Nobody complains. If you see a real problem, be factual and explain it clearly.
2) In fact, I spent my weekly ConsoleZ time quota reading your latest comment. Every time I read a long comment, a new feature doesn't get coded...
You might also double your number of users. (Which brings up another question: do you want more users?)
I have contributed to this project because I have used it for a long time and I need more features. In the GPL spirit, I provide my modifications freely to all users. It's free, really free: no ads, no paid download... I do not earn anything; I just lose time if a feature is useless to me. I'm more interested in doubling the number of contributors than the number of users.
I'm afraid you have misunderstood me on almost every point. I am sorry that I have been unable to explain the basic details to you.
I do not mean to criticise you personally, and I only point out that the problem of encoding in Windows console environments is widely considered to be buggy. I provided references -- this is not only my idea.
I too work with Windows professionally, as well as Linux, and I've been using both almost since they first came out. I'm a long-time programmer, too. Thirty years ago it was C, then C++ then Java, and meanwhile a lot of Perl and Python and so on and so on. I do know something about these things.
Most files on unix are "C"-encoded -- well, that isn't really an encoding -- it basically means the file is treated, as you said, as binary. That is surely true if you're talking about system files. But users whose language can be encoded only with Unicode will have their default encoding set to Unicode, and most of their own text files will be Unicode-encoded, so that the textual content can be interpreted as text in their spoken language. This is the situation I've been talking about.
As we have demonstrated to you in images, Hindi text is mangled in one way or another by your program, and by every other Windows terminal emulator we can find. It is not right; sometimes it is so bad it's unreadable. It is possible to make a terminal emulator that does work well for Hindi and other complex scripts. It would take some thought and programming effort, however.
I applaud you for your time and unpaid effort on this free project. And I respect your decision not to spend time on issues that do not interest you. Thank you for your time!
(Unicode is not a text encoding; it's a consortium providing a computing industry standard. It manages UTF-8, UCS-2, ...)
I have seen only one real bug in all your comments: the 'type' command (cmd.exe) produces some garbage in the Windows console buffer.
I will ask Microsoft.
(I guess the text conversion is done per block. In UTF-8, characters don't have a fixed size, so a character overlapping two blocks could be broken.)
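That guess is easy to illustrate with a sketch (mine, not taken from cmd.exe or ConsoleZ). Devanagari letters are three bytes each in UTF-8, so a fixed block size that is not aligned to character boundaries splits a letter across two blocks; converting each block on its own leaves invalid byte sequences at the seam, which MultiByteToWideChar then turns into U+FFFD or drops, depending on flags and Windows version:

```cpp
// Sketch: per-block UTF-8 -> UTF-16 conversion can split a multi-byte
// character at a block boundary and corrupt it.
#include <windows.h>
#include <algorithm>
#include <string>
#include <vector>

std::wstring convertPerBlock(const std::string& utf8, size_t blockSize) {
    std::wstring out;
    for (size_t pos = 0; pos < utf8.size(); pos += blockSize) {
        int len = static_cast<int>(std::min(blockSize, utf8.size() - pos));
        int needed = MultiByteToWideChar(CP_UTF8, 0, utf8.data() + pos, len, nullptr, 0);
        if (needed <= 0) continue;
        std::vector<wchar_t> buf(static_cast<size_t>(needed));
        MultiByteToWideChar(CP_UTF8, 0, utf8.data() + pos, len, buf.data(), needed);
        out.append(buf.begin(), buf.end());
    }
    return out;
}

// "का" (KA + vowel sign AA) is the 6 bytes E0 A4 95 E0 A4 BE in UTF-8.
// A 4-byte block cuts the vowel sign in two, so converting per block does not
// give the same result as converting the whole string at once.
// std::wstring broken = convertPerBlock("\xE0\xA4\x95\xE0\xA4\xBE", 4);
```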
I don't understand why there is so much mystery about where to find a monospace font that supports Devanagari...
Hi StevanWhite,
I am not able to open a Hindi text file in the command prompt in Windows 10.
Even after running the command chcp 65001 (for UTF-8), rectangular boxes (as shown in the attached screenshot) are displayed instead of the Hindi text.
Please suggest a solution. Thanks in advance.
Hi angtany,
First, the Windows console is an old thing, just not set up for display of Indic scripts. Other console applications exist, which do a better job. None that I have found is perfect. (All I have tried screw up when a word is wrapped at the wrong point.)
I found the KDE program Konsole on Linux works pretty well; a Windows program called ConsoleZ works pretty well on Windows. (Also, if you want a text editor, Notepad++ works pretty well.)
Second, there are several things that have to be right.
You need a monospaced Devanagari font. There aren't many. FreeMono, as in the GNU FreeFont SVN, will work. Install the font, and arrange for the console application to use it.
In the console, you have to change the code page: chcp 65001
In PowerShell, you have to do other things to get Unicode output to the console properly: Get-Content -Encoding UTF8 filename.txt
Let me know if that helps!
In the display of Hindi text (in Devanagari script), I'm encouraged to see that much of the complex reordering of letters (Indic font shaping) is being carried out.
There are a couple of ugly bugs, however. They both seem to occur randomly in the text. The effect is independent of the font being used for display (provided a font that supports Devanagari is installed!). I have been unable to identify anything in the text itself that consistently triggers the problem -- the same word will look fine here, but show trash there.
This is not a matter of simple encoding -- otherwise no Hindi would appear at all. It appears to be some glitch in the software. udhr_hin.txt
A lot of people are looking for a good replacement for the system consoles that displays Indic text correctly. And there aren't many options out there.
Expected Behavior
Nice clean Hindi text to be displayed
Actual Behavior
Most of the text looks pretty good. But there are glitches throughout. I see two kinds:
1) a sequence of two or three U+FFFD ("replacement character") glyphs appear, sometimes between words. I can only imagine this is some sort of bug.
2) a U+25CC ("dotted circle") appears, often within a word. Sometimes it appears to replace a letter... a letter which is displayed just fine in the next word.
The attached image is a screenshot of the end of the attached text file.
The word किसी appears several times under the heading अनुच्छेद ३०. In its second occurrence, there is a dotted circle before the i-glyph, but the i-glyph is in the right place.
Under अनुच्छेद २९, in part (१) the word का gets a couple of unknown characters before it; I don't see anything else broken. In part (२) the word प्रजातन्त्रात्मक is broken: partly replaced by unknown characters, and it looks like the letter त is lost.
Steps to reproduce
Make sure some fonts are installed that support Devanagari, copy the attached file udhr_hin.txt to the main user directory, and open ConsoleZ. Then run:
chcp 65001
type udhr_hin.txt
Diagnostic Report
When reporting a bug you must provide a diagnostic report. If you are not able to create a diagnostic report, explain why. Privacy is not a valid explanation! The report is human readable and private data can be masked.