atom / encoding-selector

Pick an encoding for the current editor
MIT License
21 stars 26 forks source link

File encoding bug in Atom 1.3.3 when switching between UTF-8 and 'Western (Mac Roman)' #28

Closed lee-dohm closed 8 years ago

lee-dohm commented 8 years ago

From @afischli on January 23, 2016 12:22

Repeat the bug:

1) Open a new file and save it as 'test-encoding.txt' 2) Switch encoding via Status-bar to 'Western (Mac Roman)' 3) Enter some Umlaute, e.g. 'äöüÄÖÜ' and save the file 4) Switch to 'UTF-8' encoding 5) Above entered string is transmogrified to '������' and becomes unreadable 6) Enter the same string (now in UTF-8) encoding and save the file 7) Switch back to 'Western (Mac Roman)' encoding 8) UTF-8 entered string correctly displays as '√§√∂√º√Ñ√ñ√ú' but originally entered string is corrupted

Bug: Originally with 'Western (Mac Roman)' encoding entered string 'äöüÄÖÜ' should now be restored as originally entered (consistent with behavior of Atom in steps 6..8). However, the string changes to unreadable wrong sequence 'ÔøΩÔøΩÔøΩÔøΩÔøΩÔøΩ'. Perhaps this is a bug, which is related to Atom not understanding the difference between iconv options 'MACROMAN' and 'UTF-MAC' encoding.

I found out later, if you repeat step 4) and 7) (file saved and left unchanged since step 6) you regain the originally entered string. The bug appears only every 2nd time of switching. Since this is quite confusing I believe this is a bug, which should be fixed.

System: Atom 1.3.3, OS X 10.9.5 (Mavericks)

Copied from original issue: atom/settings-view#728

lee-dohm commented 8 years ago

I believe that this is a side-effect of #7 not existing and therefore this may be a duplicate of #7.

In essence, Atom doesn't convert between encodings, the encoding-selector only tells Atom how to interpret the character codes that already exist. Since the encoding for the characters in step 3 result in invalid UTF-8 sequences (the hex values for them are 8A 9A 9F 80 85 86), upon saving the file in step 6 the characters are each changed to the hex sequence EF BF BD which is the UTF-8 encoding of U+FFFD, the Unicode replacement character.

used to replace an incoming character whose value is unknown or unrepresentable in Unicode

The reproduction steps can be simplified to:

  1. Open a new file and save it as test-encoding.txt
  2. Switch encoding via the encoding-selector to "Western (Mac Roman)" (most single-byte character encodings will reproduce this behavior)
  3. Enter some Umlaute, e.g. 'äöüÄÖÜ' and save the file
  4. Use the hex package to examine the contents of the file
  5. Switch to "UTF-8" encoding
  6. Save the file
  7. Use the hex package to examine the contents of the file

Since the encoding-selector currently is only intended to inform Atom of the encoding being employed by the file and not to convert between encodings, it seems as if this is the best possible behavior. If the encoding-selector is given the ability to convert between encodings as recommended in #7, the behavior should end up more to your liking @afischli.

Please let me know if I've misunderstood things.

afischli commented 8 years ago

I fear you may have complicated things unnecessarily. While I fully concur that Atom does and should not convert between encodings, some of the statements you make, need some amendments.

1) No this is NOT a duplicate of #7. What I report on is not a request for a new behavior of Atom, i.e. a feature that would convert among encodings. I report on what I believe is a bug, since Atom behaves inconsistently, i.e. it does convert partly, where it should not and thus creates a mess.

2) You write "Since the encoding for the characters in step 3 result in invalid UTF-8 sequences (the hex values for them are 8A 9A 9F 80 85 86)". I wish to correct/improve on that, since your argument sounds a bit like the behavior of Atom would be justified ('Since ..."). Please note, I believe Atom has no "right" in the first place to change anything of the content of the file, i.e. the hex sequence.

3) The Umlaute entered in "Western (Mac Roman)" have nothing to do with UTF-8 and need to be stored as entered.

4) This also means that saving should actually not matter, since saving should always (remember without any conversion feature) not change anything in the byte sequence (identical, unchanged hex sequence). Remember, encoding means only wanted interpretation of the hex sequence.

5) Switching encodings must not change anything. The very same, unchanged hex sequence is either displayed according to the encoding correctly or if the encoding does not match what's in the file, then the display may be "wrong" from a human's perspective, i.e. makes no longer sense. But upon switching back to the original encoding, Atom should always be able to restore what you originally entered.

6) This is clearly not the case with current Atom. But this is demonstrated only by my sequence. Yours you suggest fails to go back to "Western (Mac Roman)" encoding, the essential test. There is no need to ever having to know anything about hex codes if you merely wish to reproduce the bug.

6) Please do not forget what I observed later, i.e. you can restore the original correct display in the original "Western (Mac Roman)" encoding. You simply have to switch back and forth twice. Again, I believe this only confirms the bug.

7) Any consideration of actual conversions may be interesting (#7), but I would suggest to leave that aside and to separate clearly from this bug discussion, since they have nothing to do with each other. At best we might consider explaining a bit better and more conspicuously that the switch in the status-bar does not convert, i.e. does not alter the file content in any way.

8) So please, let's stick to my sequence of steps to reproduce the bug.

On the other hand for the debugging of this bug, of course, there the hex package may come very handy to learn why and when exactly Atom produces this bug and converts bytes wrongly it should leave untouched. My suspicion with no further testing is that it makes the mistake at the very beginning, i.e. uses UTF-8 or something else in the middle of "Western (Mac Roman)" editing by using hex 8A for Umlaut 'ä' (lower case a) instead of the correct hex E4 etc. Or something else is going wrong here.

lee-dohm commented 8 years ago

You make some good points. And if in step 6 (your repro steps) you didn't save the file, I might agree with you that this is a bug. But when you tell Atom to save the file with the encoding UTF-8, it should save a valid UTF-8 file. The only way to do that is to convert the invalid bytes to the U+FFFD character. The unique thing about UTF-8 is that some byte sequences are invalid. Single-byte encodings (such as Mac Roman) do not have this quality. This is why you see the change happen in the byte sequence one way and not the other.

Atom can only work with the information you give it. When you tell Atom to save a file as UTF-8 that has invalid characters in it, it will perform a conversion as expected under Unicode so that the file is a valid UTF-8 file.

lee-dohm commented 8 years ago

My suspicion with no further testing is that it makes the mistake at the very beginning, i.e. uses UTF-8 or something else in the middle of "Western (Mac Roman)" editing by using hex 8A for Umlaut 'ä' (lower case a) instead of the correct hex E4 etc.

It does use 8A for ä in my tests, but I assume that's because I copied the text from this web page which probably uses UTF-8 encoding.

afischli commented 8 years ago

I made some further tests by repeating my sequence of steps 1. to 7. and noticed that changing the encoding does not alter the display. This is wrong, see my previous arguments 4) and 5). The new test steps 1. to 7. without savings show that obviously Atom makes a mistake by showing on the display always the same characters regardless of current encoding and only starts "converting" internally to actually encode the bytes when it saves the file. Then the current encoding matters.

In my view this is wrong and Atom should encode upon entering some text by storing the proper byte sequence in the buffer, so that saving or not saving does not matter. As soon as you change the encoding, the display should change accordingly. Meaning, if I enter some Umlaute while "Western (Mac Roman)" encoding is active, say "äöü" should immediately change its display as I change to UTF-8 encoding. In this example the displayed string should change immediately to "���" when switching encoding to UTF-8 and back to "äöü" when switching encoding back to "Western (Mac Roman)".

Thus to simplify the reproduction of the bug these steps can do it:

1) Switch encoding to "Western (Mac Roman)" 2) Enter "äöü" 3) Switch encoding to UTF-8 4) Displayed string is "äöü"

This is the bug. Correct would be the display of "���", which switches back to "äöü" when going back to "Western (Mac Roman)" encoding. File saving should not matter at all, which of course does also produce incorrect behavior, perhaps another bug in the sense that internally different parts of Atom's program code are involved. But in terms of handling non-converting encoding correctly throughout, it is the same bug. Atom does not do what it should. It should never convert and encode everything always in the current encoding scheme.

afischli commented 8 years ago

I am not aware that Unicode requires byte streams, say a large file of bytes, to be encoded throughout in a consistent manner. AFAIK Unicode is merely an internal standard, which universally assigns ordinal numbers to characters. Encoding of Unicode is an entirely different story and can follow different schemes, e.g. UTF-8 as well as UTF-16 are both Unicode encoding schemes. But they do that differently. For details read perhaps my article on encoding «Encoding or "if what you see is not what you want"» (Evernote requires me to know your e-mail for that).

Why and on what basis should Atom suddenly start converting inconsistently encoded byte sequences, and try to enforce the current encoding on everything entered so far? This would create quite a mess. Say you enter some text in ISO-Latin, then switch to "Western (Mac Roman)", then UTF-16 and now save this to a disk file. Why should UTF-16 be the correct encoding for all? It isn't and it can't. For each portion first of all the encoding should be used that was in use when the text was entered. Any encoding that overlaps with that text fully could also be used of course. But it is easy to find an example with only 3 characters where above sequence is wrong, regardless which of the 3 encoding schemes you choose, one character is always in the way and conversion fails. Of course Atom could detect such encoding inconsistencies and warn the user about the fact or refuse to save (of course not a good idea) or whatever. But the problem remains that many encoding schemes overlap only partly, misleading the user to switch encoding while typing within the overlap. In other words, only humans can fix such mixed encodings, Atom can't do that for you.

Making these arguments I would say a good solution would probably be to try to "test convert" the current file to the new encoding as the user wants to switch, say in a temporary storage only, and then warn the user if the conversion fails (iconv -f -t ). That would also be a good moment when Atom could recommend that he/she should not change the encoding, since previously entered text can no more be properly displayed in the new encoding, so that she/he can still abort the switch. Then a user would still have a choice to either force the switch at the risk that some text, say some Umlaute, would have become "unreadable" until fixed by the author by properly converting to the new encoding or give up on the encoding switch. Of course, in case all previously entered text can also be encoded in the new encoding, i.e. all text is within overlapping domains, e.g. ASCII only can fully be encoded without change in UTF-8, then the switch would not warn the Atom user and all is anyway fine, changing encodings do not matter.

It seems to me the problems in Atom as I have pointed them out are real and should be addressed in the manner as I described. Your "justification" that Atom has to convert seems to me not to be justified, because Atom can never do it right.

Summary: 1) Fix the bug by storing throughout anything entered always in the current encoding and never convert (saving does not matter) 2) Warn the user if an encoding switch may create a "file", i.e. a byte sequence, that contains mixed, incompatible encodings

Of course a third optional measure would be to offer conversion facilities from within Atom, but I believe this is a feature request and has little to do with the bug(s) I discuss here. Moreover, I guess most Atom users can always go to a shell and use 'iconv' to make a conversion.

afischli commented 8 years ago

P.S.: Please remember, 'ä' is Unicode code point U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) and its UTF8 encoding requires two bytes, i.e. hex C3 A4. The only alternative encoding also using Unicode of 'ä' I know of is the 3 byte encoding using two Unicode code points, i.e. U+0061 (LATIN SMALL LETTER A) combined with U+0308 (COMBINING DIARESESIS). So if I'm not mistaken hex 8A can't be UTF-8 encoding for 'ä', since that would need to be at least 2 bytes. It is rather some 1 byte encoding scheme, probably whatever the browser is currently using, e.g. Western (ISO Latin 1) or whatever. Can't say at the moment.

afischli commented 8 years ago

Having given this a bit further thought I have come to a somewhat modified view of this problem, I would call this a conceptual bug, which clearly needs fixing.

First I suggest another procedure to reproduce the bug and demonstrate its effect using a file ('testUmlaute.zip' attached) consisting of 4 bytes, i.e. string "äöü" in hex 'C3 A4 9A FC' meaning first 2 bytes represent Unicode Code point U+0061 in UTF-8 encoding, 3rd byte represents Unicode Code point U+00F6 in 'Western (Mac Roman)' encoding and 4th byte Unicode Code point U+00FC in 'Western (Windows 1252)' encoding (cf. e.g. http://www.utf8-chartable.de).

With this file reproduce the bug and demonstrate its effect by these steps:

1) Open the attached file in Atom (default encoding UTF-8), display: 'ä��' 2) Switch encoding to 'Western (Mac Roman)', display: '√§ö¸' 3) Switch encoding to 'Western (Windows 1252)', display: 'äšü' 4) Switch encoding to UTF-8, display: 'ä��' 5) "Save as..." the file under a new name, e.g. 'testUmlaute-savebug', display: 'ä��' 6) Switch encoding to 'Western (Mac Roman)', display: '√§ÔøΩÔøΩ' 7) Switch encoding to 'Western (Windows 1252)', display: '�'

Atom 1.3.3, OS X 10.9.5 (Mavericks)

The procedure given above is much cleaner, since it is independent of any entering of text and thus narrows the scope of where Atom changes the content considerably. Note, actually only steps 1) and 5) are needed to reproduce the bug, but to demonstrate clearly the effect of the bug the other steps are listed too. This means, up to and including step 4) all is as it should be. Step 5) loses the information and as of step 6) content is altered and displays become accordingly different.

The content of the file has obviously changed, albeit its content was not edited in any way. In my view this is not acceptable. Investigating the content of the newly saved file 'testUmlaute-savebug' is: hex 'C3 A4 EF BF BD EF BF BD'. The original information is lost. The edit buffer is treated differently from what is saved to a file. I call this therefore a conceptual bug.

Comment: Then, while I have some sympathy for not losing the users text while switching to an encoding that can not represent all text, i.e. keeping all entered text intact in the edit buffer, the user must be warned at least that upon saving the file, the editing will be converted and irrecoverably corrupted. Better would be to warn the user right away also when she attempts to switch. Then a 2nd time if there are still '�' characters present in the UTF-8 text at any moment before saving the file. The same warnings need of course to be issued for any other encodings, not only for UTF-8, that cannot fully encode what was originally entered.

testUmlaute.zip

afischli commented 8 years ago

Perhaps with above contribution I best reopen the issue?