atom / encoding-selector

Pick an encoding for the current editor
MIT License
21 stars 28 forks source link

Support for Unicode encodings with BOM #18

Open notpeelz opened 9 years ago

notpeelz commented 9 years ago

Variants of common Unicode encodings aren't supported, namely, UTF-8 w/ BOM, UTF-16 w/ BOM, etc.

I feel like Atom should natively support all Unicode encodings, even the most esoteric ones (e.g. UTF-32)

kevinsawicki commented 9 years ago

Can you link to a file that is currently failing to auto detect properly?

notpeelz commented 9 years ago

The problem is not so much as to how files are read, but how they're saved; upon saving a AutoHotkey script, the file is saved in regular UTF-8. This is where the issue lies—it seems that the AHK engine has a non-standard reading-byte order.

kevinsawicki commented 9 years ago

Can you link to an AutoHotkey script in a repo somewhere that Atom is corrupting when it is saved?

notpeelz commented 9 years ago

Expected result (UTF-8 w/ BOM): image

Actual saved file (created within Atom): image

The issue can be easily reproduced by copying & pasting the "ù" (C3 B9) character (or any codepoint using 2+ bytes) into Atom and then saving it; the BOM sequence clearly doesn't appear upon saving, as it is atypical for the UTF-8 encoding.

For what it's worth, here's the relevant script shown in the screenshots:

#SingleInstance, Force
#UseHook, On
#NoTrayIcon

!+u::
!u::
  If (GetKeyState("Capslock", "T") ^ GetKeyState("Shift", "P"))
    SendInput, Ù
  Else
    SendInput, ù
Return

!`;::`

Just for clarifications: Atom isn't actually corrupting anything, it just doesn't have inherent support for creating files in certain Unicode encodings.

heennkkee commented 9 years ago

Hi, Is there anything planned to fix this issue? Having some problems with asp classic and Atom because of this right now, and would really like to use Atom for these projects but unable to at the moment due to that.

imamatory commented 8 years ago

Unable to work with atom cause this issue. Failing with cyrillic symbols, with BOM that's ok. Very hope it planned to fix this.

nicktimko commented 8 years ago

I hit this bug when attempting to view a PowerShell redirection logfile (presumably they're all the same encoding?). Atom doesn't recognize the file as UTF-16 LE

PS> echo test > utf16le-with-bom.txt

image

The annoying part was that even though it looked kinda OK as UTF-8 (except for the BOM that it rendered), the null characters between every printable character totally broke searching.

rkrv commented 8 years ago

This is really problematic when working with .Net and Razor templates. Our project uses UTF-8 with BOM and IIS simply freaks out when it sees international characters in views if the file wasn't saved with BOM.

samjgalbraith commented 8 years ago

SQL server does not deal well with BOM at the start of scripts, and this causes deploy problems for us. I have to open and save in a different text editor that explicitly allows UTF-8/UTF-8 BOM distinction so that I can make sure there's no BOM there otherwise the deploy fails. So for us this is an irritating inconvenience.

nicktimko commented 8 years ago

Maybe of-interest is that Visual Studio Code (also an Electron/JS+Chromium app) detects them correctly. Any intrepid JS programmers want to do some copy-pasting?

nicktimko commented 8 years ago

Actually, looking at a recent PR #32, how does adding an entry to the dictionary magically cause atom to support a new encoding? Are they just those that Chromium natively supports but need to be "enabled" by having such an entry? Is it just that easy for "UTF-x with BOM"?

VSC seems to sniff out the BOM from the buffer...iconv-lite (npm package) says it strips off the BOM later (used by VSC and Atom?)...

skalka7 commented 8 years ago

This is really needed, due to this issue I need to save every Sass partial with a different editor (Sublime) first, then I can start to work with Atom. Sometimes I commit just to add this BOM to every file. I can live with it, but I will be glad to see this function implemented.

nixel2007 commented 8 years ago

@skalka7 feel free to use this quick hack - https://gist.github.com/nixel2007/ed991d8001fbb59689fb change the extension in third line and add this to you init.coffee script

skalka7 commented 8 years ago

Thanks @nixel2007, it's a workaround but it works. One thing: on third line slice size has to be adjusted according to the extension (-5 in my .scss case).

marconn commented 8 years ago

Digging a little in this package i found the reason why UTF-8 w/ BOM, UTF-16 w/ BOM are not working

It's due to JsChardet version. This package uses ^1.1.0 (look at package.json). Somehow, npm doesn't get the latest version so it uses 1.1.0 as default. Checking JsChardet commit history, version 1.1.0 doesn't support UTF-8 with BOM.

Digging a little more, i downloaded the atom repo, built it and set in encoding-selector's package.json (located in node_modules folder) the latest version for JsChardet (^1.4.1) directly.

Then to solve UTF-32 issue, i added in main.coffe options to select UTF-32 BE and UTF-32 LE. image

After that i built atom again, run it and opened a utf-8 with bom file and it looked good. One thing that I noted was that UTF-16 files are auto-detected as UTF-8. image

I didn't test UTF-32 since either notepad++ or sublime can save that kind of files, so, i wasn't able to create a file to make a quick test.

Finally, i didn't make a pull request due to my job contract limitations :cry: so, hopefully someone can create one to test it further, resolve the issue and make atom better :smile:

50Wliu commented 8 years ago

It's due to JsChardet version. This package uses ^1.1.0 (look at package.json). Somehow, npm doesn't get the latest version so it uses 1.1.0 as default.

^1.1.0 should support 1.4.1. Furthermore, the 1.1.0 release explicitly says Detects UTF w/ BOM strings when streamed. So I'm not sure why "upgrading" to 1.4.1 fixed the issue in your case, but I'm highly suspect that that's not the root cause here.

SuneRadich commented 7 years ago

I am running Atom 1.11.2, and I have issues with UTF-8 with/without BOM. Shouldnt it have been resolved?

MFernstrom commented 7 years ago

I'm at 1.12.7, this is still an issue, forcing me to abandon an otherwise nice editor.

wolfsgrove commented 7 years ago

Indeed, this seems to be a problem to me as well. I'm modding Stellaris, and the localisation files need to have BOM, and files created with Atom don't have that and there's no way to add BOM manually within Atom.

scanarjo commented 7 years ago

I'm also encountering UTF-16 LE w/BOM being detected as UTF-8 😞

jjarocewicz commented 7 years ago

This really needs resolved. I love this editor, but am forced not to use it because of this issue. I simply need to be able to see the BOM's in text that I copy/paste from Photoshop. Brackets does this beautifully but inserting a bright red dot where the BOM is. In atom there's no indication at all, which is causing rendering issues due to these BOM's.

kuzyn commented 7 years ago

I have resorted to a kludgy solution gleaned off SO, but I share the above annoyances

apdrsn commented 7 years ago

I also have this problem

fuba82 commented 6 years ago

same here

looks like this in Notepad++: image3

and this is in Atom: image5

not very funn to convert all the files before open in Atom... :/

TristanBomb commented 6 years ago

It's a little strange that Atom doesn't support this. There are a couple games out there that, when modding, require BOM format. Now I'm forced to either use an inferior editor, or to juggle around bash commands every time I want to save something.

abelbraaksma commented 6 years ago

I just started out with Atom, but unfortunately, this is an absolute dealbreaker, as working with source code files is a requirement for me and this prohibits working with source code in many ways (and I see Atom as a source code editor).

As a result of this bug, it is impossible to edit XML, (some) HTML and many other source files. Unicode is ubiquitous since about two decades and the algorithm to detect the encoding is rather straightforward and simple. I don't know enough of Atom, otherwise I'd create a PR myself. Anybody attempting this but with little Unicode knowledge, I suggest to read Splolsky's famous article: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

As a bonus, the encoding-detection for XML files (and by extension, HTML and related standards) should be done by parsing the prolog (i.e., the part that reads something like <?xml version="1.0" encoding="utf-16"?>). The XML standard gives a clear an unambiguous algorithm to determine the encoding. This algorithm can also be used for files without a prolog, see: https://www.w3.org/TR/xml/#sec-guessing

Note that the W3C consortium says that at a very minimum, XML/HTML processors (and thus also, editors) must support UTF-8/UTF-16 and US-ASCII encodings. UTF-32 is not in that list (nor are older encodings like UCS-2 or UTF-7). Each encoding can be with/without BOM, and can be LE or BE. Since UTF-8 (with/without BOM) is common on the Web, UTF-16 (mostly with BOM) and sometimes UTF-8 (with BOM) is common in Windows, and UTF-8 and UTF-32 are common in Linux/Unix, I think it makes sense to support these properly (see http://unicode.org/faq/utf_bom.html).

Note also that, just about every standard out there requires Unicode support, so the lack of this means lack of (fully) supporting widely accepted standards from W3C, OASIS, ISO etc, etc.

Here's an example of opening just some XML log file on Windows in Atom, essentially unreadable/unsuitable to be viewed or edited:

image

50Wliu commented 6 years ago

@abelbraaksma Have you explicitly set the encoding to the encoding of the file? Atom defaults to UTF-8 for every file but it can be changed. UTF-16 is in that list, and last time I tried this out UTF-8 with BOM worked when UTF-8 was selected.

abelbraaksma commented 6 years ago

@50Wliu, I'm not sure what to specify where, could you elaborate? This is just out-of-the box behavior of Atom. I'd rather not have Atom default to UTF-16, as then it would wreck my UTF-8 files. I'd rather have a way that it understands the five most basic of Unicode encodings (preferably out of the box): UTF-8, UTF8/BOM, UTF-16 (must always have BOM!), UTF-16-LE, UTF-16-BE.

50Wliu commented 6 years ago

You should see something like this in the lower-right corner: status-bar-right Try changing UTF-8 to UTF-16 LE/BE, or selecting the Auto Detect option.

abelbraaksma commented 6 years ago

@50Wliu, thanks, never thought of clicking in the statusbar, but yes, that is a good workaround.

Still, this is a bug, as when you open an XML file that is clearly a UTF-16 (or UTF-16-BE/LE) file, which means, in the case of a BOM (my file), it is always an invalid UTF-8 file (that is, UTF-8 cannot start with 0xFF, 0xFE), and in the absence of a BOM, it becomes invalid as soon as there's a codepoint > 0x7F.

So I think Atom has a (small, but poignant) problem here, as it opens such files with the wrong defaults (strangely, explicitly selecting "auto-detect" does detect the encoding correctly).

50Wliu commented 6 years ago

So I think Atom has a (small, but poignant) problem here

Yes, we're well aware of this :/. I have tried solving it, and I'm 90% of the way there, but as you probably know the last 90% takes 90% of the time. Basically, auto-detection works, but I'm not sure how to handle the time between the file opening and the auto-detection finishing (as the two are both asynchronous). Any edits made to the file before auto-detection finishes need to be handled correctly somehow and that's the last part that I'm stuck on.

N0x1k commented 6 years ago

+1 having the same issues. Incorrect identification of the formatting opens in UTF-8 instead of UTF-16 LE) and the saving doesn't seem to be working for UTF-16 LE as well. I really like Atom, but this is really a show stopper for me...I have to change the formatting every time I want to open logs from PowerShell and I basically can't use it for VBS scripts the I'm working with every day.

abelbraaksma commented 6 years ago

Any edits made to the file before auto-detection finishes need to be handled correctly somehow and that's the last part that I'm stuck on.

@50Wliu, I think it shouldn't be possible to edit a file before you have its encoding correctly, because such edits are undeterministic by their very nature (i.e., consider typing a newline, and you insert a byte 0x0D somewhere in a position X in the file, only to find out that the encoding is of variable length (UTF-8) and position X was based on an assumed UTF-16).

So, in short, I think the solution should be: don't allow editing until you have detected the encoding.

Since encoding detection only requires reading the first 3 or so bytes, and editing will usually happen somewhere after these first three bytes, plus showing the file and parsing it with the chosen font requires a lot of time comparably, I think it is safe to disallow editing while you detect the encoding (and I think you'll be hard pressed to edit at all before encoding is finished).

One other reason for blocking editing: you shouldn't even display a file before encoding finishes (or you have to render it twice, which is suboptimal), so how is editing possible at all?

50Wliu commented 6 years ago

@abelbraaksma Thanks for the feedback! I'm not sure if a file can be marked as read-only but I'll look into it.

Since encoding detection only requires reading the first 3 or so bytes

Unfortunately, this is only true for Unicode-esque encodings (see https://github.com/aadsm/jschardet/blob/28152dd8db5904dc2cf9aa12ef4f8783f713e79a/src/universaldetector.js#L75-L106). For other encodings it's a guessing game that requires looking over a considerable portion of the file.

abelbraaksma commented 6 years ago

For other encodings it's a guessing game that requires looking over a considerable portion of the file.

Yes, that's true, and that guessing game can get it wrong or it is simply provably inconclusive (if a file only contains the word "hat" in ASCII, it can also be meant to have been encoded in any of the ISO-8859 variants, or CP1252 and if you don't know it was meant to mean "hat", it can be UTF-8-without-bom, or even EBCDIC).

Though the guessing-game with certain known file formats is a whole lot easier, i.e. XML has the encoding attribute in the prolog, HTML has its meta-info and so on. For such formats, it is not a guessing game, it is deterministic.

But for the non-deterministic cases you have the ability to change the encoding afterwards if the guessing is wrong.

I'm not sure if a file can be marked as read-only but I'll look into it.

@50Wliu, If you can't make it read-only, just don't display it before you know what it is. I mean, with coloring it doesn't matter that the coloring is asynchronous, but you really don't want to display a page prior to you knowing the encoding (consider what would happen if you guessed UTF-8 only to find out after 1 second that it is UTF-16, or Big-5, and then re-rendering). Users would go berserk and may think a virus is playing games with them ;).

SpadeAceman commented 6 years ago

@50Wliu, this developer took what I think is a very practical and realistic approach to automatic detection of Unicode text encodings, with and without BOM.

https://github.com/AutoItConsulting/text-encoding-detect

(The code itself may not be of direct use to you, but the developer explains the detection logic very well in the README.)

Aside from the reasonable detection methods described, a critically useful assumption mentioned at the end is that null characters (0x00) realistically won't be present in legitimate UTF-8 text. Implementing this assumption in Atom would immediately resolve the documented issues above, where UTF-16 files are being mistakenly auto-detected as UTF-8.

abelbraaksma commented 6 years ago

is that null characters (0x00) realistically won't be present in legitimate UTF-8 text

Not quite. 0x00 is a valid codepoint in Unicode and when encoded for UTF-8, it will be presented as a 16-bit zero value. It is true that text shouldn't contain null-characters, but that's an age old discussion and Unicode is designed to be liberal in what it accepts, which means that the old ASCII NUL (yes, it was valid in ASCII too) is valid in Unicode, and therefore in UTF-8.

That aside, the algorithm for "if you know it should be unicode, then what is the encoding used" is deterministic. Any codepage-related encoding (like the ISO-8859-X family, or EBCDIC) is never deterministic and can only be guessed. Some special cases, like Shift-Jis, Big5 and GB18030 are relatively easy to detect and in 99% can be deterministic. Special formats (known source code formats, XML, HTML, SVG) can be detected deterministically.

wismill commented 6 years ago

Is it possible to add a special encoding "utf-8-bom"? It is supported by Notepad++ and Python for example. Not having this option is a dealbreaker to work with some formats that rely on it, especially on Windows.

Buzut commented 5 years ago

When saving to utf-16, no BOM is included, this should be the case, or at least an option should be provided.

wbt commented 5 years ago

Cross-linking to some relevant content: ethereum/go-ethereum#19905 (Windows powershell's "echo" includes BOM that breaks things)

hsandt commented 4 years ago

Cross-linking to https://github.com/atom/atom/issues/3427 which is about BOM character being displayed at the beginning of file, and being copied when copying the first line.

Ideally we should have all the BOM variants added to the encoding list, and auto-detect them when opening the file.

NexthinkGuru commented 4 years ago

How is this still an issue! come on, its easy enough.

abelbraaksma commented 4 years ago

Luckily, it's open source, anybody finding or knowing it's easy, is free to open up a PR with a suggested fix. This thread contains enough background to get a good idea of the pitfalls and expected behavior.

jmoleano commented 2 years ago

FYI this is still a problem today... If you're like me and landed on this page you may stop your search because there is still no solution.