Update: `readFileSync` is not going to be an acceptable solution, especially because this is done on file load.
Caveats:
- `text-editor.js` is incorrectly detected as cp1252 instead of utf8.
- The minimum confidence level is 0.95. Is that too low? It will definitely take longer to figure out the encoding the higher it is, but maybe that's acceptable.

It's late on a Friday and I'm in a hurry, so please forgive me if any of this feedback is missing important details.
My concern with the approach in this group of PRs is that we do a lot of redundant I/O when we're detecting a non-default encoding: we first do I/O in superstring to load the buffer before detecting the encoding, then we do I/O again to feed bytes to jschardet, and then we update the encoding, triggering another reload.
It seems like we want to detect the encoding before loading the buffer at all. There are two options, of varying implementation difficulty:
The first is to keep detection in JS and do a limited amount of I/O in `TextBuffer.load` in order to feed some bytes to jschardet. I would want a guarantee that we stop if we haven't detected the encoding after some number of chunks, so as not to read the entire file (see the sketch below). Note that there are two code paths in `TextBuffer.load`, one taking a path and one taking a file.
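To make that first option concrete, here is a minimal sketch of bounded detection. It assumes jschardet's `detect(buffer)` API, which returns `{encoding, confidence}`; the chunk size, the chunk cap, and the `detectEncodingFromFile` name are all made up for illustration, not taken from `TextBuffer.load`:

```js
const fs = require('fs')
const jschardet = require('jschardet')

const CHUNK_SIZE = 64 * 1024
const MAX_CHUNKS = 4          // hard stop so we never read the whole file
const MIN_CONFIDENCE = 0.95   // the threshold discussed above

// Illustrative only: not an actual TextBuffer method.
function detectEncodingFromFile (filePath, callback) {
  fs.open(filePath, 'r', (error, fd) => {
    if (error) return callback(error)

    const chunks = []
    const readNextChunk = (chunkIndex) => {
      const chunk = Buffer.alloc(CHUNK_SIZE)
      fs.read(fd, chunk, 0, CHUNK_SIZE, null, (error, bytesRead) => {
        if (error) return fs.close(fd, () => callback(error))
        if (bytesRead > 0) chunks.push(chunk.slice(0, bytesRead))

        // Re-run detection on everything read so far.
        const result = jschardet.detect(Buffer.concat(chunks))
        const confident = result.encoding && result.confidence >= MIN_CONFIDENCE
        const exhausted = bytesRead < CHUNK_SIZE || chunkIndex + 1 >= MAX_CHUNKS

        if (confident || exhausted) {
          fs.close(fd, () => callback(null, confident ? result.encoding : null))
        } else {
          readNextChunk(chunkIndex + 1)
        }
      })
    }
    readNextChunk(0)
  })
}
```

A bounded loop like this caps the extra I/O at `MAX_CHUNKS * CHUNK_SIZE` bytes before the real load, at the cost of rerunning jschardet on the accumulated bytes each iteration.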
The other approach is more efficient but harder to implement: do the encoding detection in superstring, in the `load_file` method. This method already does I/O and performs an encoding conversion. Perhaps it could be modified to optionally detect the encoding from the first batch of read bytes before converting. This would avoid the redundant I/O, but raises the difficulty and risk.
@maxbrunsfeld Do you have thoughts on this? How bad would it be to do enough I/O in the JS side of buffer loading to detect the encoding?
I'm not sure how long encoding detection takes, and how much data needs to be loaded. If we want it to happen on a background thread as part of superstring's file loading, we could definitely do it in C++. The uchardet C++ library seems to be used by a lot of things.
EDIT: Hmm, :point_up: that library is MPL-licensed. Not sure if that's ok for Atom.
@50Wliu want to try your hand at some C++?
We'll see 😬 I'll probably need a lot of hand-holding though. Will be my first time writing anything C/C++.
Ok, I took a look at uchardet. This is my first time trying to understand C++ code so my analysis could be way off.
It looks like their model is "give us as many bytes as you have, and call `uchardet_get_charset(ud)` once you're finished". The uchardet header doesn't expose confidence information, so we can't opportunistically short-circuit once the confidence is high enough; on the flip side, uchardet could hand us a charset that it's only 20% confident about and we wouldn't know any better.
So "detect[ing] the encoding based on the first batch of read bytes" may be unreliable using uchardet. It would also misreport the encoding if the special characters are only later on in the file.
Thoughts?
@50Wliu In an auto-detection scenario, it might be okay to buffer up all the bytes (or up to some reasonable maximum) without transcoding them, in order to hand them to uchardet first. This would still be better than performing the actual I/O twice.
Not sure if this helps, but it looks like this is what VS Code does for detection: https://github.com/Microsoft/vscode/pull/21416/files. They're also using the jschardet library.
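For reference, the general shape of that approach (running jschardet on a limited prefix, then falling back to a default encoding when the confidence is too low) looks roughly like the sketch below. The byte limit and the function name are illustrative assumptions, not values taken from the VS Code PR:

```js
const jschardet = require('jschardet')

const AUTO_GUESS_BUFFER_MAX = 512 * 8  // assumption: sample only a small prefix
const MIN_CONFIDENCE = 0.95

function guessEncodingFromBuffer (buffer, defaultEncoding = 'utf8') {
  const {encoding, confidence} = jschardet.detect(buffer.slice(0, AUTO_GUESS_BUFFER_MAX))
  if (encoding && confidence >= MIN_CONFIDENCE) return encoding.toLowerCase()
  return defaultEncoding
}
```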
This is a possible fourth part to the move-encoding-detection-into-core saga. If merged, it will allow https://github.com/atom/text-buffer/pull/220 to use these methods instead of having to require jschardet and iconv-lite by itself.
One thing that I'm not happy about is the use of `readFileSync`. Making the whole chain use callbacks feels ugly to me and is also inconsistent with `setEncoding` and `getEncoding`, though it's definitely a possibility (see the sketch at the end).

Fixes atom/encoding-selector#51
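For illustration, here is roughly what the two shapes being weighed above look like side by side. The function names are hypothetical, not the actual methods added in this PR:

```js
const fs = require('fs')
const jschardet = require('jschardet')

// Sync variant: stylistically consistent with setEncoding/getEncoding, but
// blocks on I/O at file-load time, which is the objection above.
function detectEncodingSync (filePath) {
  return jschardet.detect(fs.readFileSync(filePath)).encoding
}

// Callback variant: avoids blocking, but forces the whole call chain to
// become asynchronous.
function detectEncoding (filePath, callback) {
  fs.readFile(filePath, (error, buffer) => {
    if (error) return callback(error)
    callback(null, jschardet.detect(buffer).encoding)
  })
}
```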