Update: `readFileSync` is not going to be an acceptable solution, especially because this is done on file load.
Caveats:
- `text-editor.js` is incorrectly detected as cp1252 instead of utf8.
- The minimum confidence level is 0.95. Is that too low? It will definitely take longer to figure out the encoding the higher it is, but maybe that's acceptable.

It's late on a Friday and I'm in a hurry, so please forgive me if any of this feedback is missing important details.
My concern with the approach in this group of PRs is that we do a lot of redundant I/O when we're detecting a non-default encoding: we first do I/O in superstring to load the buffer before detecting the encoding, then we do I/O again to feed bytes to jschardet, and then we update the encoding, triggering another reload.
It seems like we want to detect the encoding before loading the buffer at all. There are two options, of varying implementation difficulty:
The first is to keep detection in JS and do a limited amount of I/O in `TextBuffer.load` in order to feed some bytes to jschardet. I would want a guarantee that we stop if we haven't detected the encoding after some number of chunks, so as not to read the entire file (see the sketch below). Note that there are two code paths in `TextBuffer.load`, one taking a path and one taking a file.
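To make that first option concrete, here is a minimal sketch of bounded detection. It assumes jschardet's `detect(buffer)` API, which returns `{encoding, confidence}`; the chunk size, the chunk cap, and the `detectEncodingFromFile` name are all made up for illustration, not taken from `TextBuffer.load`:

```js
const fs = require('fs')
const jschardet = require('jschardet')

const CHUNK_SIZE = 64 * 1024
const MAX_CHUNKS = 4          // hard stop so we never read the whole file
const MIN_CONFIDENCE = 0.95   // the threshold discussed above

// Illustrative only: not an actual TextBuffer method.
function detectEncodingFromFile (filePath, callback) {
  fs.open(filePath, 'r', (error, fd) => {
    if (error) return callback(error)

    const chunks = []
    const readNextChunk = (chunkIndex) => {
      const chunk = Buffer.alloc(CHUNK_SIZE)
      fs.read(fd, chunk, 0, CHUNK_SIZE, null, (error, bytesRead) => {
        if (error) return fs.close(fd, () => callback(error))
        if (bytesRead > 0) chunks.push(chunk.slice(0, bytesRead))

        // Re-run detection on everything read so far.
        const result = jschardet.detect(Buffer.concat(chunks))
        const confident = result.encoding && result.confidence >= MIN_CONFIDENCE
        const exhausted = bytesRead < CHUNK_SIZE || chunkIndex + 1 >= MAX_CHUNKS

        if (confident || exhausted) {
          fs.close(fd, () => callback(null, confident ? result.encoding : null))
        } else {
          readNextChunk(chunkIndex + 1)
        }
      })
    }
    readNextChunk(0)
  })
}
```

A bounded loop like this caps the extra I/O at `MAX_CHUNKS * CHUNK_SIZE` bytes before the real load, at the cost of rerunning jschardet on the accumulated bytes each iteration.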
The other approach is more efficient but harder to implement: do the encoding detection in superstring, in the `load_file` method. This method already does I/O and performs an encoding conversion. Perhaps it could be modified to optionally detect the encoding from the first batch of read bytes before converting. This would avoid the redundant I/O, but raises the difficulty and risk.
@maxbrunsfeld Do you have thoughts on this? How bad would it be to do enough I/O in the JS side of buffer loading to detect the encoding?
I'm not sure how long encoding detection takes, and how much data needs to be loaded. If we want it to happen on a background thread as part of superstring's file loading, we could definitely do it in C++. The uchardet C++ library seems to be used by a lot of things.
EDIT: Hmm, :point_up: that library is MPL-licensed. Not sure if that's ok for Atom.
@50Wliu want to try your hand at some C++?
We'll see 😬 I'll probably need a lot of hand-holding though. Will be my first time writing anything C/C++.
Ok, I took a look at uchardet. This is my first time trying to understand C++ code so my analysis could be way off.
It looks like their model is "give us as many bytes as you have, and call `uchardet_get_charset(ud)` once you're finished". The uchardet header doesn't expose confidence information, so we can't opportunistically short-circuit once the confidence is high enough; on the flip side, uchardet could hand us a charset that it's only 20% confident about and we wouldn't know any better.
So "detect[ing] the encoding based on the first batch of read bytes" may be unreliable using uchardet. It would also misreport the encoding if the special characters are only later on in the file.
Thoughts?
@50Wliu In an auto-detection scenario, it might be okay to buffer up all the bytes (or up to some reasonable maximum) without transcoding them, in order to hand them to uchardet first. This would still be better than performing the actual I/O twice.
Not sure if this helps, but it looks like this is what VS Code does for detection: https://github.com/Microsoft/vscode/pull/21416/files. They're also using the jschardet library.
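For reference, the general shape of that approach (running jschardet on a limited prefix, then falling back to a default encoding when the confidence is too low) looks roughly like the sketch below. The byte limit and the function name are illustrative assumptions, not values taken from the VS Code PR:

```js
const jschardet = require('jschardet')

const AUTO_GUESS_BUFFER_MAX = 512 * 8  // assumption: sample only a small prefix
const MIN_CONFIDENCE = 0.95

function guessEncodingFromBuffer (buffer, defaultEncoding = 'utf8') {
  const {encoding, confidence} = jschardet.detect(buffer.slice(0, AUTO_GUESS_BUFFER_MAX))
  if (encoding && confidence >= MIN_CONFIDENCE) return encoding.toLowerCase()
  return defaultEncoding
}
```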
This is a possible fourth part to the move-encoding-detection-into-core saga. If merged, it will allow https://github.com/atom/text-buffer/pull/220 to use these methods instead of having to require jschardet and iconv-lite by itself.
One thing that I'm not happy about is the use of `readFileSync`. Making the whole chain use callbacks feels ugly to me and is also inconsistent with `setEncoding` and `getEncoding`, though it's definitely a possibility (see the sketch at the end).

Fixes atom/encoding-selector#51
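For illustration, here is roughly what the two shapes being weighed above look like side by side. The function names are hypothetical, not the actual methods added in this PR:

```js
const fs = require('fs')
const jschardet = require('jschardet')

// Sync variant: stylistically consistent with setEncoding/getEncoding, but
// blocks on I/O at file-load time, which is the objection above.
function detectEncodingSync (filePath) {
  return jschardet.detect(fs.readFileSync(filePath)).encoding
}

// Callback variant: avoids blocking, but forces the whole call chain to
// become asynchronous.
function detectEncoding (filePath, callback) {
  fs.readFile(filePath, (error, buffer) => {
    if (error) return callback(error)
    callback(null, jschardet.detect(buffer).encoding)
  })
}
```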