1.) Let us assume that the client is not specifying a character encoding (i.e.,
an application using the SCP client)
2.) The software must then always behave identically, no matter where it is
being run. As a consequence, we cannot use the platform's default encoding
(think of, e.g., your application suddenly being run on an EBCDIC machine).
3.) We must therefore pick a default encoding that is a.) present
in every JVM implementation and b.) works for the "typical" use case (e.g., scp
to a remote filesystem with umlauts in filenames)
ISO8859_1 is such a candidate. It is backwards compatible with US-ASCII and
supports most languages in Western Europe (however, it lacks the euro sign).
I am not sure, but I think that every possible character encoding in ISO8859_1
has the same encoding in UTF-8 as well. Before switching to UTF-8 (or any other
encoding) we should be careful, as this may break existing applications that
use the library.
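
To illustrate point 2 above, here is a minimal sketch of encoding a filename with an explicitly named charset instead of the platform default. The class and method names are made up for this example and are not part of the library. ISO-8859-1 is one of the charsets that the `java.nio.charset.Charset` documentation requires every Java platform implementation to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the library under discussion.
final class ExplicitCharsetDemo {

    // Encodes with a fixed charset, so the result does not depend on the host platform.
    static byte[] encodeFilename(String name) {
        return name.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "M\u00fcller.txt"; // filename containing the umlaut U+00FC

        // The platform default varies from machine to machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        // The explicit encoding yields the same bytes everywhere.
        System.out.println("ISO-8859-1 bytes: " + Arrays.toString(encodeFilename(name)));
    }
}
```

Calling `name.getBytes()` without an argument would use whatever `Charset.defaultCharset()` reports, which is exactly the platform dependence described above.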
Original comment by cleondris
on 31 May 2011 at 7:59
Doesn't the SSH specification say something about a default character encoding?
Original comment by dkocher@sudo.ch
on 31 May 2011 at 8:06
I see three important places where encoding matters a lot:
1.) Filenames in SCP. Unfortunately, SCP is not documented as part of the SSH
RFCs. The only reference is the OpenSSH source code (and the rcp source code).
2.) Filenames in SFTP. The latest SFTP draft says "The preferred encoding for
filenames is UTF-8". The draft also defines a mechanism to disable the
translation of native filenames as a fallback.
3.) Starting a shell or a command ("SSH_MSG_CHANNEL_REQUEST", RFC 4254 section
6.5). The RFC says that the "command" argument is stored using the SSH "string"
datatype, but says *nothing* about the encoding used (in other places, the
encoding is explicitly specified, see, e.g., section 5.1
SSH_MSG_CHANNEL_OPEN_FAILURE: the description is a string in ISO-10646 UTF-8
encoding).
Let's see what the specs say in general:
RFC 4251 (in section 4.5) has a general remark about data passed over the SSH
protocol that is shown to the user (e.g., questions in interactive
authentication, banners, etc.): "In most places, ISO-10646 UTF-8 encoding is
used".
When defining the SSH "string" datatype (section 5), the RFC says: "Strings are
also used to store text. In that case, US-ASCII is used for internal names, and
ISO-10646 UTF-8 for text that might be displayed to the user."
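
For reference, a rough sketch of how text ends up on the wire as an SSH "string": RFC 4251 section 5 defines it as a uint32 length followed by that many bytes. The class and method names are invented for this example, and the choice of UTF-8 is an assumption that follows the RFC's wording for user-visible text; as noted above, the RFC does not state an encoding for the "command" argument.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the library.
final class SshStringSketch {

    // Serializes text as an SSH "string": 4-byte big-endian length, then the encoded bytes.
    // Assumption: the text is encoded as UTF-8.
    static byte[] encodeUtf8(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + payload.length) // ByteBuffer is big-endian by default
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    public static void main(String[] args) {
        byte[] wire = encodeUtf8("\u00e4"); // one character, two UTF-8 bytes
        System.out.println("total length on the wire: " + wire.length);
    }
}
```

Note that the length field counts bytes, not characters, so the same text produces a different length prefix depending on the charset chosen.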
I think we should switch from ISO8859-1 to UTF-8 and give the user (where
applicable) the chance to override the default encoding. Also, we need to
clearly document this in the release notes.
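
A possible shape for such an override, as a sketch only: `FilenameCodec` and its methods are hypothetical names for this example and do not describe the library's actual API or what r40 implements.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class illustrating "UTF-8 by default, overridable by the user".
final class FilenameCodec {

    // Proposed default; callers may replace it.
    private Charset charset = StandardCharsets.UTF_8;

    void setCharset(Charset charset) {
        if (charset == null) {
            throw new IllegalArgumentException("charset must not be null");
        }
        this.charset = charset;
    }

    Charset getCharset() {
        return charset;
    }

    // Converts a filename to the bytes sent to the server.
    byte[] encode(String name) {
        return name.getBytes(charset);
    }

    // Converts bytes received from the server back to a filename.
    String decode(byte[] raw) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        FilenameCodec codec = new FilenameCodec();
        System.out.println(codec.getCharset() + ": " + codec.encode("\u00e4").length + " byte(s)");

        // An application that depends on the old behavior restores it explicitly.
        codec.setCharset(StandardCharsets.ISO_8859_1);
        System.out.println(codec.getCharset() + ": " + codec.encode("\u00e4").length + " byte(s)");
    }
}
```

Taking a `Charset` object rather than a charset name would surface an unsupported name when the caller looks it up, not later during a transfer.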
Original comment by cleondris
on 2 Jun 2011 at 8:33
Great, extensive analysis. It matches my experience in practice that UTF-8 is
the best match for SSH, as I never had any issues with it. Please review r40.
Original comment by dkocher@sudo.ch
on 2 Jun 2011 at 10:19
Sorry, I need to revise my own statement about encoding:
"I am not sure, but I think that every possible character encoding in ISO8859_1
has the same encoding in UTF-8 as well."
The above is *wrong*.
While every character in ISO 8859-1 represented by the *byte* XY has the same
*codepoint* XY in Unicode, this does not mean that the byte XY is a valid UTF-8
encoding for the codepoint.
Example: 0xE4 represents the German umlaut "ä" in both ISO 8859-1 and Unicode.
However, to encode the codepoint 0xE4 in UTF-8 one needs the two-byte sequence
"0xC3 0xA4".
In other words, the *codepoints* 0x00-0xFF have the same meaning in ISO 8859-1
and Unicode. However, the *encoding* of those codepoints is not the same in
UTF-8 and ISO 8859-1.
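
The difference is easy to check in Java. The following is a small sketch (class and helper names are made up for the example) that encodes the single codepoint U+00E4 with both charsets and prints the resulting bytes:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class UmlautBytes {

    // Formats bytes as space-separated hex values, e.g. "0xC3 0xA4".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // same codepoint in ISO 8859-1 and Unicode

        // prints "ISO-8859-1: 0xE4"
        System.out.println("ISO-8859-1: " + hex(umlaut.getBytes(StandardCharsets.ISO_8859_1)));

        // prints "UTF-8:      0xC3 0xA4"
        System.out.println("UTF-8:      " + hex(umlaut.getBytes(StandardCharsets.UTF_8)));
    }
}
```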
(...and let's ignore for the moment the difference between "ISO-8859-1" and
"ISO 8859-1" aka. ISO/IEC 8859-1 (see the difference?)).
What is the consequence? Well, if we switch the default encoding from Java's
"ISO8859_1" to "UTF-8", then we will break many applications that rely on that
default behavior.
One could demonstrate the problem by connecting to an SFTP server which encodes
filenames in Windows-1252 (cp1252) and has filenames that contain German
umlauts.
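
The effect can also be simulated without a server. In this sketch (names are hypothetical) the raw bytes stand in for a filename as such a server would send it. For this particular name cp1252 and ISO 8859-1 produce identical bytes, since the two differ only in the range 0x80-0x9F.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class MismatchedDecodeDemo {

    // Decodes raw filename bytes; malformed input is replaced, not reported.
    static String decodeAs(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "B<a-umlaut>r.txt" as a cp1252 / ISO 8859-1 server would send it.
        byte[] onTheWire = { 0x42, (byte) 0xE4, 0x72, 0x2E, 0x74, 0x78, 0x74 };

        String oldDefault = decodeAs(onTheWire, StandardCharsets.ISO_8859_1);
        String newDefault = decodeAs(onTheWire, StandardCharsets.UTF_8);

        System.out.println("decoded as ISO-8859-1: " + oldDefault);
        System.out.println("decoded as UTF-8:      " + newDefault);
        System.out.println("contains U+FFFD:       " + (newDefault.indexOf('\uFFFD') >= 0));
    }
}
```

The lone byte 0xE4 is not a valid UTF-8 sequence here, so the decoder substitutes the replacement character and the original filename can no longer be recovered from the decoded string.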
Original comment by cleondris
on 13 Jun 2011 at 3:11
Is it possible to programmatically change the encoding somehow?
Original comment by stures.s...@gmail.com
on 25 Aug 2011 at 1:26
Original issue reported on code.google.com by
dkocher@sudo.ch
on 28 May 2011 at 9:39