fsfe / reuse-docs

REUSE recommendations, tutorials, FAQ and specification
https://reuse.software

Specify encoding of .license files #106

Closed: mxmehl closed this 1 year ago

mxmehl commented 2 years ago

Fixes #73

I took the very simple suggestion by @kirelagin. @silverhook had some concerns regarding UTF-8, but if I understood correctly, they were cleared up. Please correct me if I'm wrong :)

silverhook commented 2 years ago

The issue with Unicode in general was that, in the embedded sphere, developers tend to use ASCII encoding to save space.

Personally, I much prefer Unicode, but it might make sense to check with the embedded community if this messes anything up for them. (and if so, we’d need to weigh the two)

mxmehl commented 2 years ago

True, but how about what has been said in the issue?

UTF-8 is an ASCII-compatible encoding (a superset of ASCII where every byte value that is allowed in ASCII means the same thing in UTF-8), so every ASCII text file is automatically a valid UTF-8 text file.

Wouldn't then implementors be able to use ASCII encoding and still comply with the spec?
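For illustration (not part of the original thread), here is a minimal Python sketch of that superset property; the copyright line is a hypothetical example:

```python
# Every valid ASCII byte sequence is also valid UTF-8, so a pure-ASCII
# .license file satisfies a UTF-8 requirement byte-for-byte.
ascii_bytes = "Copyright 2017 Matija Suklje".encode("ascii")

# Decoding the same bytes as ASCII and as UTF-8 yields identical text.
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```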

silverhook commented 2 years ago

I know enough about character encoding to know that Unicode includes all of ASCII, so it should probably fall back gracefully.

So I did some tests on my machine to see what happens if I save BSD-4-Clause with my name as the copyright holder and encode it differently. But take this with a huge pinch of salt, as I’m nowhere near either an encoding or embedded expert.

The results seem to be that:

So, given that, I’d say at least UTF-8 should not make things more complicated.

The test cases were:

- converted to ASCII with iconv:
- UTF-8, no non-ASCII chars (= Matija Suklje):
- UTF-8, with special chars (= Matija Šuklje):
- UTF-16, no non-ASCII chars:
- UTF-16, with special chars:
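A rough Python sketch of this kind of encoding test (the copyright line is a placeholder, not the original BSD-4-Clause file):

```python
# Try encoding the same line under several encodings and compare sizes.
line = "Copyright 2017 Matija Šuklje"  # contains the non-ASCII "Š"

for codec in ("ascii", "utf-8", "utf-16"):
    try:
        data = line.encode(codec)
        print(f"{codec}: {len(data)} bytes")
    except UnicodeEncodeError as err:
        print(f"{codec}: cannot encode ({err.reason})")
```

With the special character, ASCII fails outright, UTF-8 grows by only a single byte over the ASCII-only variant, and UTF-16 roughly doubles the size.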


P.S. This webpage also seems interesting on this topic: http://utf8everywhere.org/

kirelagin commented 2 years ago

Hey everyone, so, I think let’s first fix the terminology in order to avoid any possible confusion.

Unicode is not an encoding, it is just a huge list of characters (codepoints), so basically it’s just a collection of all symbols that computers can work with (like, letters, digits, punctuation, hieroglyphics, emojis, etc.). There is really no alternative to it, so we are always implicitly talking about Unicode – if a character is in Unicode, you can use it in a computer; if it is not – then you can’t.

Now, Unicode is a list of “abstract” characters. The real question is how to represent those characters inside a computer, because computers want bytes (or rather bits). That’s where encodings come into play as they specify how to serialise a Unicode character into a sequence of bytes and deserialise it back. Some encodings cover the entirety of Unicode (e.g. UTF-8), some only cover some subsets (e.g. ASCII only supports digits, latin letters, and some control characters – a total of 128 possible characters, much less than all of Unicode).
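A quick Python sketch of that distinction (illustrative, not from the original comment):

```python
# A Unicode codepoint is an abstract number; an encoding maps it to bytes.
ch = "Š"
print(hex(ord(ch)))            # codepoint U+0160 -> 0x160

print(ch.encode("utf-8"))      # b'\xc5\xa0' (two bytes)
print(ch.encode("utf-16-be"))  # b'\x01`' (two bytes, big-endian)

# ASCII covers only 128 codepoints, so this character is not representable:
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```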

Next, file types. When you have some file (a sequence of bytes), in order to be able to interpret it, you need to know what the format of this file is. Think JPEG, whose specification explains how to take the bytes that constitute a JPEG file and interpret them as a picture. An unfortunate truth about text files is that a “text file” is a sequence of Unicode characters, but that is an extremely high-level, abstract definition, since it says nothing about how those characters are actually stored on disk. So, one can’t just say that a REUSE .license file is a text file; that just does not say enough about how to actually read or write such a file. You have to specify the encoding.
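To make that concrete, a small sketch (the file name and contents are hypothetical, not from the repository):

```python
# Write a .license file with an explicit, known encoding.
with open("example.py.license", "w", encoding="utf-8") as f:
    f.write("SPDX-FileCopyrightText: 2017 Matija Šuklje\n")

# A reader that guesses the wrong encoding gets mojibake instead of an error:
with open("example.py.license", encoding="utf-8") as f:
    print(f.read())   # ... Šuklje
with open("example.py.license", encoding="latin-1") as f:
    print(f.read())   # ... Å uklje  (the UTF-8 bytes misread as Latin-1)
```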

Now, that’s true for any text file. Before the Internet, that was not such a big issue, because everyone just had some default encoding configured on their computer and they would not need to think about it, since the files were only read and written by them on their own computer. But with the Internet, we are constantly exchanging text files and suddenly we need to know what each file’s encoding is in order to be able to read it. These days, the de-facto standard encoding for text files is UTF-8 (for many reasons, including historical, but, most importantly, because it is actually the most sensible choice).

You can also read about this here: https://serokell.io/blog/haskell-with-utf8, just to rehash what I wrote above.

Now, embedded systems. There are sort of two questions here.

The first one is software support. If you have, like, vim on your router, it might or might not be able to correctly display and edit UTF-8 encoded text. If it does, then all is fine, if it does not, then you are, ummm, out of luck? Because, like, you do not always control what is in the files that you want to edit. The space saving here can potentially come from the fact that having proper support for UTF-8 required some code (or, more likely, libraries), so some people might want to build their software without UTF-8 support.

The second one is the storage space used by files themselves. When we are talking about UTF-8, that’s not really any concern at all, since UTF-8 is very efficient and, more importantly, it gives you control: if you stay within basic latin characters, the size of your file will end up being as small as reasonably possible.
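A small sketch of that size argument (the string is a placeholder):

```python
# Within basic latin, UTF-8 costs one byte per character, same as ASCII;
# UTF-16 spends at least two bytes per character regardless.
ascii_only = "Copyright 2017 Matija Suklje"

print(len(ascii_only.encode("ascii")))      # 28
print(len(ascii_only.encode("utf-8")))      # 28 (identical bytes)
print(len(ascii_only.encode("utf-16-le")))  # 56
```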


To sum up.

  1. The specification for .license files must say what encoding is used for the text files, since otherwise it is ambiguous and those files are borderline useless – to successfully read them, one needs to know the encoding.
  2. The choice is really between ASCII and UTF-8. If we specify ASCII, we are limiting the content of .license files to the 128 Unicode characters that are representable in ASCII. If we specify UTF-8, we allow all Unicode characters at the cost of those files potentially not being fully readable on systems where the software has no support for UTF-8 for whatever reason.

Lastly, the following is true: for any Unicode character, if it is representable in ASCII, its encoding in ASCII is guaranteed to be the same as its encoding in UTF-8. This provides a degree of backward-compatibility: if your UTF-8-encoded file only contains characters from the subset representable in ASCII, then you can work with it using software that does not know anything about UTF-8 and assumes ASCII encoding.
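The same guarantee, as a short Python sketch (the tag line is just an ASCII-only example):

```python
# A UTF-8 file that happens to contain only ASCII characters is,
# byte for byte, also a valid ASCII file.
text = "SPDX-License-Identifier: GPL-3.0-or-later"

utf8_bytes = text.encode("utf-8")
assert utf8_bytes == text.encode("ascii")   # identical byte sequences
assert utf8_bytes.decode("ascii") == text   # ASCII-only readers still work
```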

This whole embedded systems concern does not really make much sense to me, since we are talking about text files here, and, like, embedded system developers do not have complete control over the text files that will be present/used in their systems, so one way or another, almost certainly there will be some files containing non-ASCII characters in those systems. That’s just... not an issue, because one can simply avoid editing those files on embedded systems. But if there is control over the text files, then, indeed, if you simply make sure that your text files do not go beyond the ASCII range, there will be no practical difference between UTF-8 and ASCII encoded text.

I’m sorry my comment ended up being so long, but I just wanted to clarify the situation for everyone once and for all, since any concerns over the use of UTF-8 are, frankly, very frustrating to me, at least because that is what is used in practice anyway.

silverhook commented 2 years ago

Well put, @kirelagin. I glossed over some details (and you clearly know more than I do as well). I agree that I don’t see a compelling reason to choose ASCII (or UTF-16 or UTF-32) over UTF-8 in practice.

But it is great to be equipped for a possible comment (again) from the embedded community if it comes to that. I think we now understand enough to both 1) confidently demand UTF-8 and 2) defend that position.