docker / cli

The Docker CLI
Apache License 2.0
4.84k stars 1.91k forks source link

For units of digital information, IEC (ISO/IEC 80000) values are used with SI prefixes (incompatible!) #4630

Open Andrew15-5 opened 10 months ago

Andrew15-5 commented 10 months ago

Description

I'm a "hunter" of the legacy SI suffixes (which are unfortunately used till this day for size of digital information). I've noticed that when pushing a Docker image to registry, "kB" and "MB" prefixes were used to indicate the size of layers that are being uploaded. Thankfully, kB is used instead of non-existing KB (some software does use KB), which means that all the units are SI units (i.e., "kilo-", "mega-" etc.).

I investigated the repo a bit, and although I didn't find a place/function where bytes are converted to human-readable units, I did find a few lines of code scattered across multiple files where conversion is made out of 1024 (things like 1024 * 1024). This means that the values that are being computed are multiples of 2 (or multiples of 2¹⁰). Such values are prefixed with binary prefixes that are also can be called IEC prefixes. See: https://en.wikipedia.org/wiki/Binary_prefix.

Basically, the values (numbers) are computed as a binary/IEC values (multiples of 2), but prefixes are printed/shown as a decimal/SI prefixes (multiples of 10). I think I made it clear that this is wrong, you can't mix them together.

Now, since 1024 is used (everywhere in this repo) for computing values, I think we can all agree that this is the right (and only) way to compute such values (I use 1024 too). When using SI units, the bytes must be divided by 1000, when using IEC units — by 1024. So, 1 kB (kilobyte) is 1000 B (bytes) and 1 KiB (kibibyte) is 1024 B (bytes). And here we are finally arriving to the main problem: kB are used instead of KiB, MB — instead of MiB and so on.

TL;DR is that units of size must be fixed/replaced by the correct IEC units. Not taking into consideration the fact that SI units should/must not be used at all regardless, the main issue is that SI units are confusing (and therefore are legacy units)! You can't tell if 1 MB is 1000² bytes or 1024² bytes, because a lot of people were taught that 1 kilobyte is 1024 bytes (majority most likely doesn't even know what kibibyte is!). IEC prefixes solve this confusion: if you see 1 MiB, you know for a fact that it is 1024² bytes.

The rules also state that values and unit must be separated by a (white)space, but in the Docker CLI no spaces are used. There is a lot of software that uses IEC prefixes, I can share my list (of such software) that I recently started (which will be expanded over time). Each project completely follows the rules (space and correct IEC prefixes):

In every project except git (I didn't even look into the source code, since there is no need) a single convert function is used to convert bytes to an appropriate unit. So the solution was simple in each and one of them. With Docker, it looks like much more work must be done to achieve the goal. Therefore, I can't provide any suggestions on how to go about solving the problem, but I can contribute. I guess we just have to find every place in the codebase where units are printed and fix them.

desmond3th commented 10 months ago

I want to help with this issue, but I'd like to have some guidance on how to approach it effectively

Andrew15-5 commented 10 months ago

Since I've only seen the codebase once and (I don't know Go), I assume that there are a lot of places that must be fixed. And based on this assumption I only can purpose an easy approach which could take a while (or not), and also we can instantly start fixing.

Just find places where units are printed/composed (including code comments, if exist) and start fixing. This will ensure that some progress is already being made, and if a more effective approach will arise, we will switch to it instead.

Andrew15-5 commented 10 months ago

Here is my workaround/hack for listing images with correct units (works on the most widespread image sizes: MB and GB):

alias dimgs='docker images | sed -E '\''s/([MG])B$/ \1iB/'\'
thaJeztah commented 9 months ago

Thank for reporting.

Yes, there's a good amount of mixing/mashing of these units, and I think the CLI uses the wrong units in various places; for example, docker image ls shows GB (capital G), but uses metic values;

docker image ls
REPOSITORY                        TAG       IMAGE ID       CREATED        SIZE
docker-dev                        latest    d9bad5aa751e   2 days ago     2.02GB
curl -s --unix-socket /var/run/docker.sock http://localhost/v1.43/images/json | jq .[].Size
2015181014

(lower / uppercase has been a long used alternative for kiB vs kB)

What's adding to the confusion is that other command-line tools (such as ls) don't use metric values; taking a random file in my working directory;

ls -l golist.json
-rw-r--r--  1 thajeztah  staff  3079830 Dec 14 17:02 golist.json

ls -lh golist.json
-rw-r--r--  1 thajeztah  staff   2.9M Dec 14 17:02 golist.json

So we should definitely look at consistency, but we may have to be careful changing things, as there may (likely will) be users depending on the current behavior (even if faulty).

Andrew15-5 commented 9 months ago

What's adding to the confusion is that other command-line tools (such as ls) don't use metric values

In https://github.com/gokcehan/lf/issues/1098 we have noticed that some popular/standard programs output different values (not just two). This partially can be explained by the fact that many programs have existed long before SI units were actually deprecated (probably).

Related, but not important.

You think that `ls` is making things confusing? Ha! Let's look at my ever expanding list of programs/apps that use SI (either real or IEC-like) units: - Transmission - Flatpak - docker - hub.docker.com - cargo - pip - Telegram - mpv - accrescent - wget - element.io - firefox - tampermonkey - Google Play - Discord - GitHub - typst.app You can throw in here YouTube (stats for nerds) and Yandex.Disk (I spoke with Yandex support about changing the units, but only time will tell if they are actually going to change them, because they **did** say that they will consider this matter in the future). (I intend going through the list, by opening tickets/creating PRs for every item, but I don't have time for this right now.) Although, this list doesn't mean that there are no good examples. Here is a list of apps with default IEC units so far (some can change between different units): - [git](https://git-scm.com/) (rounds to 2 digits, with whitespace between value and unit) - [crates.io](https://crates.io/) (rounds up to 2 digits, with whitespace between value and unit) - [nvim-tree.lua](https://github.com/nvim-tree/nvim-tree.lua) (rounds up to 2 digits, with whitespace between value and unit) - [syncthing.net](https://syncthing.net/) (rounds up to 2 digits, with whitespace between value and unit) - [flathub.org](https://flathub.org/) (rounds up to 2 digits, with whitespace between value and unit) - [gradle](https://gradle.org/) (rounds to 1 digit, with whitespace between value and unit) - [btop](https://github.com/aristocratos/btop) (rounds to 2 digits, with whitespace between value and unit) - [nix](https://nixos.wiki/wiki/Nix_package_manager) (rounds to 1 digit, with whitespace between value and unit) - [okular](https://okular.kde.org/) (rounds up to 1 digit, with whitespace between value and unit) - [rustup](https://rustup.rs/) (rounds up to 1 digit, with whitespace between value and unit) - [typst](https://typst.app/) (rounds up to 1 digit, with whitespace between value and unit) I just noticed, that I already gave (an outdated version of) this list in the OP, but now it's much bigger.

we may have to be careful changing things, as there may (likely will) be users depending on the current behavior (even if faulty).

This is obviously one of the side effects that we will face, but IMO, just making a "BREAKING CHANGE" changelog note about making information units consistent across the Docker CLI should be more than enough. Besides, even though with such a big community where some people will depend on current units, I doubt that there are many such examples.

thaJeztah commented 9 months ago

Related, but not important.

Adding to the above; macOS uses metric valuees in its GUI, binary values on the command-line; taking the same example file as my previous comment, and viewing it in the "finder";

Screenshot 2023-12-18 at 09 26 26 Screenshot 2023-12-18 at 09 27 10

I 💯 agree with the general issue (inconsistency, as well as some being wrong). I think some of these cases were "somewhat" known, but have been lingering around for a long time (which made it more difficult to correct course). We need to carefully evaluate all places where sizes are presented, and verify what units are used (and if their presentation is correct).

Some of that may be more involved as there's been some bad choices in the past where "presentation" made its way into the API, and the API returning "pre-formatted" values. There may also be parts where calculations / presentation crosses project boundaries (docker cli -> engine API -> containerd API) that need to be looked at.

This is obviously one of the side effects that we will face, but IMO, just making a "BREAKING CHANGE" changelog note about making information units consistent across the Docker CLI should be more than enough.

In an ideal world; yes! And technically, the CLI output is not part of a formal API (the API itself is), so "any change" would be technically allowed. Unfortunately, experience taught us that users tend to ignore release notes, and with 10's of millions of monthly installs, we need to be careful making such changes.

All that said; we are working on a large technical change in the Docker Engine (switching storage to use the containerd iamge store / snapshotters), which also involves various changes related to "Size" ("Size" is more complicated with the containerd image store integration, as there's multiple sizes involved (multiple architectures of an image; both "extracted" and "compressed" data). The current presentation of a single "Size" may no longer be sufficient for all scenarios, so we will need more granular information to capture this. Similar changes are needed for other parts of the API (and CLI), which will involve "breaking" changes.

We are planning to group those changes, and make a "one-time" big breaking change (instead of granular "paper-cut" changes); that would be a good opportunity to include these changes related to "Size" presentation as a whole.

Andrew15-5 commented 9 months ago

Adding to the above; macOS uses metric valuees in its GUI, binary values on the command-line; taking the same example file as my previous comment, and viewing it in the "finder";

Since we are talking OSes, when I used Windows for 5 seconds (not even mine, I'm a Linux enjoyer btw) I noticed that it (as with many other examples) also uses "all caps + no i" representation. But by simply using numfmt --to-=iec-i I found out that the values are in fact IEC ones, so it also "suffers" from the same problem.

Size: 1.41 GB (1 519 392 000 byte)
On disk: 1.41 GB (1 520 369 664 byte)

There may also be parts where calculations / presentation crosses project boundaries (docker cli -> engine API -> containerd API) that need to be looked at.

Oh, I'm almost certain there are some instances. But this also spreads to other Docker projects, like Docker Hub and probably others. But since I only really use CLI, I opened an issue over here (at first I was confused as to where should I open it).

We are planning to group those changes, and make a "one-time" big breaking change (instead of granular "paper-cut" changes); that would be a good opportunity to include these changes related to "Size" presentation as a whole.

I fully agree that a lot of small changes won't do much good. So a single big one would make more sense. One concern I have here is a possibility of delay if one or another thing will be complete much earlier than the other. In other words, if fixing units will take longer, then it can prolong a release of the storage switch update. If there are any concrete deadlines.

Andrew15-5 commented 8 months ago

I want to add a small correction. Apparently using IEC values with SI prefixes is not 100% incorrect or incompatible, I'm talking about JEDEC standard (https://en.wikipedia.org/wiki/JEDEC_memory_standards). IIUC, if there was no JEDEC standard, then we probably never would've had "2 meanings" for SI units. But they did it "prior to the 1999 IEC standard", so it is outdated. Nevertheless, it's the dominating way majority of programs are written (others use SI/IEC). And I'll try and fix that in the next years. :)