dotnet / dotnet-docker

Docker images for .NET and the .NET Tools.
https://hub.docker.com/_/microsoft-dotnet
MIT License

Proposal: Enable globalization invariant mode for all runtime images #1877

Closed richlander closed 1 year ago

richlander commented 4 years ago

Proposal: Enable globalization invariant mode for all runtime images

We propose to reduce the size of runtime images by ~12MB (compressed; ~31MB uncompressed) by no longer installing the ICU package in Debian- and Ubuntu-based images and instead relying on globalization invariant mode by default. On Linux, the .NET runtime and libraries depend on ICU for globalization behaviors (sorting, time zones, currency symbols, date formats, ...). We already enable globalization invariant mode and do not install ICU in Alpine runtime images.

We propose to (A) take advantage of this size improvement for Debian and Ubuntu images, and (B) make .NET images symmetric across Linux distros. In short, we like what we did for Alpine, but no longer want Alpine to be a special case.

All Linux-based .NET SDK images will continue to contain ICU. For example, Alpine .NET SDK images contain ICU, even though Alpine runtime images do not. As a point of policy for SDK images, we value UX over size, and intend for SDK images to provide a "batteries included" model. This is, in part, because adding packages to SDK images is more inconvenient for users in some scenarios. This is a tradeoff, as it adds an unfortunate point of asymmetry between runtime and SDK images, but one that we believe is warranted.

We made an analogous change in https://github.com/dotnet/dotnet-docker/pull/1848 where we removed a Debian- and Ubuntu-specific layer that Alpine did not have. After that change, Debian and Ubuntu SDK images are smaller, and the layering across .NET SDK images for Linux distros is now the same.

Context

As part of the .NET Core 2.0 release, we created globalization invariant mode. This feature, when enabled, removes any dependence on external libraries for globalization information by using the invariant behavior for all globalization-sensitive APIs (like sorting, understanding time zones and writing currency symbols). For many applications, this mode is a win because they are not dependent on globalization concepts and behaviors.

This new mode was developed at the same time as we added support for the Alpine Linux distro. The Alpine project is known for publishing small container images, and we wanted to do everything we could to make Alpine-based .NET Core images small. We decided to take advantage of globalization invariant mode and not install ICU in Alpine images by default, and instead let users who need globalization enable it for themselves. This seemed like a great trade-off at the time, and we haven't heard any negative feedback on it. We have however heard that many people are happy with .NET Alpine images, and have seen their usage grow considerably.
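
For readers who need globalization on Alpine, the opt-in described above can be sketched as a Dockerfile fragment. This is a minimal sketch under stated assumptions, not an official recipe: the base tag is the Alpine runtime image that appears in the size listing in this issue, `icu-libs` is assumed to be the Alpine package that provides ICU, and depending on the Alpine version additional ICU data packages may be needed.

```dockerfile
# Sketch: opt back in to globalization on an Alpine runtime image.
# The tag below is taken from the size listing in this issue.
FROM mcr.microsoft.com/dotnet/core/runtime:3.1-alpine

# Install ICU. The package name is an assumption about the Alpine repositories.
RUN apk add --no-cache icu-libs

# Alpine runtime images enable invariant mode by default; override it here.
ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false
```

Apps that never touch culture-sensitive APIs can skip both steps and keep the smaller image.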

Size details

We built the dotnetapp sample a few different ways and published the results at richlander/dotnetapp. The tags listing provides the compressed sizes. The same images are displayed below, with uncompressed size information.

rich@mazama:/mnt/d/git/dotnet-docker/samples/dotnetapp$ docker images
REPOSITORY                              TAG                            IMAGE ID            CREATED             SIZE
richlander/dotnetapp                    debian                         c83a4ad65881        54 minutes ago      190MB
richlander/dotnetapp                    latest                         c83a4ad65881        54 minutes ago      190MB
richlander/dotnetapp                    alpine-globalization-enabled   1aa6fb6af249        2 hours ago         119MB
richlander/dotnetapp                    alpine                         75670cc0cd25        2 hours ago         87.3MB
mcr.microsoft.com/dotnet/core/runtime   3.1-alpine                     50c357d06fee        5 days ago          87.2MB
alpine                                  latest                         f70734b6a266        5 days ago          5.61MB

Legend:

stephentoub commented 4 years ago

My main concerns would be:

danmoseley commented 4 years ago
tarekgh commented 4 years ago

I don't think this will be a good idea. From what I am seeing, almost 90% of users will need to turn off the invariant mode and install the needed ICU packages. I have seen issues where users had invariant mode turned on and ran into problems that were not easy for them to diagnose.

richlander commented 4 years ago

I get all of this feedback. However, Alpine usage is growing. What do we do when half of our pulls are Alpine? Would that change the dynamic?

We don't have data on whether people use Alpine images as is or add ICU on top. This is the best we have: https://github.com/search?q=ENV+DOTNET_SYSTEM_GLOBALIZATION_INVARIANT+false&type=Code

My motivation is to enable pay-for-play, at the possible expense of extra work and some confusion. Is 12MB worth it? Yes. This win has been valuable for Alpine, and I no longer want Debian and Ubuntu to have asymmetry with Alpine. That asymmetry isn't justified.

stephentoub commented 4 years ago

What do we do when half of our pulls are Alpine? Would that change the dynamic?

I'm missing why that's relevant. There are many other differences between the distros, no?

jkotas commented 4 years ago

I believe we would need to make changes to invariant mode to make this viable:

Alpine usage is growing.

Is there a way to get distribution between English vs. non-English speaking countries for usage of our Alpine images? My hypothesis is that our Alpine images are used relatively less in non-English speaking countries.

richlander commented 4 years ago

I'm missing why that's relevant. There are many other differences between the distros, no?

True. But this isn't one of them. It's an arbitrary choice we made for one distro.

My hypothesis is that our Alpine images are used relatively less in non-English speaking countries.

Great thought. I'll see if we have any information that can at least point us in that direction.

First, producing container images for a platform is very hard. Since Docker has a single line of inheritance, you have to make a variety of trade-offs. In general, it makes sense to decide up front what you value and then use that value orientation for every single decision. Otherwise, you end up with something that has a bunch of interesting characteristics but is "blah" in aggregate. Clearly, we've decided that size is our #1 metric.

In short, you have the following three choices, pick two:

We value those attributes in that order.

From what I am seeing, almost 90% of users will need to turn off the invariant mode and install the needed ICU packages.

This is a great point. Even if 90% of users needed invariant mode disabled, I'd still have this plan. I'm focused on building a competitive product that makes .NET a great choice for those 10% of users that need the smallest size possible.

I think of this topic as being directly connected to Jan's form factors doc. Based on the way Docker works, we certainly could create multiple sets of images that effectively implement multiple form factors, but we're not going to. We're going to do one, and it's going to focus on getting images smaller and smaller.

We're going to make this change. We just need to decide when.

Let's make invariant mode better. I hadn't thought of wasm being aligned with invariant mode.

jkotas commented 4 years ago

We value those attributes in that order.

We strike a balance between these attributes by having Ubuntu-, Debian-, and Alpine-based images. Why do we have all 3 instead of just 1? I believe it is because the Ubuntu and Debian ones are easier to use than the Alpine one.

GrabYourPitchforks commented 4 years ago

I have seen issues where users had invariant mode turned on and ran into problems that were not easy for them to diagnose.

@tarekgh Can you give some examples? Are these things we'd be able to work around within the runtime itself? @jkotas had mentioned allowing case conversion of non-ASCII characters. If we carried this data it would only be a few KB. But if common scenarios require customers to install ICU anyway then I have a hard time justifying us carrying around our own copy of the data.

@richlander Is this part of a larger effort to shrink size-on-disk for the Alpine distro? I've had some offline conversations with folks re: having "fast" (but large) and "small" (but slower) versions of our code paths. The idea is that we'd ifdef in whichever one was appropriate for the target platform. I haven't done significant analysis on how much footprint this would save overall so I don't know if it's worth pursuing.

richlander commented 4 years ago

Why do we have all 3 instead of just 1?

Ha! I wish we could have just one.

The short version is this:

It's amazing to reason about pull behavior across Docker and APT, as two examples. The patterns are super different and the OSes people prefer (in aggregate) are super different. And what people value in those modalities is super different. For example, we see pretty much constant pulls in Docker, day in, day out. For APT, we see a huge surge of pulls in the first 36 hours after a release, and then back to a much lower constant set of pulls after about 5 days.

richlander commented 4 years ago

Is this part of a larger effort to shrink size-on-disk for the Alpine distro?

No, it is specifically not that. We already did that, starting with Alpine with .NET Core 2.1. This is about applying that same win to Debian and Ubuntu.

@jkotas had mentioned allowing case conversion of non-ASCII characters. If we carried this data it would only be a few KB. But if common scenarios require customers to install ICU anyway then I have a hard time justifying us carrying around our own copy of the data.

ICU is 30MB+ (uncompressed). It's worth talking about ways to avoid it. We don't necessarily need to ship those data files in the runtime. We could download them for the Docker scenario. We download plenty of things today, at docker build time, and are happy to add more if there is value.

Also, we shouldn't be making optimization choices around small numbers of KBs to the product in isolation. On the runtime team, we blow those away with our crossgen choices (in either direction). For example, we used partial crossgen in 3.0 to save about 10MB in container images. We can pay for your data file cost with change we find behind the couch. We have a bunch more crossgen work planned for 5.0. We don't have any insight on size impact yet.

richlander commented 4 years ago

@GrabYourPitchforks -- It would be awesome to have this information:

marek-safar commented 4 years ago

I believe we would need to make changes to invariant mode to make this viable

It'd be nice to make invariant mode more developer-friendly. At the same time, we are also talking with @danmosemsft's team about how to make globalization support more configurable, which could help here as well. The current setup, where you get either no globalization or full-blown ICU, is not enough for the growing number of form factors and scenarios .NET is targeting.

tarekgh commented 4 years ago

To answer @GrabYourPitchforks question:

@tarekgh Can you give some examples? Are these things we'd be able to work around within the runtime itself?

One example: a user reported that resource lookup was not working on one of their machines while working fine on others. The user had no idea about invariant mode and didn't know what was wrong. Resource lookup depends on the culture parent chain, which is not provided in invariant mode, so the lookup fails to find the right resources.

GrabYourPitchforks commented 4 years ago

@richlander anything that involves non-linguistic case comparison will work. Consider the following examples.

// In Invariant mode, returns "MAñANA"  <-- note the 'ñ' was left unchanged
// Under ICU / NLS, returns "MAÑANA"
// Under invariant mode with our own casing data, returns "MAÑANA"
string result = "mañana".ToUpperInvariant();

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns true
bool areEqual = string.Equals("mañana", "MAÑANA", StringComparison.OrdinalIgnoreCase);

By carrying our own casing data, we can determine that 'ñ' and 'Ñ' are actually the same character (with different casing). This means that ToUpperInvariant and string.Equals(..., OrdinalIgnoreCase) will behave as expected.

This does not include support for normalization or linguistic comparisons. Consider the following examples.

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns false
bool areEqual = string.Equals("ss", "ß", StringComparison.InvariantCulture);

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns false
bool areEqual = string.Equals("encyclopaedia", "encyclopædia", StringComparison.InvariantCulture);

InvariantCulture is a linguistic comparison, which means that it needs to account for the fact that "ss" and "ß" are semantically identical; as are "ae" and "æ". Our casing data does not handle these conditions.

For servers this is generally OK. Most server applications deal with things like identifiers, usernames, filenames, paths, etc.; so they should only ever be using Ordinal or OrdinalIgnoreCase, not any other StringComparison. (ToUpperInvariant and ToLowerInvariant would likewise work. Despite their names, their behavior maps roughly to OrdinalIgnoreCase and has nothing whatsoever to do with CultureInfo.InvariantCulture. It's confusing but it's what we're stuck with.)

For clients this is a bit more problematic. A client app would want localization and would want culture-aware textual analysis. If I visit https://en.wikipedia.org/wiki/Encyclopedia and CTRL-F and type "encyclopædia" into my browser's search box, I want it to find both "encyclopædia" and "encyclopaedia" on the page. Something like this would require the full power of ICU / NLS.

Servers that need to display data in a localized fashion also fall under this latter category. If the visitor is browsing from the United States, I want to display pricing using the U.S. currency symbol ('$') and with decimals formatted in a manner familiar to a U.S. audience. If the visitor is browsing from Japan, I want to display pricing using the yen currency symbol ('¥') and with digits formatted appropriately. If you need this kind of localization data, you'll require the full power of ICU / NLS.

Does this help clarify the scenarios a bit?

MichaelSimons commented 4 years ago

This isn't being considered for 5.0 but is something we are interested in driving post-5.0.

tarekgh commented 4 years ago

Linking to https://github.com/dotnet/runtime/issues/37349 for awareness of the difference in IDN functionality in invariant mode and the potentially wrong behavior in parts of the networking stack that depend on IDN.

richlander commented 4 years ago

Related (TZData): https://twitter.com/funcshawnal/status/1271825184589152256?s=21

tarekgh commented 4 years ago

Note that having TZData is not related to enabling globalization invariant mode. TZData is installed independently to get TZ support.
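
To illustrate that independence, here is a minimal Dockerfile sketch that adds time zone data to an Alpine runtime image while leaving invariant mode untouched. The base tag is the one from the size listing in this issue, and `tzdata` is assumed to be the Alpine package name; neither is confirmed elsewhere in this thread.

```dockerfile
# Sketch: add time zone data without changing globalization settings.
FROM mcr.microsoft.com/dotnet/core/runtime:3.1-alpine

# Install the time zone database. The package name is an assumption.
RUN apk add --no-cache tzdata

# DOTNET_SYSTEM_GLOBALIZATION_INVARIANT is left at the image default,
# so ICU is still absent and invariant mode stays enabled.
```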

richlander commented 4 years ago

Great point. It's not directly related, as you say. My point is that it is a near-neighbor problem, with similar characteristics and UX.

I'd like to start an early 6.0 proposal along the lines of Jan's comment. We should include tzdata in that.

I was just talking to the wasm team about this. They expressed that they are struggling with ICU (significantly more than the Docker scenario) and would appreciate a better solution for 6.0 that doesn't require ICU.

Cool?

tarekgh commented 4 years ago

I was just talking to the wasm team about this. They expressed that they are struggling with ICU (significantly more than the Docker scenario) and would appreciate a better solution for 6.0 that doesn't require ICU.

Is there more info here? What are they struggling with in ICU? In general, it is good to start on a 6.0 proposal now, as you mentioned, so we have enough time to react to the needed changes.

Yes, cool :-)

richlander commented 4 years ago

Same reason ... size impact. Size constraints of wasm are like 10x more restrictive than containers. More concretely, the wasm team is slicing and dicing ICU itself to reduce size. This isn't a great model. Mono libraries have NLS-style in-product tables/data (actually stale data copied from ICU), but the wasm project is leaving that behind since it is moving to corefx.

mthalman commented 2 years ago

@richlander - This has been dormant for a while now. Any thoughts on this for .NET 8?

tarekgh commented 2 years ago

+@steveisok @lewing to advise whether they are still running into the size problems.

@mthalman are you running into some issue because of the size?

steveisok commented 2 years ago

+@steveisok @lewing to advise whether they are still running into the size problems.

We pull in icu from dotnet/icu, so I do not think our workloads would be negatively impacted.

@lewing ?

mthalman commented 2 years ago

This would also be impacted by whatever outcome we have from https://github.com/dotnet/dotnet-docker/issues/4162. If we have a distroless Alpine offering, then we may want to make different choices with the full version of Alpine, like including icu.

richlander commented 1 year ago

We're no longer pursuing this.