Use large MTU by default for pasta-backed rootless custom networks

dgibson commented 1 month ago

Feature request description

By default pasta uses an MTU of 65520 bytes in containers it backs. This is an important strategy to improve TCP throughput and reduce load, by reducing the number of system calls. pasta is able to coalesce individual TCP packets, allowing it to take advantage of the large local MTU, even if the full path has a lower MTU (which will be typical across the internet).

However, when pasta is used for a rootless custom network, a Linux bridge sits between pasta and the container(s) it's supporting. Unless overridden in the podman configuration this bridge will have the default MTU of 1500, negating this performance strategy of pasta.
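
As a rough illustration of the scale involved, the Python sketch below counts the TCP segments a 1 GiB transfer needs when the smallest MTU between pasta and the container is 65520 versus 1500. It assumes IPv4 with 40 bytes of IP and TCP headers and no TCP options; the function names are invented for this example, and the output is plain arithmetic, not a throughput measurement.

```python
# Assumption: IPv4 (20 bytes) + TCP (20 bytes) headers, no TCP options.
HEADER_BYTES = 40

def effective_mtu(hop_mtus):
    """The usable MTU on a local path is the smallest MTU of any hop."""
    return min(hop_mtus)

def segments_needed(payload_bytes, mtu):
    """Segments needed if every segment except possibly the last is full-size."""
    mss = mtu - HEADER_BYTES
    return -(-payload_bytes // mss)  # integer ceiling division

payload = 1 << 30  # 1 GiB

# pasta alone: the container sees pasta's 65520-byte MTU directly.
direct = segments_needed(payload, effective_mtu([65520]))
# Custom network: a bridge with the default 1500-byte MTU sits in between.
bridged = segments_needed(payload, effective_mtu([65520, 1500]))

print(f"direct:  {direct} segments")
print(f"bridged: {bridged} segments ({bridged / direct:.1f}x as many)")
```

Each segment costs pasta some fixed per-packet work, so the segment count is a crude proxy for the extra load a 1500-byte bridge imposes.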

Increasing the MTU of the custom network (e.g. with podman network create -o mtu=65520) can significantly improve performance in some situations.

Suggest potential solution

When creating a custom network which uses pasta for external connectivity, podman should default to configuring an MTU of 65520.

This won't help in all cases: if traffic is not coming directly from the container, but from (for example) a tunnel running in the container, the TCP MSS will still be constrained by the tunnel's MTU. Nonetheless a different default will help the common case of TCP traffic originating directly in the container.

Have you considered any alternatives?

The end user can, of course, manually set a large MTU, but that's extra inconvenience.

While, of course, we endeavour to keep performance of pasta good even with smaller MTUs, the large MTU strategy is an important tool which it seems unwise to discard.

Additional context

This limitation came to light amidst discussion of a number of issues occurring in this ticket.

dgibson commented 1 month ago

/cc @Luap99

Luap99 commented 1 month ago

While injecting mtu=65520 as an option is easy, it will break consumers that inspect the networks and expect certain options to be (or not be) there... By doing something like this we broke docker-compose in the past, as it then thinks it always has to recreate the network. I think the podman ansible roles behave similarly, so just adding the option in there by default is not good.

Netavark itself doesn't set an mtu unless it was set in the network config file, and the kernel defaults to 1500 AFAIK. This already causes problems for some users (https://github.com/containers/podman/issues/20009). Therefore adding a generic default_mtu option sounds reasonable to me to solve this, and then we could have rootless default to 65520.

But if we do not want to add it to the network config used by inspect/on disk, then we either have to add a new option to netavark to send the default mtu there, or, before sending the network config to netavark, we copy the config, add the mtu option, and then send it to netavark in c/common/libnetwork. As the latter doesn't require netavark changes, it seems likely the easier option.

Also, there is the question of slirp4netns support. Our slirp4netns code also defaults to 65520, so I guess we would not have to make a difference and could just use the same default there. And we have to consider backwards compat as well: if a user today has mtu 1500 configured for pasta or for slirp4netns and we then default to a higher mtu for the bridge networks all of a sudden, it may negatively impact them.

Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

sbrivio-rh commented 1 month ago

> Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

Third table ("pasta: connections/traffic via tap") here: https://passt.top/passt/about/#performance_1. Those figures are taken, however, with CPU-bound pasta, at much higher transfer rates than what you'd find in common use cases.

The effect on reported CPU load with lower transfer rates is more dramatic than that because CPU load scales much less than linearly. If there's less data available to transfer at a time, we'll use (many) more CPU cycles per byte. But I don't have hard numbers here, yet.

dgibson commented 1 month ago

/cc @Luap99

> While injecting mtu=65520 as an option is easy, it will break consumers that inspect the networks and expect certain options to be (or not be) there... By doing something like this we broke docker-compose in the past, as it then thinks it always has to recreate the network. I think the podman ansible roles behave similarly, so just adding the option in there by default is not good.

Heh.

> Netavark itself doesn't set an mtu unless it was set in the network config file, and the kernel defaults to 1500 AFAIK.

Kernel defaults will depend on the exact interface types, but typically it will be 1500, yes.

> This already causes problems for some users (#20009). Therefore adding a generic default_mtu option sounds reasonable to me to solve this, and then we could have rootless default to 65520.

Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.

> But if we do not want to add it to the network config used by inspect/on disk, then we either have to add a new option to netavark to send the default mtu there, or, before sending the network config to netavark, we copy the config, add the mtu option, and then send it to netavark in c/common/libnetwork. As the latter doesn't require netavark changes, it seems likely the easier option.
>
> Also, there is the question of slirp4netns support. Our slirp4netns code also defaults to 65520, so I guess we would not have to make a difference and could just use the same default there.

I'm guessing you mean it defaults to that MTU with slirp4netns itself? Presumably slirp4netns combined with a custom network will hit the same issue as I'm describing here.

> And we have to consider backwards compat as well: if a user today has mtu 1500 configured for pasta or for slirp4netns and we then default to a higher mtu for the bridge networks all of a sudden, it may negatively impact them.

Hard to see how, but yes, that's possible in principle.

> Overall I would like to see a proper benchmark with throughput/cpu numbers to see how important this is.

It's only a single data point, but a real user reports a fairly noticeable difference here.

Luap99 commented 1 month ago

> Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.

Basically it comes down to not showing the mtu option when you do podman network inspect. Ansible and docker-compose will only recreate resources when something was changed in the config files (ansible calls this idempotency). So if we always add mtu to the options and show it in inspect, then the next time the tool runs it thinks the user changed settings (because mtu is not in their config) and recreates the network, and thus all containers depending on it, which is not wanted. And for docker-compose at least, we cannot even tell the tool to handle this special case, as they only target docker. So podman must behave like docker at the compat API level.

With the default_mtu option in containers.conf, I would not add the option to the actual network config JSON file, thus avoiding the problem of it showing up in inspect.
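
To make the difference concrete, here is a minimal Python sketch of the two approaches. It is not the c/common/libnetwork code; every name in it (DEFAULT_MTU, needs_recreate, inject_into_stored_config, config_for_backend) is hypothetical, and the network config is reduced to a plain dict for illustration.

```python
import copy

DEFAULT_MTU = "65520"  # hypothetical value of a default_mtu setting

def needs_recreate(declared_options, inspected_options):
    """Model of a tool such as docker-compose: any difference means 'changed'."""
    return declared_options != inspected_options

def inject_into_stored_config(network):
    """Rejected approach: write the default into the stored config itself."""
    network.setdefault("options", {}).setdefault("mtu", DEFAULT_MTU)
    return network

def config_for_backend(network, default_mtu=DEFAULT_MTU):
    """Proposed approach: apply the default to a copy sent to the backend only."""
    runtime = copy.deepcopy(network)
    runtime.setdefault("options", {}).setdefault("mtu", default_mtu)
    return runtime

declared = {}  # the user's compose file sets no mtu

# Rejected approach: inspect now shows an option the user never asked for.
stored_a = inject_into_stored_config({"name": "mynet", "options": {}})
print(needs_recreate(declared, stored_a["options"]))

# Proposed approach: the stored config is untouched, the backend gets the mtu.
stored_b = {"name": "mynet", "options": {}}
runtime_b = config_for_backend(stored_b)
print(needs_recreate(declared, stored_b["options"]), runtime_b["options"]["mtu"])
```

Using setdefault on the copy also keeps an explicit user setting (for example mtu 1500) intact, which is relevant to the backwards-compat concern above.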

dgibson commented 1 month ago

> > Ok.. I'm not totally clear on what the difference is between this and the first option you rejected.
>
> Basically it comes down to not showing the mtu option when you do podman network inspect. Ansible and docker-compose will only recreate resources when something was changed in the config files (ansible calls this idempotency). So if we always add mtu to the options and show it in inspect, then the next time the tool runs it thinks the user changed settings (because mtu is not in their config) and recreates the network, and thus all containers depending on it, which is not wanted. And for docker-compose at least, we cannot even tell the tool to handle this special case, as they only target docker. So podman must behave like docker at the compat API level.
>
> With the default_mtu option in containers.conf, I would not add the option to the actual network config JSON file, thus avoiding the problem of it showing up in inspect.

Ok. Seems like a perfectly reasonable approach from my point of view.
