Open jhart1685 opened 3 years ago
Thanks for reaching out and the great summary, @jhart1685!
I browsed a bit around and had a look. Docker is also disabling the keep alives while containerd does not. I need to look a bit closer to check whether changing that in c/image would have other implications.
@mtrmac @rhatdan WDYT?
I’ll try to check in more detail later — just as a very general point, the pkg/tlsclientconfig
was created for compatibility, but we mostly have neither the interest nor the expertise to fine-tune TCP and similar low-level layers, and especially to keep up with the changing best practices.
We probably can’t just blindly drop everything in there; overall, it does seem attractive to decrease the deviation from defaults when we don’t have a specific reason to keep deviating.
Hi @vrothberg @mtrmac - Thanks for looking. I take your point about fine-tuning TCP not being the core function of this project. Likewise, it's not my area of expertise either.
But in this case, we've tracked down one of the causes of the EOFs that make Skopeo unusable in certain situations. It looks like a relatively small modification here (albeit one still to be made carefully and with due diligence wrt testing etc.) could make the client much more tolerant of different server configurations, and provide a big win in terms of reliability.
Oh I’m absolutely not saying that we shouldn’t change the code in that package ever; I mean that it would be nice to avoid making such choices if we are not in a position to maintain them well, i.e. tend to remove code from there if possible / reasonable.
This code https://github.com/containers/image/blob/master/pkg/tlsclientconfig/tlsclientconfig.go#L100-L105
Creates an http transport with a custom dialler, which means that by default, Go will not allow the connection to be upgraded to http2 (this is explained in the documentation of the ForceAttemptHTTP2 option).
Just for the record, that would happen anyway because we always set TLSClientConfig
. So we would have to affirmatively opt into HTTP/2. (And the http
package documentation doesn’t say why TLSClientConfig
disables HTTP/2…)
This, coupled with the
DisableKeepAlives: true
option, means that the resulting connection is susceptible to being chopped quickly by the server. See istio/istio#30577 for background of this..
… and reading https://github.com/istio/istio/issues/30577#issuecomment-772120680 / https://github.com/istio/istio/issues/30577#issuecomment-772705815 , we are talking about an unreliable workaround to a heuristic, not(?) some fundamental property of TCP.
It’s seriously unclear to me why Docker and containerd did / did not make the choices they did. The decision to disable keepalives seems to come from https://github.com/moby/moby/pull/797/commits/cff3b37a61b8da883437eec799b7b852d22538f0 , which does not document any rationale, at best the containing PR vaguely suggests a direction.
Linking for context: #1083 .
Just for the record, that would happen anyway because we always set TLSClientConfig. So we would have to affirmatively opt into HTTP/2. (And the http package documentation doesn’t say why TLSClientConfig disables HTTP/2…)
Hmm - yes, could be that my simplified recreate moved too far away from this code and introduced that different dimension.
… and reading istio/istio#30577 (comment) / istio/istio#30577 (comment) , we are talking about an unreliable workaround to a heuristic, not(?) some fundamental property of TCP.
Yes. I think the server-side configuration to mitigate this is an unreliable workaround. The client modifications being discussed here are more like defensive coding, to harden the code against a variety of server types and configurations.
As of #1867, nothing in c/image disables HTTP keep-alive; so, per the initial motivation,
But the frequency of these EOFs can be much reduced by implementing one of the following options:
that should now work better.
HTTP/2 is still effectively opted out because we set DialContext
.
This code https://github.com/containers/image/blob/master/pkg/tlsclientconfig/tlsclientconfig.go#L100-L105
Creates an http transport with a custom dialler, which means that by default, Go will not allow the connection to be upgraded to http2 (this is explained in the documentation of the ForceAttemptHTTP2 option).
This, coupled with the
DisableKeepAlives: true
option, means that the resulting connection is susceptible to being chopped quickly by the server. See https://github.com/istio/istio/issues/30577 for background of this.. But essentially, if you are doing (for example) a skopeo copy of an image between two registries, and the source registry is hosted behind an istio ingress, the image read is likely to get terminated early with an EOF.I've seen other issues in this repo around EOFs, improving diagnostics and logging of them for example. But the frequency of these EOFs can be much reduced by implementing one of the following options:
ForceAttemptHTTP2
option.