PiotrSikora closed this issue 5 years ago.
For the record, `initial_fetch_timeout` from #6048 mitigates this issue: all listeners start listening after the configured timeout, with any listener still waiting for SDS rejecting connections via the `NotReadySslSocket` transport socket.
This can be tracked in stats by looking at `downstream_context_secrets_not_ready`, e.g. `listener.0.0.0.0_9443.server_ssl_socket_factory.downstream_context_secrets_not_ready`.
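For anyone else chasing this, the counter can be pulled out of the plaintext `/stats` output with a few lines of scripting. A minimal Python sketch, using a hypothetical stats sample (in practice you would fetch the real output with `curl -s 127.0.0.1:9901/stats`):

```python
# Sketch: find per-listener downstream_context_secrets_not_ready counters
# in Envoy's plaintext /stats output. SAMPLE_STATS is a hypothetical,
# trimmed sample; real output comes from the admin endpoint.
SAMPLE_STATS = """\
listener.0.0.0.0_9080.downstream_cx_total: 3
listener.0.0.0.0_9443.server_ssl_socket_factory.downstream_context_secrets_not_ready: 7
listener.0.0.0.0_9443.downstream_cx_total: 7
"""

def secrets_not_ready(stats_text):
    """Return {stat_name: value} for every *_secrets_not_ready counter."""
    out = {}
    for line in stats_text.splitlines():
        name, _, value = line.partition(": ")
        if name.endswith("downstream_context_secrets_not_ready"):
            out[name] = int(value)
    return out

print(secrets_not_ready(SAMPLE_STATS))
```

A non-zero value on a listener's `server_ssl_socket_factory` scope indicates connections being rejected because secrets haven't arrived yet.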
@PiotrSikora where do you see the listener marked as "active"? That might be a bug or something we can expose better in stats/admin/etc.
The 2nd part AFAICT is WAI, at least as expected today. The way Envoy init works is that everything goes through the init process before listeners are started and actually accept connections. This is so that we make sure the server is fully initialized prior to accepting traffic.
We could obviously relax this so that listeners start taking traffic as their dependencies are initialized, but this would be a non-trivial code change. Perhaps open an explicit enhancement ticket on this part?
> @PiotrSikora where do you see the listener marked as "active"? That might be a bug or something we can expose better in stats/admin/etc.
Those listeners are listed under `dynamic_active_listeners` and not `dynamic_warming_listeners` in the `/config_dump`:
```
$ curl -s 127.0.0.1:9901/config_dump | jq -r ".configs[].dynamic_active_listeners[]? | .listener.address"
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9080
  }
}
{
  "socket_address": {
    "address": "0.0.0.0",
    "port_value": 9443
  }
}
```
Is there a better place to look for the status? In any case, the `/config_dump` output seems like a bug.
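The shape of the problem can be seen by filtering the dump for both lists. A minimal Python sketch over a trimmed, hypothetical `/config_dump` payload (real output has many more fields):

```python
# Sketch: list active vs. warming listener addresses from a /config_dump
# payload. CONFIG_DUMP is a trimmed, hypothetical sample; in practice you
# would fetch it with: curl -s 127.0.0.1:9901/config_dump
CONFIG_DUMP = {
    "configs": [
        {
            "dynamic_active_listeners": [
                {"listener": {"address": {"socket_address": {"address": "0.0.0.0", "port_value": 9080}}}},
                {"listener": {"address": {"socket_address": {"address": "0.0.0.0", "port_value": 9443}}}},
            ],
            "dynamic_warming_listeners": [],
        }
    ]
}

def listener_addresses(dump, state):
    """Collect (address, port) pairs from dynamic_{active,warming}_listeners."""
    key = f"dynamic_{state}_listeners"
    result = []
    for config in dump["configs"]:
        for entry in config.get(key, []):
            sa = entry["listener"]["address"]["socket_address"]
            result.append((sa["address"], sa["port_value"]))
    return result

print(listener_addresses(CONFIG_DUMP, "active"))
print(listener_addresses(CONFIG_DUMP, "warming"))
```

Both listeners show up as active and none as warming, which is exactly the confusing state being reported here.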
> The 2nd part AFAICT is WAI, at least as expected today. The way Envoy init works is that everything goes through the init process before listeners are started and actually accept connections. This is so that we make sure the server is fully initialized prior to accepting traffic.
>
> We could obviously relax this so that listeners start taking traffic as their dependencies are initialized, but this would be a non-trivial code change. Perhaps open an explicit enhancement ticket on this part?
Ah, I incorrectly assumed that this was already the case, thanks!
But wouldn't this be effectively the enhancement ticket?
> In any case, the /config_dump seems like a bug.
It's an artifact of how warming works pre and post init. I agree we can do better here.
> But wouldn't this be effectively the enhancement ticket?
It's hard for viewers to look at a giant ticket like this and grok what is going on. Now that we understand both issues, I would suggest opening two detailed tickets, one for each sub-issue, that we can then action on, as they are (mostly) independent issues.
Having listeners warm independently also lines up with the goal of better handling partial xDS update failures (e.g. having a partial RDS update be able to unblock a bunch of listeners). So, +1 to this.
This might happen if the listener is "added" but not yet "replaced". If there is no draining listener at this port, it's nice to have the listen() before the listener is ready to serve. The benefit is that incoming connections are queued instead of rejected.
Maybe we should give listeners in this state a new name instead of "active".
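The queuing behavior described above is plain TCP semantics and is easy to demonstrate. A minimal Python sketch (no Envoy involved): once listen() has been called, a connection completes and sits in the kernel backlog even before anyone calls accept().

```python
# Demonstrates that after listen(), incoming connections are queued in
# the kernel backlog (not rejected) even though accept() hasn't run yet.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # pick any free port
server.listen(16)               # listening, but nobody calls accept() yet
port = server.getsockname()[1]

# Client connects while the "listener" is still warming: the three-way
# handshake completes and the connection waits in the backlog.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello")

# Later, once the listener is "ready", it accepts the queued connection
# and the bytes sent earlier are still there.
conn, _ = server.accept()
data = conn.recv(5)
print(data)

conn.close()
client.close()
server.close()
```

If the socket were not yet listening, the same connect would fail with connection refused instead of being queued.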
@PiotrSikora asked me to take a look. I went over the code and chatted with @silentdai; here are some thoughts (scoped only to why `active_listeners_` are not actually active, and the resulting confusion in config_dump; I agree it's better to have a separate issue for independent updates, and will file that).
Whether a listener goes to `active_listeners_` or `warming_listeners_` depends on `workers_started_`... Couldn't we just always add to `warming_listeners_` if the listener doesn't exist yet? Then, later on, when listeners get warmed, they can still be moved to the `active_listeners_` list.
There is some discussion on the original PR: https://github.com/envoyproxy/envoy/pull/1215/files#r126001538
We know that before workers are started, Envoy uses the global init manager, and after that each listener uses its own init manager. But that seems independent of whether the listener is added to the warming list. If we just always add listeners to warming_listeners in that line, and make the corresponding changes in the listener manager when finding/updating the list, things should work? I'm thinking of just changing that line and seeing whether any legitimate test cases break.
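The proposal above can be sketched as a toy model (illustrative Python only, not Envoy code; the class and method names are made up): every new listener enters the warming list first, regardless of whether workers have started, and moves to the active list only once warmed.

```python
# Toy model of the "always warm first" proposal (not Envoy code; names
# are illustrative). Listeners start in warming and are promoted to
# active only once their dependencies (e.g. SDS secrets) are ready.
class ToyListenerManager:
    def __init__(self):
        self.warming = []
        self.active = []

    def add_listener(self, name):
        # Always start in warming, even before workers have started.
        self.warming.append(name)

    def on_warmed(self, name):
        # Promote to active only once the listener is actually ready.
        self.warming.remove(name)
        self.active.append(name)

mgr = ToyListenerManager()
mgr.add_listener("listener_9080")
mgr.add_listener("listener_9443")   # e.g. still waiting on an SDS update
mgr.on_warmed("listener_9080")
print(mgr.active)
print(mgr.warming)
```

With this bookkeeping, a config_dump-style view would naturally show the still-blocked listener as warming rather than active.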
@mattklein123 Please let us know if above thinking is correct or not, or we miss some historical edge/tricky context, thanks!
I think @asraa also bumped into something in this area the other day while developing a dynamic xDS fuzzer; there is some interaction between workers started and listener warming that can lead to a crash in early initialization phases.
Yes! Sort of. I set up a crash by removing a listener before worker threads are started.
I think this happens because:
I'll set up a separate issue detailing the test case.
@asraa an xDS fuzzer would be helpful! I am making changes to the listener code. Tests are breaking (including crashes and unmet expectations). Even with the crashing tests, it could be that the test cases access an invalid address in a way that prod code wouldn't. I would be much more confident in my code if I could spam Envoy with all kinds of configs.
> @mattklein123 Please let us know if above thinking is correct or not, or we miss some historical edge/tricky context, thanks!
I would need to go back through all this code. At a high level, I agree that during the first init/boot we can/should be using the warming/active lists. I think I probably did it the way that it is initially just because it was simpler for some reason, possibly due to global init manager processing. Unfortunately it's been long enough that I would have to go through and refresh my memory again. I'm happy to help review any changes in this area if someone wants to tackle it.
Thanks @mattklein123 for your feedback.
I took another round of code scanning today; so far I haven't found anything that prevents us from adding listeners to `warming_listeners_` all the time. @asraa I think this might fix the draining-listeners issue as well, since listeners waiting in the warming list won't be drained...
I'll try that approach, see what breaks, and get back to this thread if anything is unclear to me.
I see the same issue: the global init manager has not yet started initializing when the registered targets are ready.
My solution is to add a `ready` flag to `InitTarget`, so that if the target is already ready when initialization starts, the init manager still gets the ready() invocation. It might be reasonable to make `ready` an internal state of `InitTarget`. @mattklein123 Do you think an InitTarget that supports invoking "ready" before initialization would be useful?
> It might be reasonable to make ready an internal state of InitTarget. @mattklein123 Do you think an InitTarget that supports invoking "ready" before initialization would be useful?
IIRC the way this is handled today is that the target should just immediately call the ready callback if it's ready. I believe this is done elsewhere. If that is not working I think we should debug why?
https://github.com/envoyproxy/envoy/blob/e1450a1ca006f7d8b118a63b0f7e30be2639b881/source/common/init/target_impl.h#L71-L72 If the comment is accurate, the InitManager never knows whether ready() was invoked beforehand.
I can add another Target impl, since so far I am the only one who really needs it. I believe this logic could simplify the existing code as well.
I'm not sure that comment is correct. IIRC a target can just call ready() from within the initialize callback, which presumably it can do if it's already ready. cc @mergeconflict who recently worked on this code and should be able to comment.
@mattklein123: @silentdai is right. If `initialize` hasn't yet been called on a given target, it won't have a `watcher_handle_` (set here and checked here), so nothing will happen. I sort of like @silentdai's idea of adding a `ready_` flag, so that if the target becomes ready before `initialize` is called, it will immediately call the init manager's watcher back rather than doing nothing. It's a change in semantics, but one which nobody should be adversely affected by. I think I'd prefer that instead of adding a new target impl.
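The `ready_`-flag semantics being discussed can be modeled in a few lines. A Python sketch of the idea only, not Envoy's actual init target code:

```python
# Toy model of the ready_ flag idea (illustrative, not Envoy's
# Init::TargetImpl). If ready() is called before initialize(), the
# target remembers it and calls the init manager's watcher back as soon
# as initialize() runs, instead of doing nothing.
class ToyTarget:
    def __init__(self):
        self.ready_ = False
        self.watcher_ = None      # stands in for watcher_handle_

    def initialize(self, watcher):
        self.watcher_ = watcher
        if self.ready_:
            # Target became ready before initialization started:
            # notify the manager immediately.
            self.watcher_()

    def ready(self):
        self.ready_ = True
        if self.watcher_ is not None:
            self.watcher_()

notified = []
t = ToyTarget()
t.ready()                                  # ready *before* initialize()
t.initialize(lambda: notified.append("done"))
print(notified)
```

Without the flag, the early ready() would be silently dropped and the init manager would wait forever, which is the bug being described.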
Sorry, what I'm suggesting is to just immediately call ready() within the call stack of the initialize() call. That will work, right?
> Sorry what I'm suggesting is to just immediately call ready() within the call stack of the initialize() call. That will work, right?
Oh, my bad, I forgot to address your comment. That's one of the use cases I originally intended, and it does work (see e.g. this test), but I'm assuming that @silentdai has some use case where he has to call ready() before initialize() is called. Is that right?
Back to the original issue: I talked with @silentdai and now have more understanding. The global init manager gates startWorkers, and listeners are added to active_listeners while the worker threads haven't started yet. So, as @mattklein123 mentioned, this works as intended right now, and I kind of agree: starting from a known good state is reasonable. I'm not sure if we should proceed with my proposal above. If not, we still need to clarify config_dump, with special logic for printing active_listeners so as not to confuse users.
@mattklein123 @PiotrSikora @htuch WDTY?
> @mattklein123 @PiotrSikora @htuch WDTY?
We definitely have to block server start so we must use the global init manager. That can't change. However it would definitely be possible to make the code more complex and switch to using both the warming/active lists, and make sure that we take care of removal correctly during global init. This is definitely a bug and we should handle it. (This is the issue @asraa is seeing.)
Personally, I would probably focus on two separate things here: 1) Let's clarify config_dump so that it's more clear what is going on during init. 2) Let's fix the fast add/remove bugs somehow.
I would have to spend more time going back through the code to suggest specific fixes, but my advice is to iterate and let's come up with simple solutions to (1) and (2) above and go from there?
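The gating described above can be sketched as a toy model (illustrative Python, not Envoy code): startWorkers fires only once every target registered with the global init manager has signaled ready.

```python
# Toy model of global init gating server start (illustrative only).
# startWorkers (the callback) runs only after *all* registered targets
# have reported ready.
class ToyInitManager:
    def __init__(self):
        self.pending = set()
        self.on_initialized = None

    def add_target(self, name):
        self.pending.add(name)

    def initialize(self, on_initialized):
        self.on_initialized = on_initialized
        if not self.pending:
            self.on_initialized()

    def target_ready(self, name):
        self.pending.discard(name)
        if self.on_initialized is not None and not self.pending:
            self.on_initialized()

started = []
init = ToyInitManager()
init.add_target("listener_9080")
init.add_target("listener_9443")
init.initialize(lambda: started.append("workers"))  # blocked: targets pending
init.target_ready("listener_9080")                  # still blocked on 9443
init.target_ready("listener_9443")                  # now workers can start
print(started)
```

This is why a single listener stuck warming (e.g. on SDS) holds back every other listener during boot: they all share the one global gate.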
@mergeconflict Yes, I have one use case. I have the subscription callback (more precisely, the filter chain update) as a dependency of listener warm-up, which indicates it should register as a target. I don't want to couple the subscription with listener initialization (which is late-initialized only at server startup, not in other cases). A new target implementation would work like a charm.
Closing this issue since the two problems raised in it are both resolved.
2. Initializing all listeners together before workers are started is WAI behavior.
Why is this WAI?
> Why is this WAI?
It's WAI as the code is written today. Please open a fresh issue on making this part better so we can keep things straightforward.
Just want to make sure I understand what we agree on: (1) letting listeners start taking traffic independently as each one warms, and (2) putting listeners into the warming_list just to avoid confusion, while still blocking worker/Envoy start until those first-received listeners are ready/warmed. (1) makes real behavior changes; (2) only makes the code/implementation easier to follow.
Correct. If we want to track (1) please open a new issue on that.
Description:
All listeners from LDS are marked as "active", but they don't even listen() if at least one of them is waiting for an SDS update. I believe there are really 2 issues here: 1) all listeners are marked as "active" instead of "warming", and 2) all listeners are blocked on a single listener in a warming state.
Repro steps / Admin and Stats Output:
Start "fake" xDS server that will accept connection, but will never respond:
Start Envoy:
Verify that listeners from LDS are configured, and marked as active and not warming:
Try to connect to both listeners, discover that neither is listening:
Config:
Logs:
cc @mattklein123 @htuch @duderino @costinm @jplevyak @silentdai @philrud @howardjohn @wattli @quanjielin @JimmyCYJ