Changing sconf not always propagated

dotsimon commented 6 years ago

When changing sconf, acceptors that are neither Ready nor Last can continue to receive and process requests using the old sconf.

If appmods and/or opaque are changed then this could lead to quite incorrect responses. On busy servers, these old acceptors might never get killed by the periodic cleanup.

https://github.com/dotsimon/yaws/tree/sconf_race suggests a solution that preserves the "soft-close" but should ultimately terminate those old acceptors.

vinoski commented 6 years ago

Thanks, I'll take a look.

vinoski commented 6 years ago

How do we duplicate the issue? How do we test the suggested fix?

dotsimon commented 6 years ago

Hi Steve. Sorry for not including test cases, but this was found in a regression test where yaws is only a small part. But consider this case where an embedded server is started with one appmod: a. The gserv process starts an acceptor (acc1), therefore in gserv_loop Last = acc1 and Ready = [] A request to /a/... is made. acc1 sends next to gserv which starts a new acceptor (acc2) such that now Last = acc1 and Ready = [] acc1 calls a:out/1 which in turn updates the sconf to add appmod b gserv calls stop_ready which kills acc2 and then starts a new acceptor acc3 acc1 eventually finishes the request and sends done_client to gserv which adds the pid to Ready. A request now to /b/... gets handled by acc3 ok The next request to /b/ gets 404:ed by acc1 because it still has the old sconf

A variation on this is that accX has already served a request and is in gen_tcp:accept and in gserv's Ready, So in the original code accX still has trap_exit true making the exit in stop_ready ineffective and accX can therefore serve one more request (with the old sconf) before actually dying.

A worse case is like the above but accX is Last eventually dying due to a previous queued EXIT leaving gserv in a state where it thinks accX is waiting in accept but actually there are no processes to serve any requests!

My fixes were:

move trap_exit so that EXIT during gen_tcp:accept would actually kill the process
ignore done_client from acceptors that had been previously unlinked in stop_ready
in stop_ready also handle the clients processing requests

In doing this, I did consider if having a client list instead of relying on process_info(links) would be better. And also if there was a point to sending Pid ! {Top, stop} when the exit(shutdown, Pid) had to be handled in the same way anyway in acceptor0 and aloop.

But, since I don't have a standalone yaws or (sadly) the time to develop test cases I did not want to make any significant reworking. For the same reasons, I did not make a PR from the sconf_race branch.

I hope this has been somewhat helpful. Please let me know if I can help clarify anything.

/Simon

vinoski commented 6 years ago

Thanks for the details, Simon. This information looks like it will be helpful.

erlyaws / yaws

Changing sconf not always propagated #346