cloudflare / tableflip

Graceful process restarts in Go
BSD 3-Clause "New" or "Revised" License
2.91k stars 148 forks source link

Don't panic on parent connectivity problems #1

Closed pascaldekloe closed 6 years ago

pascaldekloe commented 6 years ago

After receiving the file name mapping from the parent process, the connectivity is read until EOF in a separate routine. Errors there should not lead to a panic as they might even occur after a successful handover, at least in theory.

I'd recommend a proper handshake with forward compatibility in mind.

lmb commented 6 years ago

Thanks for the PR! Can you describe what happened to trigger this behavior? I agree that panicking is not nice.

pascaldekloe commented 6 years ago

Hi Lorenz, I just updated the description. Is that what you had in mind as a description?

On Mon, Oct 15, 2018 at 2:21 PM Lorenz Bauer notifications@github.com wrote:

Thanks for the PR! Can you describe what happened to trigger this behavior? I agree that panicking is not nice.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/cloudflare/tableflip/pull/1#issuecomment-429831360, or mute the thread https://github.com/notifications/unsubscribe-auth/AEpEALL5LR4m_z29OGJksbRI38usN6Ocks5ulH26gaJpZM4XcGul .

lmb commented 6 years ago

Ah, now things are clearer, thanks!

So this is an interesting question. The child uses io.EOF to detect when the parent has exited, to prevent additional updates from happening. The comment in https://github.com/cloudflare/tableflip/blob/master/upgrader.go#L196-L199 explains this a bit.

If we don't get io.EOF the parent has done something weird like write some more to the child. This would be a violation of the protocol, hence I wanted to opt on the safe side and make this an error.

You have a valid point that panicking at this stage is not great though, since this might kill the only running instance of a service.

pascaldekloe commented 6 years ago

Maybe we're better of with os.ProcessState.Wait then? I'll be happy to write a patch as a learning exercise?

lmb commented 6 years ago

That's an interesting idea. The documentation for https://golang.org/pkg/os/#Process.Wait says that it only works on child processes though?

On Mon, 15 Oct 2018 at 16:29, Pascal S. de Kloe notifications@github.com wrote:

Maybe we're better of with os.ProcessState.Wait then? I'll be happy to write a patch as a learning exercise?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Lorenz Bauer | Systems Engineer 25 Lavington St., London SE1 0NZ

www.cloudflare.com

pascaldekloe commented 6 years ago

You're right. I never noticed that security measure before. So now the change discards data (with complaining) and lock on I/O error, correct?

pascaldekloe commented 6 years ago

For an intend is to lock operation and force manual intervention, an infinite loop is just the thing one might think—especialy if it saves a bit of coding. I'm out for holidays now so feel free to do with the pull request as you please.

lmb commented 6 years ago

Merged via 658b336eb2dcd3ca. Thank you for contributing!

pascaldekloe commented 6 years ago

Thank you for this beautiful project. :-)

Two notes on the final solution. The reader interface does not require n to be zero on error so there's a (small) chance to miss errors like that. Also Upgrade has a race condition on parentErr now.

lmb commented 6 years ago

Thank you for your kind words!

Yes, we could miss the case of err != nil, n > 0 but i'm not terribly worried about that. For the race condition, reading and setting parentErr is protected by upgradeSem, so that shouldn't race. Am I missing something?

On Wed, 17 Oct 2018 at 18:19, Pascal S. de Kloe notifications@github.com wrote:

Thank you for this beautiful project. :-)

Two notes on the final solution. The reader interface does not require n to be zero on error so there's a (small) chance to miss errors like that. Also Upgrade has a race condition on parentErr now.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/cloudflare/tableflip/pull/1#issuecomment-430714347, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWfCAKhXuzLyU3l-Az96aZUTs8vDofIks5ul2awgaJpZM4XcGul .

-- Lorenz Bauer | Systems Engineer 25 Lavington St., London SE1 0NZ

www.cloudflare.com