containerd / nri

Node Resource Interface
Apache License 2.0
220 stars 58 forks source link

stub.go calls os.Exit(0) on lost connection #74

Open champtar opened 3 months ago

champtar commented 3 months ago

If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin os.Exit(0) without a chance to log anything

https://github.com/containerd/nri/blob/53d3371559b3aedf4f491c33dc638fe535cc37ea/pkg/stub/stub.go#L519-L528

@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()

Also stub.Run hides ttrpc.ErrServerClosed https://github.com/containerd/nri/blob/53d3371559b3aedf4f491c33dc638fe535cc37ea/pkg/stub/stub.go#L436-L450

Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?

klihub commented 3 months ago

If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin os.Exit(0) without a chance to log anything

Yes, but only if the plugin does not have an onClose()-handler.

https://github.com/containerd/nri/blob/53d3371559b3aedf4f491c33dc638fe535cc37ea/pkg/stub/stub.go#L519-L528

@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()

The logic was roughly: If the plugin has an onClose()-handler it wants to get notified about the lost connection, so we gather it does know how to and wants to deal with the situation on its own. But if the plugin ignores lost connection (does not have an onClose()-handler), we do an os.Exit().

Also stub.Run hides ttrpc.ErrServerClosed

https://github.com/containerd/nri/blob/53d3371559b3aedf4f491c33dc638fe535cc37ea/pkg/stub/stub.go#L436-L450

Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?

Hmm... I think we ignore ttrpc.ErrServerClosed as a cheap way of differentiating between a failed Start() and a successful run terminated by the runtime being shut down. But maybe we should do it differently, for instance by defining in stub a ErrFailedToStart = errors.New("failed to start") then always return a fmt.Errorf("%w: %w", ErrFailedToStart, err) for all errors from Start() and return any other/later errors as such. Then one could differentiate between startup- and other runtime errors by checking it with errors.Is(err, stub.ErrFailedToStart), if that is really necessary.

champtar commented 3 months ago

The problem here is that you/the current code assumes lost connection is a normal shutdown. For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.

klihub commented 3 months ago

The problem here is that you/the current code assumes lost connection is a normal shutdown. For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.

But is the Exit(0)-logic really a problem ? If you don't want the stub to exit your plugin when the connection is lost, yet you are not interested in the fact itself, you can simply instantiate the stub with an extra stub.WithOnClose(func (){}) option.

Unless your plugin is the kind which does unsolicited updates, if you don't register an OnClose-handler, you will never realize that you have lost the connection. So IMO, the least we must do in the stub is to exit runtime-launched plugins when the connection is lost since those cannot re-establish their connection. Instead of that, now we exit all plugins which have no OnClose handler.

klihub commented 3 months ago

About swallowing ErrServerClosed in Run() once the stub has been successfully Start()ed, I think it should be safe to get rid of that. Just need to check if I rely on that behavior in some tests or integration tests and fix them if I do...