edgexfoundry / device-sdk-go

Owner: Device WG
Apache License 2.0
89 stars 126 forks source link

Discovery issues surfaced during device-rfid development #1001

Closed tonyespy closed 2 years ago

tonyespy commented 3 years ago

version(s), 1.4, 2.0

This issue has been created as a result of a recent device-rfid-llrp-go PR which surfaced some issues with the existing discovery mechanism in the SDK, and some re-factoring which was done in response to issues raised by the device-rfid-llrp-go developers.

The following is taken from a comment I made in the PR reference above:

Finally, the locking issue (edgexfoundry/device-sdk-go#609) is a bit puzzling. I believe the original design was correct and that releasing the lock when the SDK read a list of devices from the channel, was intentional. Discovery() was never meant to be a long-running operation that trickled up devices over a long period of time, it was intended to be called and result in a batch of discovered devices to be returned. That said, it clearly wasn't well-documented in the device-sdk-go's godoc for the ProtocolDiscovery.Discover() method:

// Discover triggers protocol specific device discovery, asynchronously // writes the results to the channel which is passed to the implementation // via ProtocolDriver.Initialize(). The results may be added to the device service // based on a set of acceptance criteria (i.e. Provision Watchers).

One could argue that details about locking/synchronization in shouldn't be exposed in a public API, however there's no guidance to developers as to the fact that only a single instance of discovery can be running at a given time and that the Discover() method must write data to the channel in order signal completion of the operation. If this had been clearly communicated, there would've been no reason to re-factor the SDK to use a wrapper function to release the lock on completion of the Discovery() method. The original implementation was much cleaner/correct Go IMHOP.

And finally, I'll mention that when the SDK was re-factored to "fix" this issue, the godoc still wasn't updated. I think it would be a good idea for us to do so... @iain-anderson @cloudxxx8

tonyespy commented 3 years ago

This comment was written by @ajcasagrande in the afore-mentioned PR (which has already been merged).

@tonyespy I can't comment on the re-factored version as I did not code-review it, but I can attest that the original code was not very pragmatic Go IMHO.

As a developer I ran into many issues implementing the Discovery method due to the SDK (not just related to the locking), at which point I looked to the source to see why my code was not working as expected. From an outsider looking in, I was confused by the implementation as it did not seem to make sense.

  1. Ask yourself which of these statements sound like what the SDK wants to accomplish:

    1. I want subsequent calls to Discover() to be ignored as long as the first call to Discover() is still running.
    2. I want subsequent calls to Discover() to be ignored unless I have received data on a long-lived semi-global deviceCh that I gave them upon Initialization?
  2. Why would you release a lock inside of a select statement that is nowhere near where the lock was obtained (or even same go package)? There is zero traceability and opens up the gates for resource leaking and/or deadlocks, etc.

  3. Discovery ... was intended to be called and result in a batch of discovered devices to be returned ...

    If the SDK wanted me to pass them as a batch why doesn't the Discover() method just expect a return value of DiscoveredDevice slice? Why would I be given a deviceCh at time of Initialize() if the SDK did not want me to use it to asynchronously pass discovered devices the same way i use asyncCh to send Readings?

  4. Discovery() was never meant to be a long-running operation that trickled up devices over a long period of time ...

    In the specific case of this service, even if the discovery only takes 2 seconds and I discover a device in the first 50 milliseconds, why do I have to wait the whole 2 seconds to notify an asynchronous channel that I have found a device? This means needlessly buffering up an array when I could just immediately pass it to the channel.

    As an aside, I do think there are use-cases where a long-running discovery is a good idea. In internal discussions we even contemplated the idea of discovery being its own container/process itself.

iain-anderson commented 3 years ago

I would guess that what we'd wish to do is something like:

Would this achieve what we need and can it be done without breaking?

tonyespy commented 3 years ago

@tonyespy I can't comment on the re-factored version as I did not code-review it, but I can attest that the original code was not very pragmatic Go IMHO.

@ajcasagrande IMHOP we should've released device discovery under an experimental flag. It was developed as a stand-alone feature, when it would've made more sense to release it in conjunction with a device service which implemented discovery. Instead it was released without much actual testing, thus your team was really the first to ever use it, which meant you ran into many of the issues are described below...

In the future, hopefully we'll abstain from releasing new SDK features in isolation like this.

As a developer I ran into many issues implementing the Discovery method due to the SDK (not just related to the locking), at which point I looked to the source to see why my code was not working as expected. From an outsider looking in, I was confused by the implementation as it did not seem to make sense.

  1. Ask yourself which of these statements sound like what the SDK wants to accomplish:

    1. I want subsequent calls to Discover() to be ignored as long as the first call to Discover() is still running.
    2. I want subsequent calls to Discover() to be ignored unless I have received data on a long-lived semi-global deviceCh that I gave them upon Initialization?

Again, much of this falls out of poor requirements. The Discover() method is supposed to be a trigger for device discovery. The whole concept of locking was introduced in the implementation, and as I've pointed out was not even mentioned in the godoc for the Discover() method.

  1. Why would you release a lock inside of a select statement that is nowhere near where the lock was obtained (or even same go package)? There is zero traceability and opens up the gates for resource leaking and/or deadlocks, etc.

No arguments... in theory, the serialization (if required) could've been handled solely via channels. That said, the fix now requires a wrapper function in order to handle the synchronization which is equally bad IMHOP.

  1. Discovery ... was intended to be called and result in a batch of discovered devices to be returned ...

    If the SDK wanted me to pass them as a batch why doesn't the Discover() method just expect a return value of DiscoveredDevice slice? Why would I be given a deviceCh at time of Initialize() if the SDK did not want me to use it to asynchronously pass discovered devices the same way i use asyncCh to send Readings?

The channel as defined by the Initialize() method is restricted to []DiscoveredDevice.

Initialize(lc logger.LoggingClient, asyncCh chan<- *AsyncValues, deviceCh chan<- []DiscoveredDevice) error

Again, theDiscovery() method doesn't define a retval for discovered devices because again, it's meant to be a trigger for discovery. Maybe it should've been named TriggerDiscovery()?

  1. Discovery() was never meant to be a long-running operation that trickled up devices over a long period of time ...

    In the specific case of this service, even if the discovery only takes 2 seconds and I discover a device in the first 50 milliseconds, why do I have to wait the whole 2 seconds to notify an asynchronous channel that I have found a device? This means needlessly buffering up an array when I could just immediately pass it to the channel. As an aside, I do think there are use-cases where a long-running discovery is a good idea. In internal discussions we even contemplated the idea of discovery being its own container/process itself.

There's no reason discovery couldn't be a long running process similar to the way asynchronous readings are handled, however this wasn't part of the original requirements.

tonyespy commented 3 years ago

I would guess that what we'd wish to do is something like:

  • Discovery() is called by the SDK, either in response to action on the REST endpoint, or its internal schedule.
  • The implementation of that function posts discovered devices to the channel.

This is what we have today (see comment below).

  • Implementations where a long-running discovery process is appropriate may set up their own goroutine that posts to the channel whenever a device is found; they would presumably not implement the Discovery() function or provide one which merely logged a warning.

I don't see why the two are mutually exclusive as pointed out above, the Discover() method is meant to be a trigger only.

  • The SDK ensures that only one instance of the Discovery() function runs at a time by acquiring a synchronization object before calling, and releasing it on return. Acquisition would be via a trylock-type call.

I think we should re-examine whether this serialization is actually needed? What are we trying to accomplish? AFAIK we don't have any other REST endpoints that we throttle like this. The implementation could choose to ignore subsequent calls to Discover() if it had already kicked off a long-running discovery operation couldn't it?

Would this achieve what we need and can it be done without breaking?

iain-anderson commented 3 years ago

Key thing here is to document what we have, eg

iain-anderson commented 2 years ago

closing this as the doc elements are now covered elsewhere - see discussion on https://github.com/edgexfoundry/edgex-docs/issues/598 can raise a new issue if it's felt that the implementation can be improved