edgexfoundry / device-sdk-go

Owner: Device WG
Apache License 2.0
90 stars 125 forks source link

Error adding provision watchers #1195

Closed ajcasagrande closed 2 years ago

ajcasagrande commented 2 years ago

🐞 Bug Report

Affected Services [REQUIRED]

device-sdk-go device-onvif-camera

Is this a regression?

No sure

Description and Minimal Reproduction [REQUIRED]

Attempting to add pre-defined devices using camera.toml as well as have AutoDiscovery enabled. Due to this issue, the call to add provision watchers fails, causing any discovered device to not be added to EdgeX.

Add pre-defined devices in camera.toml for the device-onvif-service

🔥 Exception or Error

Error adding provision watchers. Newly discovered devices may fail to register with EdgeX: error adding provision watcher Generic-Onvif-Provision-Watcher -> failed to cache the profile onvif-camera -> Profile onvif-camera has already existed in cache"

level=INFO ts=18:13:31.269525838Z source=profiles.go:54 msg="Loading pre-defined profiles from /res/profiles"
level=INFO ts=18:13:31.277129267Z source=profiles.go:97 msg="Profile onvif-camera not found in Metadata, adding it ..."
level=INFO ts=18:13:31.278251195Z source=profiles.go:97 msg="Profile device-onvif-camera not found in Metadata, adding it ..."
level=INFO ts=18:13:31.288411216Z source=devices.go:49 msg="Loading pre-defined devices from /res/devices"
level=INFO ts=18:13:31.288714825Z source=devices.go:87 msg="Device Camera-1 not found in Metadata, adding it ..."
level=INFO ts=18:13:31.288742398Z source=devices.go:87 msg="Device Camera-3 not found in Metadata, adding it ..."
level=INFO ts=18:13:31.288894343Z source=devices.go:87 msg="Device device-onvif-camera not found in Metadata, adding it ..."
level=INFO ts=18:13:31.294546432Z source=message.go:50 msg="Service dependencies resolved..."
level=INFO ts=18:13:31.29458733Z source=message.go:51 msg="Starting device-onvif-camera 0.0.0 "
level=INFO ts=18:13:31.294601369Z source=message.go:55 msg="Camera device service started"
level=INFO ts=18:13:31.294613007Z source=message.go:58 msg="Service started in: 12.152520328s"
level=INFO ts=18:13:31.294669415Z source=autodiscovery.go:51 msg="Starting auto-discovery with duration 1h0m0s"
level=DEBUG ts=18:13:31.294693334Z source=discovery.go:34 msg="protocol discovery triggered"
level=INFO ts=18:13:31.294706759Z source=driver.go:508 msg="Discover was called."
level=INFO ts=18:13:31.294723848Z source=driver.go:255 msg="Adding provision watchers from res/provision_watchers"
level=DEBUG ts=18:13:31.294972346Z source=driver.go:262 msg="1 provision watcher files found"
level=DEBUG ts=18:13:31.294995351Z source=driver.go:267 msg="processing res/provision_watchers/generic.provision.watcher.json"
level=ERROR ts=18:13:31.295122632Z source=managedwatchers.go:68 msg="failed to find ProvisionWatcher Generic-Onvif-Provision-Watcher in cache"
level=INFO ts=18:13:31.295151044Z source=driver.go:293 msg="Adding provision watcher:Generic-Onvif-Provision-Watcher"
level=DEBUG ts=18:13:31.300365705Z source=callback.go:56 msg="device device-onvif-camera added"
level=DEBUG ts=18:13:31.300960274Z source=callback.go:61 msg="Invoked driver.AddDevice callback for device-onvif-camera"
level=DEBUG ts=18:13:31.301711768Z source=callback.go:67 msg="starting AutoEvents for device device-onvif-camera"
level=DEBUG ts=18:13:31.30055395Z source=callback.go:56 msg="device Camera-1 added"
level=DEBUG ts=18:13:31.302105343Z source=transform.go:96 msg="failed to read ResourceOperation: failed to find ResourceOpertaion with DeviceResource CameraAdded in Profile device-onvif-camera"
level=DEBUG ts=18:13:31.302164964Z source=transform.go:121 msg="device: device-onvif-camera DeviceResource: CameraAdded reading: {Id:41f710b9-7f92-4b22-8705-e8b112acf6be Origin:1658340811302065951 DeviceName:device-onvif-camera ResourceName:CameraAdded ProfileName:device-onvif-camera ValueType:String Units: BinaryReading:{BinaryValue:[] MediaType:} SimpleReading:{Value:Camera-1} ObjectReading:{ObjectValue:}}"
level=INFO ts=18:13:31.302269665Z source=credentials.go:106 msg="Found credentials for device Camera-1"
level=DEBUG ts=18:13:31.304343722Z source=callback.go:56 msg="device Camera-3 added"
level=INFO ts=18:13:31.304516781Z source=credentials.go:106 msg="Found credentials for device Camera-3"
level=ERROR ts=18:13:31.304605588Z source=driver.go:524 msg="Error adding provision watchers. Newly discovered devices may fail to register with EdgeX: error adding provision watcher Generic-Onvif-Provision-Watcher -> failed to cache the profile onvif-camera -> Profile onvif-camera has already existed in cache"
level=DEBUG ts=18:13:31.304906541Z source=transform.go:96 msg="failed to read ResourceOperation: failed to find ResourceOpertaion with DeviceResource CameraAdded in Profile device-onvif-camera"
level=DEBUG ts=18:13:31.304962163Z source=transform.go:121 msg="device: device-onvif-camera DeviceResource: CameraAdded reading: {Id:008be9b1-3b54-4a83-b263-ea1075d6fde3 Origin:1658340811304854845 DeviceName:device-onvif-camera ResourceName:CameraAdded ProfileName:device-onvif-camera ValueType:String Units: BinaryReading:{BinaryValue:[] MediaType:} SimpleReading:{Value:Camera-3} ObjectReading:{ObjectValue:}}"
level=DEBUG ts=18:13:31.30520772Z source=utils.go:80 msg="Event(profileName: device-onvif-camera, deviceName: device-onvif-camera, sourceName: CameraAdded, id: 1f38062f-c152-4219-b479-c00cd3c7eb53) published to MessageBus"
level=DEBUG ts=18:13:31.306541534Z source=utils.go:80 msg="Event(profileName: device-onvif-camera, deviceName: device-onvif-camera, sourceName: CameraAdded, id: 0a3e69aa-51c9-4659-ae71-217ab9a05cd1) published to MessageBus"
level=DEBUG ts=18:13:31.339002897Z source=callback.go:61 msg="Invoked driver.AddDevice callback for Camera-1"
level=DEBUG ts=18:13:31.339031395Z source=callback.go:67 msg="starting AutoEvents for device Camera-1"
level=DEBUG ts=18:13:31.345664679Z source=callback.go:61 msg="Invoked driver.AddDevice callback for Camera-3"
level=DEBUG ts=18:13:31.345687771Z source=callback.go:67 msg="starting AutoEvents for device Camera-3"

🌍 Your Environment

Deployment Environment: Docker CE

EdgeX Version [REQUIRED]: 2.1

Anything else relevant?

ajcasagrande commented 2 years ago

Related to #1053 and #1055 ?

ajcasagrande commented 2 years ago

configuration.toml.txt camera.toml.txt

lenny-goodell commented 2 years ago

Looks like this is caused by a race condition between AddProvisionWatcher (which will attempt to add missing profile) and LoadProfiles (which is attempting to load up all the profiles).

Some higher level locking needs to occur so AddProvisionWatcher is blocked while LoadProfiles is running.

ajcasagrande commented 2 years ago

This issue is related more on a fundamental level to how the caching interface was designed, and this isn't the only place where the race-condition is likely present.

Here is some example code:

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/internal/application/callback.go#L266-L268

Now look at the code for ForName:

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/internal/cache/profiles.go#L59-L67

and Add:

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/internal/cache/profiles.go#L86-L90

The problem comes from the fact that the p.mutex is not exposed to us, so between the ForName check and the Add anything can happen. This should be done as an atomic operation.

A possible alternative:

  1. Add an AddIfMissing function which will perform the operation without ever losing the lock:

    // AddIfMissing adds a new profile to the cache if one does not exist.
    func (p *profileCache) AddIfMissing(profile models.DeviceProfile) errors.EdgeX {
    p.mutex.Lock()
    defer p.mutex.Unlock()
    
    if _, found := p.deviceProfileMap[profile.Name]; found {
        return nil // profile has already been added, nothing to do
    }
    return p.add(profile)
    }
  2. More generic way of allowing the user to hold the lock, though looking at the code I do not know if it would work for this code base

ajcasagrande commented 2 years ago

Actually it looks like this usage is what casues such a large window for the race-condition to be triggered.

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/pkg/service/managedwatchers.go#L25

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/pkg/service/managedwatchers.go#L31-L39

On line 33 it is going out to the network to retrieve some information, and then without even checking if something has changed in the meantime proceeds to add the profile on line 39.

A quick bandaid to lessen the chances of the race-condition occurring (again still best to just make it atomic), would be to add another cache.Profiles().ForName(watcher.ProfileName) check before the Add on line 39.

_, ok := cache.Profiles().ForName(watcher.ProfileName)
if !ok {
    res, err := s.edgexClients.DeviceProfileClient.DeviceProfileByName(context.Background(), watcher.ProfileName)
    if err != nil {
        errMsg := fmt.Sprintf("failed to find Profile %s for provision watcher %s", watcher.ProfileName, watcher.Name)
        s.LoggingClient.Error(errMsg)
        return "", err
    }
    // make sure the profile is still missing from the cache before attempting to add it
    if _, found := cache.Profiles().ForName(watcher.ProfileName); found {
        err = cache.Profiles().Add(dtos.ToDeviceProfileModel(res.Profile))
        if err != nil {
            return "", errors.NewCommonEdgeX(errors.KindServerError, fmt.Sprintf("failed to cache the profile %s", res.Profile.Name), err)
        }
    }
}
cloudxxx8 commented 2 years ago

We should just remove all the Profile related check in managedwatcher.go and manageddevice.go, because they are checked in core-metadata. Also, the profile will be added to the cache in callback.

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/internal/application/callback.go#L171-L174

https://github.com/edgexfoundry/device-sdk-go/blob/ff2bdc244935d5b40da6f20198b29dfc476efd7a/internal/application/callback.go#L42-L46

iain-anderson commented 2 years ago

@cloudxxx8 @weichou1229 Does this fix need to be ported back to the opensource?