Closed ajcasagrande closed 2 years ago
Related to #1053 and #1055 ?
Looks like this is caused by a race condition between AddProvisionWatcher (which will attempt to add missing profile) and LoadProfiles (which is attempting to load up all the profiles).
Some higher level locking needs to occur so AddProvisionWatcher is blocked while LoadProfiles is running.
This issue is related more on a fundamental level to how the caching interface was designed, and this isn't the only place where the race-condition is likely present.
Here is some example code:
Now look at the code for ForName
:
and Add
:
The problem comes from the fact that the p.mutex
is not exposed to us, so between the ForName
check and the Add
anything can happen. This should be done as an atomic operation.
A possible alternative:
Add an AddIfMissing
function which will perform the operation without ever losing the lock:
// AddIfMissing adds a new profile to the cache if one does not exist.
func (p *profileCache) AddIfMissing(profile models.DeviceProfile) errors.EdgeX {
p.mutex.Lock()
defer p.mutex.Unlock()
if _, found := p.deviceProfileMap[profile.Name]; found {
return nil // profile has already been added, nothing to do
}
return p.add(profile)
}
More generic way of allowing the user to hold the lock, though looking at the code I do not know if it would work for this code base
Actually it looks like this usage is what casues such a large window for the race-condition to be triggered.
On line 33 it is going out to the network to retrieve some information, and then without even checking if something has changed in the meantime proceeds to add the profile on line 39.
A quick bandaid to lessen the chances of the race-condition occurring (again still best to just make it atomic), would be to add another cache.Profiles().ForName(watcher.ProfileName)
check before the Add
on line 39.
_, ok := cache.Profiles().ForName(watcher.ProfileName)
if !ok {
res, err := s.edgexClients.DeviceProfileClient.DeviceProfileByName(context.Background(), watcher.ProfileName)
if err != nil {
errMsg := fmt.Sprintf("failed to find Profile %s for provision watcher %s", watcher.ProfileName, watcher.Name)
s.LoggingClient.Error(errMsg)
return "", err
}
// make sure the profile is still missing from the cache before attempting to add it
if _, found := cache.Profiles().ForName(watcher.ProfileName); found {
err = cache.Profiles().Add(dtos.ToDeviceProfileModel(res.Profile))
if err != nil {
return "", errors.NewCommonEdgeX(errors.KindServerError, fmt.Sprintf("failed to cache the profile %s", res.Profile.Name), err)
}
}
}
We should just remove all the Profile related check in managedwatcher.go and manageddevice.go, because they are checked in core-metadata. Also, the profile will be added to the cache in callback.
@cloudxxx8 @weichou1229 Does this fix need to be ported back to the opensource?
🐞 Bug Report
Affected Services [REQUIRED]
device-sdk-go device-onvif-camera
Is this a regression?
No sure
Description and Minimal Reproduction [REQUIRED]
Attempting to add pre-defined devices using
camera.toml
as well as have AutoDiscovery enabled. Due to this issue, the call to add provision watchers fails, causing any discovered device to not be added to EdgeX.Add pre-defined devices in
camera.toml
for thedevice-onvif-service
🔥 Exception or Error
🌍 Your Environment
Deployment Environment: Docker CE
EdgeX Version [REQUIRED]: 2.1
Anything else relevant?