edgexfoundry / device-onvif-camera

Owner: Device WG
Apache License 2.0
29 stars 37 forks source link

Implement Start API to resolve Device SDK cache inconsistency, causing device not found errors #308

Closed ajcasagrande closed 1 year ago

ajcasagrande commented 1 year ago

šŸž Bug Report

Affected Services [REQUIRED]

device-sdk-go

Is this a regression?

Yes

Description and Minimal Reproduction [REQUIRED]

All edgex services are running normally. Called docker stop . Wait a while. Call docker start . Observe issue where no device commands work due to 404 device not found error, even though the devices clearly exist in core-metadata.

The first error message code is located in device.Initialize and it calls for _, device := range d.sdkService.Devices() {, which in turn calls return cache.Devices().All(), which leads me to believe the devices do in fact exist in the cache.

šŸ”„ Exception or Error


level=INFO ts=2023-02-17T16:47:20.247530253Z app=device-onvif-camera source=driver.go:141 msg="Initializing onvif client for 'tp-link-Tapo-C200-3fa1fe68-b915-4053-a3e1-1027f5ea88f4' camera"
level=ERROR ts=2023-02-17T16:47:20.440493715Z app=device-onvif-camera source=manageddevices.go:56 msg="failed to find Device tp-link-Tapo-C200-3fa1fe68-b915-4053-a3e1-1027f5ea88f4 in cache"
level=ERROR ts=2023-02-17T16:47:20.440673388Z app=device-onvif-camera source=driver.go:145 msg="failed to initialize onvif client for 'tp-link-Tapo-C200-3fa1fe68-b915-4053-a3e1-1027f5ea88f4' camera, skipping this device."
level=INFO ts=2023-02-17T16:47:20.440714682Z app=device-onvif-camera source=driver.go:141 msg="Initializing onvif client for 'HIKVISION-DS-2DE2A404IW-DE3-099f4000-4d50-11b4-82c8-c06ded544d67' camera"
level=ERROR ts=2023-02-17T16:47:20.491345631Z app=device-onvif-camera source=manageddevices.go:56 msg="failed to find Device HIKVISION-DS-2DE2A404IW-DE3-099f4000-4d50-11b4-82c8-c06ded544d67 in cache"
level=ERROR ts=2023-02-17T16:47:20.491489277Z app=device-onvif-camera source=driver.go:145 msg="failed to initialize onvif client for 'HIKVISION-DS-2DE2A404IW-DE3-099f4000-4d50-11b4-82c8-c06ded544d67' camera, skipping this device."
...
level=INFO ts=2023-02-17T16:47:23.952698178Z app=device-onvif-camera source=driver.go:610 msg="Discovered 2 device(s) in 2.44027692s via netscan."
level=INFO ts=2023-02-17T16:47:23.952931049Z app=device-onvif-camera source=async.go:104 msg="Adding discovered device HIKVISION-DS-2DE2A404IW-DE3-099f4000-4d50-11b4-82c8-c06ded544d67 to Metadata"
level=INFO ts=2023-02-17T16:47:23.958015598Z app=device-onvif-camera source=async.go:104 msg="Adding discovered device tp-link-Tapo-C200-3fa1fe68-b915-4053-a3e1-1027f5ea88f4 to Metadata"
level=ERROR ts=2023-02-17T16:47:34.382891635Z app=device-onvif-camera source=restrouter.go:168 X-Correlation-ID=8b894975-57a7-43d3-b1d6-70a83e510116 msg="device HIKVISION-DS-2DE2A404IW-DE3-099f4000-4d50-11b4-82c8-c06ded544d67 not found"

šŸŒ Your Environment

Deployment Environment: docker compose 2.x edgex compose levski branch v2.3.0

EdgeX Version [REQUIRED]: 2.3.0

Anything else relevant?

jinlinGuan commented 1 year ago

I cannot recreate the error by using device-virtual or device-modbus with 2.3.0 or main branch. Could you provide us with the onvif camera setup steps, the profile and the device information?

lenny-goodell commented 1 year ago

@ajcasagrande, is this with 2.3.0, or the latest WIP 3.0.0 code?

presatish commented 1 year ago

@lenny-intel facing this issue with Levski.

ajcasagrande commented 1 year ago

@ajcasagrande, is this with 2.3.0, or the latest WIP 3.0.0 code?

Yes this is using levski branch of edgex compose, stock 2.3.0 images (ignore the first usb one):

image

Full docker compose: docker-compose.yml.txt

ajcasagrande commented 1 year ago

Ok, so I have looked into it further, and the issue is being caused due to a timing issue in regards to the actual architecture of EdgeX (device-sdk-go) itself. Since EdgeX only provides an Initialize function and not a Start function, we (device-onvif-camera devs) have placed logic in the Initialize block which will ultimately cause the UpdateDevice to be called.

However, looking at the code: https://github.com/edgexfoundry/device-sdk-go/blob/v2.3.0/pkg/service/init.go#L56-L67

The Initialize function is called before the selfRegister function. Because of this, the dic has not been updated to reflect the current running service, thus resulting in the empty value being returned.

The problem is that I assume this order was chosen to prevent a device service that failed initialization from registering. This is where the connundrum happens. So my suggestions forward is that I will create a feature request to implement a Start callback for the service which is called after the registration and bootstrapping. And then for the current codebase, implement a hacky async sleep to allow the registration to occur before the devices are communicated with.

Edit: Alternatively, a potential fix for the current codebase would be to move these line of code https://github.com/edgexfoundry/device-sdk-go/blob/v2.3.0/pkg/service/service.go#L221-L227 to a new function which is called earlier in the bootstrap process (before driver.Initialize at least).

lenny-goodell commented 1 year ago

@ajcasagrande, we will need to back port you hack to the levski code and release a path since this is a blocking bug, correct?

ajcasagrande commented 1 year ago

@ajcasagrande, we will need to back port you hack to the levski code and release a path since this is a blocking bug, correct?

Yes, either the hack will need to be applied to the levski release of device-onvif-camera (and possibly device-usb-camera), or the alternative method would need to be patched in levski device-sdk-go.

lenny-goodell commented 1 year ago

Edit: Alternatively, a potential fix for the current codebase would be to move these line of code https://github.com/edgexfoundry/device-sdk-go/blob/v2.3.0/pkg/service/service.go#L221-L227 to a new function which is called earlier in the bootstrap process (before driver.Initialize at least).

@ajcasagrande , this seems like a good fix for Levski and then the Start API for Minnesota for better control. Thoughts?

ajcasagrande commented 1 year ago

@lenny-intel Yes, I like that fix for Levski as opposed to doing the hack-y sleep method, assuming there are no other issues that arise after patching it.

lenny-goodell commented 1 year ago

Moved this to the ONVIF repo for implementing the Start() API for this service

lenny-goodell commented 1 year ago

Latest SDK requires the Start API() to be implemented.

cloudxxx8 commented 1 year ago

@ajcasagrande do you need my team to help fix this issue in 3.0?

cloudxxx8 commented 1 year ago

@ajcasagrande @presatish are you able to resolve this by the code freeze date (May 10th)?

presatish commented 1 year ago

@cloudxxx8 I think we should be able to. We have prioritized this issue along with others to complete by May 10th.

ajcasagrande commented 1 year ago

@cloudxxx8 Sorry, my notifications seem to not be working properly, or I set my outlook rules incorrectly. Yes, we will address this. Bruce has already added the stub for it, and I can implement it after my current refactor PR, since it may cause conflicts if done separately.