edgexfoundry / device-opc-ua


OOM will occur when large amounts of data are collected #53

Closed: ethan256 closed this issue 2 months ago

ethan256 commented 2 months ago

🐞 Bug Report

Affected Services [REQUIRED]

The issue is located in: When collecting a large amount of data, even with the [Retention](https://docs.edgexfoundry.org/3.2/microservices/core/data/details/DataRetention/) policy configured in core-data, the EdgeX database service still runs out of memory (OOM) after a period of time.

VM: 4 CPUs, 8 GB memory

Description and Minimal Reproduction [REQUIRED]

1. Start an OPC UA server simulator. You can write an OPC UA server program in Python (a sketch is included at the end of this report).
2. Build the [docker compose file](https://github.com/edgexfoundry/edgex-compose).
3. Use environment variables to override the core-data configuration:

   ```
   RETENTION_ENABLED: true
   RETENTION_INTERVAL: 10s
   RETENTION_MAXCAP: 10
   RETENTION_MINCAP: 5
   ```

4. Start the docker compose services.
5. Create the device profile and device via the [API](https://docs.edgexfoundry.org/3.2/microservices/core/metadata/ApiReference/).
6. Set the core-data service log level to DEBUG in Consul and observe service resource utilization.

🔥 Exception or Error

1. The number of keys in Redis keeps increasing, leading to OOM in the VM.
2. Debug logs: "Prepare to delete 0 readings"

🌍 Your Environment

**Deployment Environment:** ubuntu22.04-wsl2

**EdgeX Version [REQUIRED]:** 3.2.0

**Anything else relevant?**

1. After about 11 hours, the Redis service restarted:
   ![image](https://github.com/edgexfoundry/device-opc-ua/assets/46949410/38b16df7-27ab-4b60-b10b-2a4268725ac3)
2. The number of keys in Redis exceeded 10 million:
   ![image](https://github.com/edgexfoundry/device-opc-ua/assets/46949410/8c9246d2-99db-41e8-a611-947a92e79f21)
   ![image](https://github.com/edgexfoundry/device-opc-ua/assets/46949410/f4458f04-3c28-4678-8f7a-be4865eaf135)
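For reference, step 1 above could be covered by a small Python simulator along these lines (a sketch assuming the `asyncua` library; the endpoint, namespace, variable names, and update rate are illustrative assumptions, not taken from the original setup):

```python
import asyncio
import random

from asyncua import Server  # pip install asyncua


async def main():
    server = Server()
    await server.init()
    server.set_endpoint("opc.tcp://0.0.0.0:4840/simulator/")
    idx = await server.register_namespace("urn:example:opcua:simulator")

    # Create a batch of Float64 variables to simulate a large data volume.
    obj = await server.nodes.objects.add_object(idx, "Simulator")
    variables = [await obj.add_variable(idx, f"Var{i}", 0.0) for i in range(200)]
    for var in variables:
        await var.set_writable()

    # Continuously update every variable so the device service keeps
    # producing readings.
    async with server:
        while True:
            for var in variables:
                await var.write_value(random.uniform(0.0, 10.0))
            await asyncio.sleep(0.1)


if __name__ == "__main__":
    asyncio.run(main())
```
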
cloudxxx8 commented 2 months ago

Please try to set WRITABLE_PERSISTDATA: 'false' in core-data to confirm whether it is a device-opc-ua issue.
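For example, the override can be added alongside the other core-data environment variables in the compose file, in the same style as the retention overrides above (the exact service block name depends on your compose file):

```
WRITABLE_PERSISTDATA: "false"
```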

ethan256 commented 2 months ago

> Please try to set WRITABLE_PERSISTDATA: 'false' in core-data to confirm whether it is a device-opc-ua issue.

By debugging the core-data service, I intercepted one of the OPC UA events, shown below:

```json
{
   "apiVersion":"v3",
   "requestId":"175704ad-5c95-4c2b-9c99-93525ac755b8",
   "event":{
      "apiVersion":"v3",
      "id":"83cb0bf3-5887-4482-a0dc-7f5dd7534305",
      "deviceName":"device_1",
      "profileName":"profile_1",
      "sourceName":"OperateVariable143",
      "origin":1714456509397999065,
      "readings":[
         {
            "id":"aed392e1-b6ab-4fef-ac1a-735256b0b2de",
            "origin":1714456509397,
            "deviceName":"device_1",
            "resourceName":"Var143",
            "profileName":"profile_1",
            "valueType":"Float64",
            "value":"7.831199e+00"
         }
      ]
   }
}
```

In the data generated by the OPC UA device service, readings[0].origin and event.origin differ by 6 orders of magnitude. Both are collection timestamps, but the former is in milliseconds while the latter is in nanoseconds.

According to the PR, the retention mechanism in core-data uses reading origins rather than event origins, deleting events and readings by age.

However, the time unit used for retention in core-data is nanoseconds, while the readings' origins are in milliseconds, which causes retention to fail. The detailed retention mechanism is here:
https://github.com/edgexfoundry/edgex-go/blob/main/internal/core/data/application/reading.go#L267
https://github.com/edgexfoundry/edgex-go/blob/main/internal/pkg/infrastructure/redis/event.go#L107
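Roughly, the effect of the unit mismatch can be illustrated with the values from the intercepted event above (a sketch of the arithmetic, not the actual core-data code):

```python
# Illustrative values taken from the intercepted event above.
event_origin_ns = 1714456509397999065   # event.origin, nanoseconds (~2024-04-30)
reading_origin_ms = 1714456509397       # readings[0].origin, milliseconds

# Interpreted as nanoseconds, the millisecond origin maps to roughly half an
# hour after the Unix epoch, so it looks impossibly old:
print(reading_origin_ms / 1e9)   # ~1714 seconds after 1970-01-01
print(event_origin_ns / 1e9)     # ~1.71e9 seconds, i.e. April 2024

# A retention cutoff derived from the reading origin (ms) is therefore far
# smaller than every event origin stored in ns, so "delete everything older
# than the cutoff" matches nothing, hence "Prepare to delete 0 readings".
assert reading_origin_ms < event_origin_ns
```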

I think the best way to handle this is to standardize the origin time unit to nanoseconds in the OPC UA device service when data is collected.
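The device service itself is written in Go, but the normalization idea can be sketched language-agnostically, for example as below (the magnitude thresholds are an assumption that timestamps are current-era Unix times; this is not the actual device-opc-ua code):

```python
def normalize_origin_to_ns(origin: int) -> int:
    """Best-effort conversion of an origin timestamp to nanoseconds.

    Assumes the value is seconds, milliseconds, microseconds, or nanoseconds
    since the Unix epoch and distinguishes the unit by magnitude.
    """
    if origin < 10**11:      # ~10 digits: seconds
        return origin * 10**9
    if origin < 10**14:      # ~13 digits: milliseconds
        return origin * 10**6
    if origin < 10**17:      # ~16 digits: microseconds
        return origin * 10**3
    return origin            # ~19 digits: already nanoseconds


# The reading origin from the intercepted event, converted to ns:
assert normalize_origin_to_ns(1714456509397) == 1714456509397000000
```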