espressif / esp-insights

ESP Insights: A remote diagnostics/observability framework for connected devices
Apache License 2.0
101 stars 27 forks source link

Problem orchestrating ESP-Insights send with second HTTP POST #43

Open Timmwardion opened 3 months ago

Timmwardion commented 3 months ago

Hi there - implementing ESP-Insights for the first time and struggling with a situation I think has to do with multiple HTTP POST requests happing on my device.

Essentially my device has two key jobs:

The primary task executes well until I try and implement ESP-Insights along-side it.

When I implement ESP-Insights though I get a cascade of things going wrong (see below)

In my mind all of that points to the insights post taking too long (possibly because of task collision with the other POST task), maxing out memory trying to store new metrics, and triggering the dog while waiting for the connection and crashing.

I don't see in any of the documentation a way to set the insights timeout or somehow manage a semaphore between the two tasks so I'm wondering how I can manage this. My primary task has the potential to be fairly active but even if I slowed it down temporarily there is likely to be collisions into the future.

I notice that documentation on transport sharing focusses on MQTT. Does this mean MQTT is recommended? I don't have the rainmaker agent running at this point so unless I'm wrong, it seems like I won't get any benefit from transport sharing unless there is some benefit to getting the insights POST client and my primary POST client aligned.

I also notice that documentation points out that "The RTC memory is limited and flooding too many messages, makes RTC storage full. Once full, the framework has to drop the messages. One solution is to increase the data reporting frequency to cloud, and thereby emptying the RTC store frequently." But I can't find any reference to how to increase the data reporting frequency. I have found the esp_insights_send_data() function but that seems to just add to the rainmaker queue - which if it's blocked or in the process of timing out, won't succeed in purging the metrics in the RTC_STORE.

EDIT: I found the settings for this in menuconfig. You can set max and min interval for posts:

Also below is the minimal code for my implementation of ESP-Insights. pretty much a copy/paste from the examples.

`

W (9573) heap_metrics: Failed to add heap metric key:free
I (9583) esp_insights: esp_diag_data_store_non_critical_write failed group heap, len 40, err 0x0101
I (9593) heap: alloc_fail size:0x738 func:heap_caps_malloc
I (9593) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9603) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9613) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9623) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9623) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9633) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9643) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
I (9653) esp_insights: ESP_DIAG_DATA_STORE_EVENT_NON_CRITICAL_DATA_LOW_MEM
W (9653) esp-tls: Failed to open new connection in specified timeout
I (9663) esp_insights: Log write fail count: 1
RTC_STORE, curr_free 103, req_free 121
I (9673) esp_insights: ESP_DIAG_DATA_STORE_EVENT_CRITICAL_DATA_LOW_MEM
I (9683) esp_insights: esp_diag_data_store_critical_write failed len 120, err 0x0101
E (9663) transport_base: Failed to open a new connection
I (9693) esp_insights: Log write fail count: 2
RTC_STORE, curr_free 103, req_free 121
I (9703) esp_insights: ESP_DIAG_DATA_STORE_EVENT_CRITICAL_DATA_LOW_MEM
I (9713) esp_insights: esp_diag_data_store_critical_write failed len 120, err 0x0101
E (9693) HTTP_CLIENT: Connection failed, sock < 0
I (9723) esp_insights: Log write fail count: 3
RTC_STORE, curr_free 103, req_free 121
I (9733) esp_insights: ESP_DIAG_DATA_STORE_EVENT_CRITICAL_DATA_LOW_MEM
I (9743) esp_insights: esp_diag_data_store_critical_write failed len 120, err 0x0101
E (9723) tport_https: esp_http_client_perform failed err:0x7002
I (9753) esp_insights: meta message send failed
I (5234) esp_core_dump_flash: Core dump data checksum is correct
E (14473) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (14473) task_wdt:  - IDLE (CPU 0)
E (14473) task_wdt: Tasks currently running:
E (14473) task_wdt: CPU 0: rmaker_queue_ta
E (14473) task_wdt: Print CPU 0 (current core) backtrace
Print CPU 0 (current core) registers
Stack dump detected
Core  0 register dump:
MEPC    : 0x420191ac  RA      : 0x42019264  SP      : 0x40868fd0  GP      : 0x4081ab74  
0x420191ac: append_to_buffer at /Users/.../managed_components/espressif__cbor/tinycbor/src/cborencoder.c:276
0x42019264: encode_number_no_update at /Users/...r/managed_components/espressif__cbor/tinycbor/src/cborencoder.c:335

TP      : 0x40869170  T0      : 0x40022494  T1      : 0xffffffe0  T2      : 0x00000000  
0x40022494: multi_heap_internal_unlock in ROM

S0/FP   : 0x40869040  S1      : 0x00000005  A0      : 0x40869040  A1      : 0x40868feb  
A2      : 0x00000005  A3      : 0x00000000  A4      : 0x00000004  A5      : 0x00000010  
A6      : 0x00ff0000  A7      : 0x0000ff00  S2      : 0x00000000  S3      : 0x6f714e78  
S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000  
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000  
T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000  
MSTATUS : 0x00000000  MTVEC   : 0x40869040  MCAUSE  : 0x00000000  MTVAL   : 0x42019264  
0x42019264: encode_number_no_update at /Users/.../managed_components/espressif__cbor/tinycbor/src/cborencoder.c:335`
#include "esp_insights.h"
#include "esp_diagnostics_system_metrics.h"
#include "esp_diagnostics_variables.h"
#include "esp_rmaker_utils.h"

void app_main(void)
{   

 esp_insights_config_t config  = {
        .log_type = ESP_DIAG_LOG_TYPE_ERROR,
        .auth_key = ESP_INSIGHTS_AUTH_KEY,
    };

    esp_err_t i_ret = esp_insights_init(&config);
    if (i_ret != ESP_OK) {
        ESP_LOGE(TAG, "Failed to initialize ESP Insights, err:0x%x", i_ret);
    }
    ESP_ERROR_CHECK(i_ret);

    while(1){
    //other code goes here

    ret = esp_insights_send_data(); //purge insights buffer
        if (ret != ESP_OK) {
            ESP_LOGE(TAG, "Failed to send data to ESP Insights, err:0x%x", ret);
        } else {
            ESP_LOGI(TAG, "Data sent to ESP Insights");
        }
    }

}
}
Timmwardion commented 3 months ago

Update: I've now changed my code to call the primary HTTP post via the esp_rmaker_work_queue_add_task(). Which...honestly does something but the problems are still there.

Seems to mainly be a watchdog issue on the rmaker_queue_task function that seems to run a continuous loop which as far as I can tell has no opportunity for a wdt refresh. See line 58-60 of work_queue.c.

vikramdattu commented 3 months ago

Hi @Timmwardion,

You do not have to explicitly call esp_insights_send_data. This is taken care already by esp_insights. If you still wish to trigger data sends explicitly as well, you should have some delay in the loop.

e.g.,

    while(1){
    //other code goes here

    ret = esp_insights_send_data(); //purge insights buffer
        if (ret != ESP_OK) {
            ESP_LOGE(TAG, "Failed to send data to ESP Insights, err:0x%x", ret);
        } else {
            ESP_LOGI(TAG, "Data sent to ESP Insights");
        }

        /* Trigger data sends every 5 seconds */
        vTaskDelay(pdMS_TO_TICKS(5 * 1000));
    }

This will save you from tirelessly performing the task and avoiding the watch dog.

Timmwardion commented 3 months ago

I've tried that - and tried taking out the explicit call to esp_insights_send_data() all together stripping the implementation back to a minimal version and I still get the WDT timeout on rmaker_queue_task.

Is there a way I can stop the timer-based calls to send data and instead trigger it explicitly in code? I'm wondering if the timer-based sends are the issue.

My firmware runs some continuous loops so it makes sense to sync the insights posts with those loops anyway.

vikramdattu commented 2 months ago

Hi @Timmwardion It is not currently possible to not report insights data via configurations currently. I would encourage you to use internal reporting. It does not cost any extra task, as it simply adds the task to rmaker_task_queue which already is initialised. As far as Watchdog is concerned, you should not tirelessly do the operation. Instead you should have some delay between two report calls.

If you still want to do reporting from your custom code:

  1. What you can do instead is simply increase the reporting interval to a very large number so that the internal reporting is scheduled less often.

  2. Also, if you want to, you can make changes in the esp_insights.c and remove these lines which schedule the reporting: https://github.com/espressif/esp-insights/blob/3318d7f02a38646cbc1bd1559feb2f2f8c55f7bc/components/esp_insights/src/esp_insights.c#L863