0neblock / Arduino_SNMP

SNMP Agent built with Arduino
MIT License

CORRUPT PACKET causes reboot on ESP32 #11

Closed mcpicoli closed 2 years ago

mcpicoli commented 4 years ago

Hi,

While I've successfully implemented SNMP in some of my current projects using this library and the ESP32 platform, I believe there is some kind of problem or bug related to the memory allocation/deallocation done by the library when dealing with corrupt packets in an unreliable network.

For reference, take the example ESP32_SNMP provided.

In the setup routine, add many more integer handlers (like, 20 or so). The content of the monitored variables doesn't matter, they all can be set to zero.

Also, instead of calling "snmp.loop()" in the main loop, I'm running it in an ESP32 task, which is started in the setup routine.

So, boiling it down and removing everything unneeded:

#include <WiFi.h>
#include <WiFiUdp.h>
#include <Arduino_SNMP.h>

const char* ssid = "ssid";
const char* password = "password";
WiFiUDP udp;
SNMPAgent snmp = SNMPAgent("public");  // Starts an SNMPAgent instance with the community string 'public'

int changingNumber = 1;
int changingNumber1 = 1;
int changingNumber2 = 1;
int changingNumber3 = 1;
int changingNumber4 = 1;
int changingNumber5 = 1;
int changingNumber6 = 1;
int changingNumber7 = 1;
int changingNumber8 = 1;
int changingNumber9 = 1;
int changingNumber10 = 1;
int changingNumber11 = 1;

void setup(){
    Serial.begin(115200);
    WiFi.begin(ssid, password);
    Serial.println("");

    // Wait for connection
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("");
    Serial.print("Connected to ");
    Serial.println(ssid);
    Serial.print("IP address: ");
    Serial.println(WiFi.localIP());

    // give snmp a pointer to the UDP object
    snmp.setUDP(&udp);
    snmp.begin();

    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0", &changingNumber);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.1", &changingNumber1);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.2", &changingNumber2);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.3", &changingNumber3);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.4", &changingNumber4);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.5", &changingNumber5);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.6", &changingNumber6);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.7", &changingNumber7);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.8", &changingNumber8);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.9", &changingNumber9);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.10", &changingNumber10);
    snmp.addIntegerHandler(".1.3.6.1.4.1.5.0.11", &changingNumber11);

    // SNMP task instead of inside the loop routine
    xTaskCreate(snmp_task, "SNMP", 4000, NULL, 1, NULL);

    // In my real project, there are many other tasks running, but so far, at least one core is able to run the task all the time.
}

void loop(){
    // NO snmp.loop() here, it's inside the task
    // snmp.loop(); // must be called as often as possible
}

void snmp_task(void * p)
{
  while (true)
  {
    snmp.loop();
    delay(1);

    // Never returns...
  }
}

Now, what happens in unreliable networks is that the NMS often reports "timeout" when reading the values (as expected, since it is UDP based), but sometimes the ESP32 reboots. It dumps the following:

OID: .1.3.6.1.4.1.5.0.8
CORRUPT PACKET
Guru Meditation Error: Core  1 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x14abba12  PS      : 0x00060c30  A0      : 0x800d4b22  A1      : 0x3ffd5030  
A2      : 0x14abba12  A3      : 0x00000000  A4      : 0x0000002b  A5      : 0x00000020  
A6      : 0x3ffd67b7  A7      : 0x0000005a  A8      : 0x800d4264  A9      : 0x3ffd5010  
A10     : 0x3ffd5878  A11     : 0x0000000e  A12     : 0x0000002b  A13     : 0x3ffbfc68  
A14     : 0x00000000  A15     : 0x00023980  SAR     : 0x00000010  EXCCAUSE: 0x00000014  
EXCVADDR: 0x14abba10  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xffffffff  

Backtrace: 0x14abba12:0x3ffd5030 0x400d4b1f:0x3ffd5050 0x400d4b83:0x3ffd50a0 0x400d4b8e:0x3ffd50c0 0x4008943d:0x3ffd50e0

Rebooting...

Other times, the error reported is "LoadProhibited" and some other times, "corrupt heap".

In a minority of cases, the "CORRUPT PACKET" message does not cause any problem.

The more reliable the network, the less pronounced the problem, but any extended run time will eventually result in a reboot.

I also tried putting both the UDP client and the SNMPAgent instance in the ESP32's RAM using the DRAM_ATTR, to no avail.

Thanks in advance.
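
The thread does not establish the root cause at this point. As a general illustration of the failure class described above (a truncated or damaged packet leading to bad memory accesses), here is a minimal sketch of a bounds-checked read of a BER length field, the encoding SNMP messages use. The helper name `readBerLength` and its signature are hypothetical and are not part of Arduino_SNMP; this is not the library's parser.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, NOT part of Arduino_SNMP: reads one BER length field
// starting at buf[pos] without ever reading past buf[len - 1].
// On success it returns true, stores the decoded length in outLength and
// advances pos past the length field. On failure it returns false and leaves
// pos and outLength untouched, so the caller can simply drop the packet.
bool readBerLength(const uint8_t* buf, size_t len, size_t& pos, size_t& outLength) {
    size_t p = pos;                              // work on a copy, commit only on success
    if (p >= len) return false;                  // nothing left to read
    uint8_t first = buf[p++];
    size_t value = 0;
    if ((first & 0x80) == 0) {
        value = first;                           // short form: length is in the low 7 bits
    } else {
        size_t numBytes = first & 0x7F;          // long form: count of length bytes that follow
        if (numBytes == 0 || numBytes > sizeof(size_t)) return false;  // indefinite or oversized
        if (numBytes > len - p) return false;    // the length bytes themselves are truncated
        for (size_t i = 0; i < numBytes; ++i) {
            value = (value << 8) | buf[p++];
        }
    }
    if (value > len - p) return false;           // declared content does not fit in what arrived
    pos = p;
    outLength = value;
    return true;
}
```

The point of the sketch is the ordering: every declared length is compared against the bytes actually received before anything is copied or allocated based on it.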

0neblock commented 4 years ago

Hi,

Sorry about this. I'm running a system with upwards of 100 OIDs and I'm not seeing this error on my end. I'm not spawning a separate task, though, just running in the Arduino loop. I don't see why that would be an issue, except possibly for stack problems. Could you try increasing the stack size of 4000 passed to xTaskCreate to something a bit larger, and see if that helps?

Could you also try to get the file positions of the crash using the Backtrace? Run a command like this next time you see a crash, or with the Backtrace that you've provided above:

xtensa-esp32-elf-addr2line -e build/app.elf 0x14abba12:0x3ffd5030 0x400d4b1f:0x3ffd5050 0x400d4b83:0x3ffd50a0 0x400d4b8e:0x3ffd50c0 0x4008943d:0x3ffd50e0

Please let me know where that pinpoints the crash down to, and I can have a further look.

Thanks.
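
Each entry in an ESP32 "Backtrace:" line is a PC:SP pair, and only the PC half is a code address. A minimal sketch of pulling out the PCs and assembling the addr2line invocation follows; the helper names are hypothetical, and the tool and ELF paths are whatever the local build uses. The `-pfiaC` flags are standard GNU addr2line options (pretty-print, function names, inlined frames, addresses, demangling), which also answers the question of which function each frame is in.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits an ESP32 "Backtrace:" line into its PC values.
// Each entry is printed as PC:SP; only the part before the colon is kept.
std::vector<std::string> extractProgramCounters(const std::string& backtrace) {
    std::vector<std::string> pcs;
    std::istringstream in(backtrace);
    std::string token;
    while (in >> token) {
        if (token.rfind("0x", 0) != 0) continue;           // skip the "Backtrace:" label
        pcs.push_back(token.substr(0, token.find(':')));   // drop the ":SP" suffix if present
    }
    return pcs;
}

// Builds the command line as a string; toolPath and elfPath are supplied by the caller.
std::string buildAddr2lineCommand(const std::string& toolPath,
                                  const std::string& elfPath,
                                  const std::vector<std::string>& pcs) {
    std::string cmd = toolPath + " -pfiaC -e " + elfPath;
    for (const std::string& pc : pcs) {
        cmd += " " + pc;
    }
    return cmd;
}
```

The ELF passed with `-e` has to come from the exact build that is running on the board, otherwise the addresses resolve to unrelated lines.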

mcpicoli commented 4 years ago

Hi,

Thanks for the stack size insight. I'll try it. However, the stack trace I posted before is from the "unboiled down" code, so I'll have to rebuild the bad network scenario here and debug it using the code I sent before.

In the meantime (before your comment) I set up another network connection (very reliable in this case) and haven't seen a "CORRUPT PACKET" message ever since. The problem is still very relevant to me because I expect bad network connections in real world scenarios.

I'll post an update here as soon as I have this sorted.

mcpicoli commented 4 years ago

I am having a lot of trouble recreating the bad network scenario here using Wi-Fi. So, I did it with the full "unboiled down" version of my sketch. It uses Ethernet via a LAN8720 PHY and does a lot of other things besides SNMP.

I did as you said. Stack size was doubled, and nothing changed.

The stack trace for:

CORRUPT PACKET
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x400d4c0c  PS      : 0x00060c30  A0      : 0x800d4cae  A1      : 0x3ffd5230  
A2      : 0x3ffd689c  A3      : 0x3ffd56e4  A4      : 0x00000000  A5      : 0x00000000  
A6      : 0x00000000  A7      : 0x3ffbf6bc  A8      : 0x800d4c26  A9      : 0x3ffd5210  
A10     : 0x3ffd6428  A11     : 0x3f401754  A12     : 0x3ffd649c  A13     : 0x00000009  
A14     : 0x3ffd617f  A15     : 0x0000005a  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xffffffff  

Backtrace: 0x400d4c0c:0x3ffd5230 0x400d4cab:0x3ffd5280 0x400d4cb6:0x3ffd52a0 0x4008943d:0x3ffd52c0

Resulted in:

<drive_omitted>:\<path_omitted>\<sketch_name_omitted>.ino:2488
<drive_omitted>:\<path_omitted>\<sketch_name_omitted>.ino:2488
<drive_omitted>:\<path_omitted>\<sketch_name_omitted>.ino:2488
/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/port.c:355 (discriminator 1)

For reference, the line 2488 (and the nearby lines) reads:

// ------ Parameter sanity checks ------ //
#if (defined(RELE_1) && RELE_1 == true) || (defined(RELE_2) && RELE_2 == true)
int sanidade_parametro_rele(bool valor)
{
  // Is there no possibility of error here? (return value = 2)
  return valor?1:0;
}
#endif

#ifdef SENSOR_CONSUMO_PRESENTE
uint8_t sanidade_parametro_interface(uint8_t interface)
{
  // A slightly odd construct, but it makes sense when you consider that we may have a list with "holes" in the middle because one or more of the interfaces are dedicated to other kinds of use.
  switch (interface)
  {
    #if defined(SENSOR_CONSUMO_1) && SENSOR_CONSUMO_1 == true
    case 1:
      return 1; break;
    #endif
    #if defined(SENSOR_CONSUMO_2) && SENSOR_CONSUMO_2 == true
    case 2:
      return 2; break;
    #endif
    #if defined(SENSOR_CONSUMO_3) && SENSOR_CONSUMO_3 == true
    case 3:
      return 3; break;
    #endif
    #if defined(SENSOR_CONSUMO_4) && SENSOR_CONSUMO_4 == true
    case 4:
      return 4; break;
    #endif
    default:
      return 0;
  }
}

(Line 2488 is the last line of the first function). They're completely irrelevant to the SNMP code.

Sometimes, the error is "InstrFetchProhibited" instead of "LoadProhibited", but the stack trace leads to exactly the same result.

Any insights?
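
For reading the register dumps in this thread: the EXCCAUSE field carries the exception type (0x14 in the InstrFetchProhibited dump, 0x1c in the LoadProhibited dumps) and EXCVADDR is the address that faulted. A minimal sketch of that reading follows. The helper name is hypothetical, only the two values that appear in this thread are mapped, and the hints are consistent with the dumps rather than a diagnosis.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for reading the dumps in this thread. Only the two
// EXCCAUSE values that actually appear here are mapped; everything else
// is reported as "other".
std::string describeException(uint32_t exccause, uint32_t excvaddr) {
    std::string text;
    switch (exccause) {
        case 0x14: text = "InstrFetchProhibited"; break;
        case 0x1c: text = "LoadProhibited"; break;
        default:   text = "other"; break;
    }
    if (exccause == 0x1c && excvaddr == 0) {
        // A load from address 0 is what a null pointer dereference looks like.
        text += " (load from address 0, consistent with a null pointer)";
    }
    if (exccause == 0x14) {
        // The CPU tried to fetch an instruction from a non-executable address.
        text += " (consistent with a corrupted function pointer or return address)";
    }
    return text;
}
```

Both readings point at pointers being damaged before the crash, which fits the "corrupt heap" variant reported earlier in the thread, though none of this identifies where the damage happens.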

0neblock commented 4 years ago

Hey,

Can you just double check that you are passing the correct ELF file to addr2line (the one from the latest build)? It doesn't make sense that different backtrace addresses would resolve to the same line in a file. Does the backtrace output mention which function each address is in, as it usually should? The crash does not make sense in that file location, especially if it is only happening after an SNMP crash.

mcpicoli commented 4 years ago

You're right. The backtrace didn't make any sense.

The problem is that simply recompiling the same code (+) seemingly doesn't recompile everything. I tried deleting the ELF file and recompiling again, it was regenerated, but the same result of the backtrace was obtained.

However, when I changed the code (added a dummy "Serial.println()"), it said "settings changed, recompiling everything" (translated from Portuguese), and then, when uploaded, the backtrace changed, and now it makes sense.

The error is:

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x400d4c0c  PS      : 0x00060030  A0      : 0x800d4cae  A1      : 0x3ffd5220  
A2      : 0x3ffd56c4  A3      : 0x3ffd17d8  A4      : 0x00000000  A5      : 0x00000000  
A6      : 0x00000000  A7      : 0x3ffbf6bc  A8      : 0x800d4c04  A9      : 0x3ffd5200  
A10     : 0x00000010  A11     : 0x3f401754  A12     : 0x3ffd55e8  A13     : 0x00000009  
A14     : 0x3ffd5ad3  A15     : 0x0000005a  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xffffffff  

Backtrace: 0x400d4c0c:0x3ffd5220 0x400d4cab:0x3ffd5270 0x400d4cb6:0x3ffd5290 0x4008943d:0x3ffd52b0

And the result of the backtrace is:

C:\Users\<user_omitted>\Documents\Arduino\libraries\Arduino_SNMP-master/Arduino_SNMP.h:457
C:\Users\<user_omitted>\Documents\Arduino\libraries\Arduino_SNMP-master/Arduino_SNMP.h:457
C:\Users\<user_omitted>\Documents\Arduino\libraries\Arduino_SNMP-master/Arduino_SNMP.h:457
/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/port.c:355 (discriminator 1)

Line 457 reads:

ValueCallback* SNMPAgent::addOIDHandler(char* oid, char* value, bool overwritePrefix){
    ValueCallback* callback = new OIDCallback();
    callback->overwritePrefix = overwritePrefix;
    callback->OID = (char*)malloc((sizeof(char) * strlen(oid)) + 1);
    strcpy(callback->OID, oid);
    ((OIDCallback*)callback)->value = value;
    addHandler(callback);
    return callback;
}

(Line 457 is the last line of the function)

Sorry for wasting your time in the previous comment. Annoyingly, the backtrace:

0x50abba12:0x3ffd5140 0x400d4c47:0x3ffd5160 0x400d4cab:0x3ffd51b0 0x400d4cb6:0x3ffd51d0 0x4008943d:0x3ffd51f0

resolves to the same lines...

0neblock commented 2 years ago

Hey, sorry for the delay, this should be fixed by #23