Lora-net / sx1302_hal

SX1302/SX1303 Hardware Abstraction Layer and Tools (packet forwarder...)

lora_pkt_fwd exit with ERROR: Parity error check failed on AGC firmware #14

Closed jerryyip closed 4 years ago

jerryyip commented 4 years ago

lora_pkt_fwd always exits with the following ERROR after running a few minutes, does anyone know the reason?

ERROR: Parity error check failed on AGC firmware
ERROR: [up] failed packet fetch, exiting

mcoracin commented 4 years ago

Hello, have you seen this several times? Which gateway are you using? How is it powered?

jerryyip commented 4 years ago

Hi @mcoracin, yes, it always happens after a few minutes. I am using an SX1302 PCIe module, powered with 3.3 V from our LoRa gateway.

mcoracin commented 4 years ago

Ok, I've never seen this, but I will create an issue in our internal bug tracker, and will investigate.

mcoracin commented 4 years ago

Hi @jerryyip, since it seems you can easily reproduce the issue on your gateway, would it be possible for you to add some debug info to the sx1302_hal, so that we can narrow down the issue?

The idea would be to replace the sx1302_update() function you have with this one:

int sx1302_update(void) {
    int32_t val;

    /* Check MCUs parity errors */
    lgw_reg_r(SX1302_REG_AGC_MCU_CTRL_PARITY_ERROR, &val);
    if (val != 0) {
        printf("ERROR: Parity error check failed on AGC firmware\n");

        uint8_t fw_check[MCU_FW_SIZE];
        int i;
        lgw_reg_w(SX1302_REG_AGC_MCU_CTRL_MCU_CLEAR, 0x01);
        lgw_reg_w(SX1302_REG_AGC_MCU_CTRL_HOST_PROG, 0x01);
        lgw_reg_w(SX1302_REG_COMMON_PAGE_PAGE, 0x00);
        lgw_mem_rb(AGC_MEM_ADDR, fw_check, MCU_FW_SIZE, false);
        for (i = 0; i < MCU_FW_SIZE; i++) {
            if ((i % 16) == 0) {
                printf("\n");
            }
            printf("%02X ", fw_check[i]);
        }
        printf("\n");

        return LGW_REG_ERROR;
    }
    lgw_reg_r(SX1302_REG_ARB_MCU_CTRL_PARITY_ERROR, &val);
    if (val != 0) {
        printf("ERROR: Parity error check failed on ARB firmware\n");
        return LGW_REG_ERROR;
    }

    /* Update internal timestamp counter wrapping status */
    timestamp_counter_get(&counter_us, false); /* maintain inst counter */
    timestamp_counter_get(&counter_us, true); /* maintain pps counter */

    return LGW_REG_SUCCESS;
}

That way, when the check fails, we read the firmware back from memory and can tell whether it is actually corrupted or whether the problem lies elsewhere.

You can either post the resulting logs here, or compare them yourself with what the sx1302_agc_load_firmware() function writes (the content of the "firmware" variable).

Thanks, Michael
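
The comparison suggested above could also be done in code rather than by eye. The following is only a minimal sketch: fw_count_mismatches() is a hypothetical helper, not part of sx1302_hal, and it assumes both the firmware image and the read-back bytes are available as plain byte arrays of the same size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of sx1302_hal): counts the bytes that differ
   between the firmware image that was written and the bytes read back from
   MCU memory. Prints at most the first 8 differences to keep the log short. */
static size_t fw_count_mismatches(const uint8_t *expected,
                                  const uint8_t *readback,
                                  size_t size) {
    size_t mismatches = 0;

    assert(expected != NULL && readback != NULL);

    for (size_t i = 0; i < size; i++) {
        if (expected[i] != readback[i]) {
            if (mismatches < 8) {
                printf("offset 0x%04zX: expected %02X, read %02X\n",
                       i, (unsigned)expected[i], (unsigned)readback[i]);
            }
            mismatches++;
        }
    }
    return mismatches;
}
```

Inside the debug function above it might be called as fw_count_mismatches(firmware, fw_check, MCU_FW_SIZE), assuming "firmware" is the array that was passed to sx1302_agc_load_firmware(). A result of 0 would suggest the firmware in memory is intact and the parity error has another cause.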

jerryyip commented 4 years ago

Hi @mcoracin, I compared the result logs with agc_fw_sx1250.var, and found no difference.

I also found that if I turn off all the nearby nodes, so that the sx1302 module doesn't receive any packets, the issue doesn't occur.

mcoracin commented 4 years ago

Hi @jerryyip, ok, so that means that the AGC firmware is not actually corrupted. Could you try to run the test_loragw_spi tool, which you can find in libloragw/, in order to check the stability of your SPI connection in a nominal case? Please let it run for a few minutes or hours.
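
For readers unfamiliar with this kind of link test, the general idea of a write/read-back stability check can be sketched as below. This is a generic illustration, not the actual code of test_loragw_spi: the function-pointer hooks, the loopback_first_failure() name, and the in-memory fake_* backends are all hypothetical and exist only so that the sketch runs without hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOOPBACK_MAX_SIZE 256

/* Hypothetical transport hooks. On real hardware they would wrap the HAL's
   burst write/read calls. Both return 0 on success. */
typedef int (*loopback_write_fn)(const uint8_t *data, size_t size);
typedef int (*loopback_read_fn)(uint8_t *data, size_t size);

/* Writes random data, reads it back and compares, for the given number of
   cycles. Returns -1 if every cycle matched, otherwise the index of the
   first failing cycle. */
static long loopback_first_failure(loopback_write_fn wr, loopback_read_fn rd,
                                   long cycles, size_t size) {
    uint8_t out[LOOPBACK_MAX_SIZE];
    uint8_t in[LOOPBACK_MAX_SIZE];

    assert(wr != NULL && rd != NULL);
    assert(size > 0 && size <= LOOPBACK_MAX_SIZE);

    for (long c = 0; c < cycles; c++) {
        for (size_t i = 0; i < size; i++) {
            out[i] = (uint8_t)(rand() & 0xFF);
        }
        memset(in, 0, size);
        if (wr(out, size) != 0 || rd(in, size) != 0) {
            printf("cycle %ld: transfer error\n", c);
            return c;
        }
        if (memcmp(out, in, size) != 0) {
            printf("cycle %ld: read-back mismatch\n", c);
            return c;
        }
    }
    return -1;
}

/* In-memory stand-ins for the SPI link (assumptions for this sketch). */
static uint8_t fake_mem[LOOPBACK_MAX_SIZE];

static int fake_write(const uint8_t *data, size_t size) {
    memcpy(fake_mem, data, size);
    return 0;
}

static int fake_read(uint8_t *data, size_t size) {
    memcpy(data, fake_mem, size);
    return 0;
}

/* Simulates an unstable link by flipping one bit of the first byte. */
static int fake_read_flaky(uint8_t *data, size_t size) {
    memcpy(data, fake_mem, size);
    data[0] ^= 0x01;
    return 0;
}
```

With the healthy backend, loopback_first_failure(fake_write, fake_read, 100, 16) returns -1, while the flaky backend fails on the first cycle. On a real gateway, a failure after some minutes of such a loop would point toward the SPI link or power supply rather than the firmware itself.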

jerryyip commented 4 years ago

Hi @mcoracin, this problem hasn't happened in the last 10 days, for no apparent reason. Feel free to close this issue; I will update if it occurs again. Thanks.

mcoracin commented 4 years ago

@jerryyip, ok, thank you.

KubaFYI commented 3 years ago

> Hi @jerryyip, ok, so that means that the AGC firmware is not actually corrupted. Could you try to run the test_loragw_spi tool, which you can find in libloragw/, in order to check the stability of your SPI connection in a nominal case? Please let it run for a few minutes or hours.

I've been running into similar issues, and test_loragw_spi does indeed fail, usually after a couple of minutes. Does that mean that my SPI connection is unstable?