esphome / issues

Issue Tracker for ESPHome
https://esphome.io/

SML crashes on ZPA smart meter when enabling extended info #6151

Open gitolicious opened 1 month ago

gitolicious commented 1 month ago

The problem

TLDR: SML parser crashes when my smart meter ZPA GH305 sends its extended dataset. Raw data below.

Long version:

As there is a lot of activity going on surrounding the SML parser at the moment (https://github.com/esphome/esphome/pull/6148, https://github.com/esphome/esphome/pull/7235, https://github.com/esphome/issues/issues/6071), I want to share my raw data as requested by @eNBeWe and hope we can find a solution together.

I just had a new ZPA GH305.D-S2-01.00-30G installed by Westnetz (Germany). For reference, here is the manual of another network operator (Netzbetreiber) providing the same smart meter: ZPA GH305 and the Tasmota config: https://tasmota.github.io/docs/Smart-Meter-Interface/#zpa-gh305-sml.

ESPHome runs fine and decodes the SML with the reduced dataset (manufacturer code, ID, total consumption, total delivery), but it crashes when I enter the PIN and enable the extended dataset with the INFO switch, which adds a lot of other values to the SML message.

With the SML parser disabled and the UART debug log enabled, I captured the following (filtered to OBIS entries starting with 77 07 01 00; a few details are censored for privacy; let me know if you need the full dump):

77 07 01 00 60 32 01 01 01 01 01 01 04 5A 50 41 01 
77 07 01 00 60 01 00 FF 01 01 01 01 0B xx xx xx xx xx xx xx xx xx xx 01 
77 07 01 00 01 08 00 FF 65 00 1C 01 04 01 62 1E 52 FF 69 00 00 00 00 00 13 F9 12 01 
77 07 01 00 02 08 00 FF 01 01 62 1E 52 FF 69 00 00 00 00 00 01 9C E0 01 
77 07 01 00 0E 07 00 FF 01 01 62 2C 52 FE 69 00 00 00 00 00 00 13 8A 01 
77 07 01 00 00 02 00 00 01 01 01 01 03 30 31 01 
77 07 01 00 60 5A 02 01 01 01 01 01 xx xx xx xx xx 01 
77 07 01 00 61 61 00 FF 01 01 01 01 05 00 00 00 00 01 
77 07 01 00 60 05 00 FF 01 01 01 01 05 00 1C 01 04 01 
77 07 01 00 10 07 00 FF 01 01 62 1B 52 00 59 00 00 00 00 00 00 09 5B 01 
77 07 01 00 24 07 00 FF 01 01 62 1B 52 00 59 00 00 00 00 00 00 00 15 01 
77 07 01 00 38 07 00 FF 01 01 62 1B 52 00 59 00 00 00 00 00 00 01 0E 01 
77 07 01 00 4C 07 00 FF 01 01 62 1B 52 00 59 00 00 00 00 00 00 08 37 01 
77 07 01 00 20 07 00 FF 01 01 62 23 52 FE 69 00 00 00 00 00 00 5D A7 01 
77 07 01 00 34 07 00 FF 01 01 62 23 52 FE 69 00 00 00 00 00 00 5D 8E 01
77 07 01 00 48 07 00 FF 01 01 62 23 52 FE 69 00 00 00 00 00 00 5D 6A 01 
77 07 01 00 1F 07 00 FF 01 01 62 21 52 FD 69 00 00 00 00 00 00 00 73 01 
77 07 01 00 33 07 00 FF 01 01 62 21 52 FD 69 00 00 00 00 00 00 05 3E 01 
77 07 01 00 47 07 00 FF 01 01 62 21 52 FD 69 00 00 00 00 00 00 22 57 01 
77 07 01 00 51 07 01 FF 01 01 62 08 52 FF 59 00 00 00 00 00 00 04 9F 01 
77 07 01 00 51 07 02 FF 01 01 62 08 52 FF 59 00 00 00 00 00 00 09 49 01 
77 07 01 00 51 07 04 FF 01 01 62 08 52 FF 59 00 00 00 00 00 00 0D 6E 01 
77 07 01 00 51 07 0F FF 01 01 62 08 52 FF 59 00 00 00 00 00 00 0D E8 01 
77 07 01 00 51 07 1A FF 01 01 62 08 52 FF 59 00 00 00 00 00 00 0E 0D 01 xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
Decoded messages (decoded by https://tasmota-sml-parser.dicp.net/):

|OBIS (hex)|OBIS|Name|Value|Unit|Parsed|
|--- |--- |--- |--- |--- |--- |
|0x010000020000|0.2.0|Unknown data type|01|Unknown unit|01Unknown unit (Unknown data type)|
|0x0100010800ff|1.8.0|Total meter reading|1308946|Wh|130894.6Wh (Total meter reading)|
|0x0100020800ff|2.8.0|Total active energy|105696|Wh|10569.6Wh (Total active energy)|
|0x01000e0700ff|14.7.0|Grid frequency|5002|Hz|50.02Hz (Grid frequency)|
|0x0100100700ff|16.7.0|Current active power|2395|W|2395W (Current active power)|
|0x01001f0700ff|31.7.0|Current L1|115|A|0.115A (Current L1)|
|0x0100200700ff|32.7.0|Voltage L1|23975|V|239.75V (Voltage L1)|
|0x0100240700ff|36.7.0|Active power L1|21|W|21W (Active power L1)|
|0x0100330700ff|51.7.0|Current L2|1342|A|1.342A (Current L2)|
|0x0100340700ff|52.7.0|Voltage L2|23950|V|239.5V (Voltage L2)|
|0x0100380700ff|56.7.0|Active power L2|270|W|270W (Active power L2)|
|0x0100470700ff|71.7.0|Current L3|8791|A|8.791A (Current L3)|
|0x0100480700ff|72.7.0|Voltage L3|23914|V|239.14V (Voltage L3)|
|0x01004c0700ff|76.7.0|Active power L3|2103|W|2103W (Active power L3)|
|0x0100510701ff|81.7.1|Phase deviation voltages L1/L2|1183|°|118.3° (Phase deviation voltages L1/L2)|
|0x0100510702ff|81.7.2|Phase deviation voltages L1/L3|2377|°|237.7° (Phase deviation voltages L1/L3)|
|0x0100510704ff|81.7.4|Phase deviation current/voltage L1|3438|°|343.8° (Phase deviation current/voltage L1)|
|0x010051070fff|81.7.15|Phase deviation current/voltage L2|3560|°|356.0° (Phase deviation current/voltage L2)|
|0x010051071aff|81.7.26|Phase deviation current/voltage L3|3597|°|359.7° (Phase deviation current/voltage L3)|
|0x0100600100ff|96.1.0|Unknown data type|xxxxxxxxxxxxxxxxxxxx|Unknown unit|xxxxxxxxxxxxxxxxxxxxUnknown unit (Unknown data type)|
|0x0100600500ff|96.5.0|Unknown data type|001c0104|Unknown unit|001c0104Unknown unit (Unknown data type)|
|0x010060320101|96.50.1|Unknown data type|ZPA|Unknown unit|ZPAUnknown unit (Unknown data type)|
|0x0100605a0201|96.90.2|Unknown data type|xxxxxxxx|Unknown unit|xxxxxxxxUnknown unit (Unknown data type)|
|0x0100616100ff|97.97.0|Unknown data type|00000000|Unknown unit|00000000Unknown unit (Unknown data type)|
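For anyone reading the raw dump by hand, here is a minimal C++ sketch of the two steps that turn an entry into the decoded value: splitting a type-length (TL) byte and applying the power-of-ten scaler. The names `SmlType`, `TypeLength`, `decode_tl`, `to_unsigned` and `apply_scaler` are made up for this illustration and are not part of the ESPHome sml component. It assumes the single-byte TL encoding, which is what every visible entry in the dump uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative names only, not taken from ESPHome.
enum class SmlType : uint8_t { OctetString = 0, Boolean = 4, Integer = 5, Unsigned = 6, List = 7 };

struct TypeLength {
  SmlType type;
  uint8_t length;  // lists: number of elements; other types: size in bytes including the TL byte
};

// Split a single-byte TL field: bits 6..4 hold the type, bits 3..0 the length.
TypeLength decode_tl(uint8_t tl) {
  assert((tl & 0x80) == 0);  // bit 7 would announce a multi-byte length, not handled here
  return {static_cast<SmlType>((tl >> 4) & 0x07), static_cast<uint8_t>(tl & 0x0F)};
}

// Interpret big-endian payload bytes as an unsigned integer.
uint64_t to_unsigned(const std::vector<uint8_t>& payload) {
  assert(payload.size() <= 8);
  uint64_t value = 0;
  for (uint8_t b : payload) value = (value << 8) | b;
  return value;
}

// Apply the signed power-of-ten scaler (the byte following the 0x52 TL byte).
double apply_scaler(uint64_t raw, int8_t scaler) {
  return static_cast<double>(raw) * std::pow(10.0, scaler);
}
```

Applied to the 14.7.0 line of the dump: 0x77 is a list of 7 elements, 0x07 an octet string with 6 payload bytes (the OBIS code), 0x52 0xFE a scaler of -2, and 0x69 an unsigned value with 8 payload bytes. The payload ends in 13 8A, i.e. 5002, and `apply_scaler(5002, -2)` gives roughly 50.02, which agrees with the table.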

Which version of ESPHome has the issue?

2024.7.3, same with 2024.6.0

What type of installation are you using?

Home Assistant Add-on

Which version of Home Assistant has the issue?

2024.8.1

What platform are you using?

ESP8266

Board

ESP01

Component causing the issue

sml

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

gitolicious commented 1 month ago

@eNBeWe, @passionsfrucht, @irgendwienet, would one of you share your way of debugging SML issues? What is the best way to "replay" the raw data from within ESPHome to identify the problematic line?

irgendwienet commented 1 month ago

Hi @gitolicious

I used my PC connected with an IR reader and a TTL serial to USB converter to record a few seconds of data. Then I extracted some SML records by hand. You could probably also use an ESP to record raw serial data.
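Extracting the records does not have to be done by hand. The following is a minimal sketch, assuming the recording uses the SML transport escape sequences (start `1B 1B 1B 1B 01 01 01 01`, end `1B 1B 1B 1B 1A` followed by a fill-byte count and a two-byte CRC). The names `Bytes` and `extract_sml_frames` are hypothetical and not taken from ESPHome. It does not handle escaped `1B 1B 1B 1B` runs inside the payload, so treat it as a starting point only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// SML transport escape sequences: start-of-file and end-of-file markers.
const std::array<uint8_t, 8> kStart = {0x1B, 0x1B, 0x1B, 0x1B, 0x01, 0x01, 0x01, 0x01};
const std::array<uint8_t, 5> kEnd = {0x1B, 0x1B, 0x1B, 0x1B, 0x1A};
const std::ptrdiff_t kTrailer = 3;  // fill-byte count + 2 CRC bytes after the end marker

// Return every complete frame in `stream`, each including start marker,
// end marker and the 3 trailing bytes. Incomplete frames are dropped.
std::vector<Bytes> extract_sml_frames(const Bytes& stream) {
  std::vector<Bytes> frames;
  auto pos = stream.begin();
  while (true) {
    auto start = std::search(pos, stream.end(), kStart.begin(), kStart.end());
    if (start == stream.end()) break;
    auto body = start + static_cast<std::ptrdiff_t>(kStart.size());
    auto end = std::search(body, stream.end(), kEnd.begin(), kEnd.end());
    const std::ptrdiff_t needed = static_cast<std::ptrdiff_t>(kEnd.size()) + kTrailer;
    if (end == stream.end() || stream.end() - end < needed) break;
    auto frame_end = end + needed;
    assert(frame_end <= stream.end());
    frames.emplace_back(start, frame_end);
    pos = frame_end;
  }
  return frames;
}
```

Each returned frame can then be saved to its own file or pasted as a hex string into a test program.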

Finally I managed to get the SML code running on Windows in Visual Studio and loaded these files. This gave me the ability to debug the code on Windows. That said, I come from a C# background with little experience in plain C.

I dug out that solution of mine: SmlConsoleApplication.cpp is the entry point and the other files come from the esphome repo. Maybe this could be a starting point for you: sml-debug.zip

This PDF was also useful: TR-03109-1_Anlage_Feinspezifikation_Drahtgebundene_LMN-Schnittstelle_Teilb.pdf

eNBeWe commented 1 month ago

I hacked together a small main() function inside sml_parser where I could just dump hard-coded byte streams. And I built a small library of SML files according to the BMI specifications.

eNBeWe commented 1 month ago

@gitolicious As far as I can tell all the messages you posted pass the parser. So I guess the issue needs to be either in the private data (the xx bytes) or in the surrounding envelope. Could you maybe send a raw dump? I guess it should still be okay if you adjust some bytes (jumble some numbers to other numbers) but keep the "class" of hex values (numbers, digits, etc.)

gitolicious commented 1 month ago

This is valuable input, thanks guys! Let me see if I can find the issue myself with @irgendwienet's helper code. Otherwise I might come back to your offer to look into the full dump.

gitolicious commented 4 weeks ago

Alright, debugging on the PC turned out to be a very good idea and much more comfortable than debugging on the ESP. Unfortunately it did not reveal the issue, as the full dump decodes correctly there. I would have expected a failure at the point where the ESP crashes.

Using the format I gathered from above, I ended up with this code:

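#include <algorithm>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

#include "sml_parser.h"  // from the esphome repo (esphome/components/sml); adjust the path to your setup

// Turns a hex string without spaces into bytes; its definition is not shown in this post.
std::vector<uint8_t> hex_to_bytes(const std::string& hexString);

int main(int argc, char* argv[])
{
    std::string hexString = "76 05 00 64 21 ...";

    // remove all spaces from the hex string
    hexString.erase(std::remove(hexString.begin(), hexString.end(), ' '), hexString.end());

    // convert hex string to byte array
    std::vector<uint8_t> byteArray = hex_to_bytes(hexString);

    // parse bytes to SML
    esphome::sml::SmlFile sml_file = esphome::sml::SmlFile(byteArray);
    std::vector<esphome::sml::ObisInfo> obis_info = sml_file.get_obis_info();

    // print result to stdout
    std::cout << "OBIS message size: " << obis_info.size() << std::endl << std::endl;
    for (const auto& info : obis_info) {
        // OBIS code, left-aligned and padded with spaces to 12 characters
        std::cout << std::left << std::setw(12) << std::setfill(' ') << info.code_repr() << "| ";

        // std::left is sticky, so switch to std::right here; otherwise 0x01 prints as "10" instead of "01"
        for (const auto& byte : info.value) {
            std::cout << std::right << std::hex << std::setw(2) << std::setfill('0') << (int)byte;
        }
        std::cout << std::dec << std::endl;
    }

    return 0;
}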

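Output (captured with std::left still active while the value bytes were printed, so a byte below 0x10 shows its padding on the right, e.g. 0x01 appears as 10):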

OBIS message size: 24

1-0:96.50.1 | 5a5041
1-0:96.1.0  | a010xxxxxxxxxxxxxxxx
1-0:1.8.0   | 000000000013f8e3
1-0:2.8.0   | 0000000000109ce0
1-0:14.7.0  | 000000000000138b
1-0:0.2.0   | 3031
1-0:96.90.2 | 7249a01d
1-0:97.97.0 | 00000000
1-0:96.5.0  | 001c1040
1-0:16.7.0  | 0000000000009061
1-0:36.7.0  | 0000000000000015
1-0:56.7.0  | 0000000000001012
1-0:76.7.0  | 0000000000008039
1-0:32.7.0  | 0000000000005dab
1-0:52.7.0  | 0000000000005d98
1-0:72.7.0  | 0000000000005d73
1-0:31.7.0  | 0000000000000073
1-0:51.7.0  | 000000000000504b
1-0:71.7.0  | 000000000000225e
1-0:81.7.1  | 000000000000409f
1-0:81.7.2  | 000000000000904a
1-0:81.7.4  | 000000000000d06e
1-0:81.7.15 | 000000000000d0e7
1-0:81.7.26 | 000000000000e0c0
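The `hex_to_bytes` helper called in the program above is not shown in the thread. A minimal stand-in could look like the following sketch. It is an assumption about what the helper does (convert a string of hex digit pairs, spaces already removed, into bytes), not the original implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the helper that the thread does not show.
// Expects an even number of hex digits with no separators.
std::vector<uint8_t> hex_to_bytes(const std::string& hex) {
  assert(hex.size() % 2 == 0);
  std::vector<uint8_t> out;
  out.reserve(hex.size() / 2);
  for (std::size_t i = 0; i < hex.size(); i += 2) {
    // std::stoul throws std::invalid_argument if the pair does not start with a hex digit.
    out.push_back(static_cast<uint8_t>(std::stoul(hex.substr(i, 2), nullptr, 16)));
  }
  return out;
}
```

Note that a literal `...` left in the input string, as in the shortened example above, would make `std::stoul` throw, so the full dump has to be pasted in first.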

I guess this means I will need to run it on an ESP directly and see how it performs there. Might be memory related? I am using an ESP01 1MB with Hichi TTL - IR Lesekopf.

eNBeWe commented 4 weeks ago

Well, a memory issue would be plausible. I don't think the code is extremely well optimized. I have it running on an Olimex ESP32-PoE, so memory shouldn't normally be an issue. Do you have any other ESP boards to try?

gitolicious commented 4 weeks ago

Yes, plenty 🤓 Will run the code on NodeMCU, Wemos D1 and ESP32 variants tomorrow and see if it works with more memory.

gitolicious commented 3 weeks ago

I was able to replicate the issue "offline" now. (Yeah, most energy meters are not located in the hacker-friendliest places...)

I wired together two ESPs (UART TX -> UART RX) and sent the recorded hex values from my smart meter SML message, replicating the real SML receiver as closely as possible.

Serial log is showing:

Unhandled C++ exception: OOM

User exception (panic/abort/assert)
...
Unhandled C++ exception: OOM
...
last failed alloc call: 4020BBD4(196)
...
last failed alloc caller: 0x4020bbd4

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 3424, room 16
tail 0
chksum 0x2e
load 0x3fff20b8, len 40, room 8
tail 0
chksum 0x2b
csum 0x2b
v000754b0

So just as we expected - it looks like a memory issue.

Is there anything I can do on my end to dig deeper into the issue, or would a major improvement in memory handling within the SML component be the only way to get it running on ESP8266?

Is there a simple online tool to create SML messages? I could then try and reduce the size of the message to see "how far away" from a working solution I am.

Last resort would be to upgrade the ESP attached to my smart meter to an ESP32. This requires rework on the 3D printed case and wiring which I would like to avoid if possible.

passionsfrucht commented 3 weeks ago

Hey, thanks for taking this up!

Unfortunately I'm not able to pick this up right now, as I'm not in the vicinity of the reader and remote access is difficult at best. But if additional binary dumps of the streams are necessary, I'm happy to provide them at the end of the week.

As an additional data point regarding memory sizes: I'm using the esp32dev board designation, which only defines 320 kB of RAM, because the board is of some no-name el-cheapo origin. The board definitely has more RAM, so I can raise the limits manually, or use one of the better boards lying around to check whether such a simple swap helps.

Another note: The log is showing warnings regularly that the processing of the SML data takes too much time, e.g.,

[20:32:44][W][component:237]: Component sml took a long time for an operation (69 ms).
[20:32:44][W][component:238]: Components should block for at most 30 ms.

Which points to too much data again, IMO.

eNBeWe commented 3 weeks ago

@gitolicious Nice lab setup and nice find. Too bad that it is indeed memory related. I know of no online SML test generator; I built my test data manually. The SML specifications are actually not toooooo bad to read, so with some patience you could dissect the messages and strip them down.
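One thing to keep in mind when stripping a message down: an SML file ends with a CRC16, so a shortened file needs a fresh checksum or a receiver that verifies it will reject the file. Below is a minimal sketch of CRC-16/X-25, the variant commonly associated with SML. Which variant and byte order a particular meter uses is an assumption here and should be checked against a known-good frame first. The function name `crc16_x25` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CRC-16/X-25: reflected polynomial 0x8408, initial value 0xFFFF, final XOR 0xFFFF.
uint16_t crc16_x25(const uint8_t* data, std::size_t len) {
  assert(data != nullptr || len == 0);
  uint16_t crc = 0xFFFF;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    // Process one input byte bit by bit, least significant bit first.
    for (int bit = 0; bit < 8; ++bit) {
      if (crc & 1) {
        crc = static_cast<uint16_t>((crc >> 1) ^ 0x8408);
      } else {
        crc = static_cast<uint16_t>(crc >> 1);
      }
    }
  }
  return static_cast<uint16_t>(~crc);
}
```

A quick sanity check is the standard test string "123456789", for which CRC-16/X-25 is 0x906E. Running the function over a recorded frame (everything before the two CRC bytes) and comparing with the transmitted value shows whether the assumption holds for a given meter.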

@passionsfrucht The warning about the component taking too long is already "documented" in the corresponding issue. I even get these messages when I use my probe on my smart meter with very few messages (thanks to my energy provider, who gave me a seriously cut-down meter). Maybe this is more of a problem of the serial transmission. Since the UART port is running at 9600 baud, you can "only" transmit about 29 bytes of data (288 bits, at 10 bits per byte with 8N1 framing) before the component is flagged as "too slow". With additional computation overhead the 30 ms are over quickly.
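A quick back-of-the-envelope check of how much data fits into the 30 ms budget at 9600 baud, as a small C++ sketch. It assumes 8N1 framing, i.e. 10 bits on the wire per byte (1 start bit, 8 data bits, 1 stop bit). The function names are made up for this example.

```cpp
#include <cstdint>

// Whole bytes that fit on the wire within `window_ms` at `baud` bits per second.
constexpr uint32_t bytes_in_window(uint32_t baud, uint32_t window_ms, uint32_t bits_per_byte = 10) {
  return (baud * window_ms) / (1000 * bits_per_byte);
}

// Milliseconds needed to receive `n_bytes` at `baud`, rounded up.
constexpr uint32_t ms_to_receive(uint32_t n_bytes, uint32_t baud, uint32_t bits_per_byte = 10) {
  return (n_bytes * bits_per_byte * 1000 + baud - 1) / baud;
}

// 9600 bit/s * 30 ms = 288 bits, i.e. 28 whole bytes at 10 bits per byte.
static_assert(bytes_in_window(9600, 30) == 28, "28 whole bytes fit into 30 ms at 9600 baud");
```

By the same arithmetic, `ms_to_receive(280, 9600)` evaluates to 292, so receiving 280 bytes alone takes close to 300 ms at this baud rate.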

gitolicious commented 1 week ago

After finding an ESP32 C3 SuperMini in a drawer, I decided to replace my ESP01 with that. It just needed a minor change in my 3D-printed case and three short wires, and was definitely easier than hunting memory leaks in the SML library.

Btw: The (expected) component warning from ESPHome states that it takes 100-150ms for the SML lib to parse these long messages.

[W] [component:237] Component sml took a long time for an operation (105 ms).
[W] [component:238] Components should block for at most 30 ms.

What do you think: Should I leave this issue open for others to find it - or even someone brave enough to dig into the memory issues - or should I close it as at least for me everything works fine again after the hardware upgrade?

eNBeWe commented 1 week ago

It will be marked stale and auto-close anyway after some time.

But I guess someone should dig in there at some point ... Then again, SML is purely a German protocol and I guess the affected user base is kind of limited.