Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
Zigbee2MQTT service stops working - FATAL ERROR #14853

Closed pauloon closed 1 year ago

pauloon commented 1 year ago

What happened?

After the last update I've noticed that the Z2M service stops with a fatal error, out of the blue.

I'm using HASS.IO, fully updated, on an i5 machine with 8 GB of memory and a 128 GB SSD.

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

Just leave it running. After one or two days it stops working (service drops).

Zigbee2MQTT version

1.28.2 commit: unknown

Adapter firmware version

20220219

Adapter

SONOFF USB Dongle

Debug log

I did not have the debug log active when this happened. Posting normal log.

...
Zigbee2MQTT:info 2022-11-07 11:45:15: MQTT publish: topic 'zigbee2mqtt/Smart Plug 15', payload '{"child_lock":"UNLOCK","current":0.04,"energy":6.45,"indicator_mode":"off/on","last_seen":"2022-11-07T11:45:13-03:00","linkquality":102,"power":0,"power_outage_memory":"restore","state":"ON","update":{"state":"idle"},"update_available":false}'

<--- Last few GCs --->

[7:0x7fa21c7993c0] 61013533 ms: Mark-sweep 2044.2 (2085.3) -> 2042.2 (2085.3) MB, 2087.8 / 0.0 ms (average mu = 0.133, current mu = 0.010) allocation failure scavenge might not succeed
[7:0x7fa21c7993c0] 61015639 ms: Mark-sweep 2044.3 (2085.3) -> 2042.2 (2085.3) MB, 2082.8 / 0.0 ms (average mu = 0.074, current mu = 0.011) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
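
The GC lines above show the Node.js heap sitting at roughly 2 GB right before the crash. As a minimal sketch of how heap usage could be watched from inside any Node.js process, the snippet below uses the standard `process.memoryUsage()` and `v8.getHeapStatistics()` calls. The helper names `heapSnapshot` and `isNearHeapLimit` and the 0.9 threshold are assumptions made for illustration; none of this is part of Zigbee2MQTT.

```javascript
const v8 = require('v8');

// Hypothetical helper: current heap usage and the V8 heap limit, in MB.
function heapSnapshot() {
    const toMB = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
    const {heapUsed, heapTotal} = process.memoryUsage();
    const {heap_size_limit: limit} = v8.getHeapStatistics();
    return {usedMB: toMB(heapUsed), totalMB: toMB(heapTotal), limitMB: toMB(limit)};
}

// Hypothetical helper: true when usage exceeds the given fraction of the limit.
function isNearHeapLimit(snapshot, fraction = 0.9) {
    return snapshot.usedMB >= snapshot.limitMB * fraction;
}

const snap = heapSnapshot();
console.log(`heap used ${snap.usedMB} MB, limit ${snap.limitMB} MB`);
```

Logging such a snapshot periodically would show whether usage climbs steadily toward the limit or levels off.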

pauloon commented 1 year ago

> @pauloon it seems like I've resolved the issue here.
>
> The issue was that I had some power plugs with consumption statistics that were pushing these statistics every second. After changing this to once every 10 seconds, my Z2M no longer seems to crash and the memory is at a stable level.

@XanderTenBoden , Or maybe you reduced the traffic so it will take much longer to crash?

Like in the test PC that I've configured, where memory is still building up, only much slower? image

This PC only has one scene button configured, and a smart plug.

If I have one or two plugs reporting every 5 seconds, shouldn't this be bearable for Z2M?

That is very frustrating.

Rgds, Paulo.

XanderTenBoden commented 1 year ago

@pauloon I will watch what happens, but it seems there is no increase in memory consumption anymore. Not even a slight one. In fact, it went down by 40 MB over the last 2 hours.

I think what happened (at least here) is that the power consumption reports were too much to handle for my coordinator. So it builds up a queue in memory. That might also explain why my lights stopped working after a certain amount of time, because the queue was simply getting too long. Not 100% sure this is what happened, but it makes sense in my head.

It might also mean that in your case, maybe your network is too big for the type of coordinator you are using? But that's only a wild guess though.
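
The queue hypothesis above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `simulateBacklog` is a hypothetical helper, and the rates (8 messages per second arriving, 5 per second handled) are invented numbers, not measurements of any coordinator.

```javascript
// Simulates a backlog: `arrivalsPerSec` reports arrive each second and at most
// `handledPerSec` are processed. Returns the queue length after each second.
function simulateBacklog(arrivalsPerSec, handledPerSec, seconds) {
    const history = [];
    let queued = 0;
    for (let t = 0; t < seconds; t++) {
        queued += arrivalsPerSec;
        queued -= Math.min(queued, handledPerSec);
        history.push(queued);
    }
    return history;
}

// Reports arriving faster than the (hypothetical) capacity: backlog keeps growing.
console.log(simulateBacklog(8, 5, 5)); // [ 3, 6, 9, 12, 15 ]
// Same devices reporting ten times less often: backlog stays empty.
console.log(simulateBacklog(0.8, 5, 5)); // [ 0, 0, 0, 0, 0 ]
```

If the real system behaved like this, lowering the report rate below the handling capacity would stop the growth entirely, while a separate leak would still show up as a slow climb.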

dunkelz commented 1 year ago

@XanderTenBoden I have the same problem as you guys. This must be some regression, as it was working fine until 1.28.0-1 and has been running without issues since I reverted to that version.

pauloon commented 1 year ago

But is 50 devices too big?

Anyway, a brand new install with only 2 devices (yes, one is reporting power consumption) is still building up memory. Still looks like a bug to me... @Koenkk, could you take a look at this for us?

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon maybe you can try what happens if you disable the power consumption reporting? Could be that something has changed in the handling of that since the new release?

@dunkelz I didn't have problems before that release either. However downgrading to 1.28.x didn't resolve the problem for me either 🤷‍♂️

Koenkk commented 1 year ago

What are the models of the plugs you removed from the network? There is a memory leak somewhere (and this should be fixed)

XanderTenBoden commented 1 year ago

@Koenkk TS011F_plug_3

Koenkk commented 1 year ago

@XanderTenBoden

Could you check if the issue is fixed with the following external converter (energy measurement won't work anymore)

const fz = require('zigbee-herdsman-converters/converters/fromZigbee');
const tz = require('zigbee-herdsman-converters/converters/toZigbee');
const exposes = require('zigbee-herdsman-converters/lib/exposes');
const reporting = require('zigbee-herdsman-converters/lib/reporting');
const extend = require('zigbee-herdsman-converters/lib/extend');
const ota = require('zigbee-herdsman-converters/lib/ota');
const tuya = require('zigbee-herdsman-converters/lib/tuya');
const utils = require('zigbee-herdsman-converters/lib/utils');
const e = exposes.presets;
const ea = exposes.access;

const TS011Fplugs = ['_TZ3000_5f43h46b', '_TZ3000_cphmq0q7', '_TZ3000_dpo1ysak', '_TZ3000_ew3ldmgx', '_TZ3000_gjnozsaz',
    '_TZ3000_jvzvulen', '_TZ3000_mraovvmm', '_TZ3000_nfnmi125', '_TZ3000_ps3dmato', '_TZ3000_w0qqde0g', '_TZ3000_u5u4cakc',
    '_TZ3000_rdtixbnu', '_TZ3000_typdpbpg', '_TZ3000_kx0pris5', '_TZ3000_amdymr7l', '_TZ3000_z1pnpsdo', '_TZ3000_ksw8qtmt',
    '_TZ3000_1h2x4akh', '_TZ3000_9vo5icau', '_TZ3000_cehuw1lw', '_TZ3000_ko6v90pg', '_TZ3000_f1bapcit', '_TZ3000_cjrngdr3',
    '_TZ3000_zloso4jk', '_TZ3000_r6buo8ba', '_TZ3000_iksasdbv', '_TZ3000_idrffznf', '_TZ3000_okaz9tjs', '_TZ3210_q7oryllx',
    '_TZ3000_ss98ec5d', '_TZ3000_gznh2xla', '_TZ3000_hdopuwv6', '_TZ3000_gvn91tmx', '_TZ3000_dksbtrzs', '_TZ3000_b28wrpvx',
    '_TZ3000_aim0ztek', '_TZ3000_mlswgkc3', '_TZ3000_7dndcnnb', '_TZ3000_waho4jtj', '_TZ3000_nmsciidq', '_TZ3000_jtgxgmks',
    '_TZ3000_rdfh8cfs', '_TZ3000_yujkchbz', '_TZ3000_fgwhjm9j', '_TZ3000_qeuvnohg', '_TZ3000_rul9yxcc'];

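// Local converters that skip a message that was already processed, then defer to the stock converter.
const fzLocal = {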
    metering_skip_duplicate: {
        ...fz.metering,
        convert: (model, msg, publish, options, meta) => {
            if (utils.hasAlreadyProcessedMessage(msg, model)) return;
            return fz.metering.convert(model, msg, publish, options, meta);
        },
    },
    electrical_measurement_skip_duplicate: {
        ...fz.electrical_measurement,
        convert: (model, msg, publish, options, meta) => {
            if (utils.hasAlreadyProcessedMessage(msg, model)) return;
            return fz.electrical_measurement.convert(model, msg, publish, options, meta);
        },
    },
};

const definition = {
    fingerprint: [].concat(...TS011Fplugs.map((manufacturerName) => {
        return [160, 69, 68, 65, 64].map((applicationVersion) => {
            return {modelID: 'TS011F', manufacturerName, applicationVersion};
        });
    })),
    model: 'TS011F_plug_3',
    description: 'Smart plug (with power monitoring by polling)',
    vendor: 'TuYa',
    whiteLabel: [{vendor: 'VIKEFON', model: 'TS011F'}, {vendor: 'BlitzWolf', model: 'BW-SHP15'},
        {vendor: 'Avatto', model: 'MIUCOT10Z'}, {vendor: 'Neo', model: 'NAS-WR01B'}],
    ota: ota.zigbeeOTA,
    fromZigbee: [fz.on_off, fzLocal.electrical_measurement_skip_duplicate, fzLocal.metering_skip_duplicate, fz.ignore_basic_report,
        fz.tuya_switch_power_outage_memory, fz.ts011f_plug_indicator_mode, fz.ts011f_plug_child_mode],
    toZigbee: [tz.on_off, tz.tuya_switch_power_outage_memory, tz.ts011f_plug_indicator_mode, tz.ts011f_plug_child_mode],
    configure: async (device, coordinatorEndpoint, logger) => {
        await tuya.configureMagicPacket(device, coordinatorEndpoint, logger);
        const endpoint = device.getEndpoint(1);
        endpoint.saveClusterAttributeKeyValue('haElectricalMeasurement', {acCurrentDivisor: 1000, acCurrentMultiplier: 1});
        endpoint.saveClusterAttributeKeyValue('seMetering', {divisor: 100, multiplier: 1});
        device.save();
    },
    options: [exposes.options.measurement_poll_interval()],
    exposes: [e.switch(), e.power(), e.current(), e.voltage().withAccess(ea.STATE),
        e.energy(), exposes.enum('power_outage_memory', ea.ALL, ['on', 'off', 'restore'])
            .withDescription('Recover state after power outage'),
        exposes.enum('indicator_mode', ea.ALL, ['off', 'off/on', 'on/off', 'on'])
            .withDescription('Plug LED indicator mode'), e.child_lock()],
    // Measurement polling is disabled for this test, so energy measurement won't work.
    // onEvent: (type, data, device, options) =>
    //     tuya.onEventMeasurementPoll(type, data, device, options, true, device.applicationVersion === 160),
};

module.exports = definition;
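
The `*_skip_duplicate` wrappers in the converter above drop a message that was already processed before handing it to the stock converter. A minimal standalone sketch of that idea follows. It assumes duplicates can be recognised by a repeated sequence number per device; `isDuplicate`, `skipDuplicates`, and the message fields are hypothetical names and not the zigbee-herdsman-converters implementation.

```javascript
// Hypothetical stand-in for a duplicate check: remembers the last sequence
// number seen per device and reports whether a message repeats it.
const lastSeen = new Map();

function isDuplicate(deviceAddress, sequenceNumber) {
    if (lastSeen.get(deviceAddress) === sequenceNumber) return true;
    lastSeen.set(deviceAddress, sequenceNumber);
    return false;
}

// Wraps a converter function so duplicates are dropped before conversion.
function skipDuplicates(convert) {
    return (msg) => {
        if (isDuplicate(msg.deviceAddress, msg.sequenceNumber)) return undefined;
        return convert(msg);
    };
}

const toPower = skipDuplicates((msg) => ({power: msg.value}));
console.log(toPower({deviceAddress: '0x01', sequenceNumber: 7, value: 12})); // { power: 12 }
console.log(toPower({deviceAddress: '0x01', sequenceNumber: 7, value: 12})); // undefined
```
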

pauloon commented 1 year ago

Update: image

Always up! Hehehe And watchdog makes it:

image

XanderTenBoden commented 1 year ago

@Koenkk I will try this when I'm home after work and let you know!

pauloon commented 1 year ago

@Koenkk ,

See that the clean install that I have running on another computer is still slowly building up memory: image

It only has two devices: image

Thanks for helping, Paulo.

XanderTenBoden commented 1 year ago

> @Koenkk ,
>
> See that the clean install that I have running on another computer is still slowly building up memory: image
>
> It only has two devices: image
>
> Thanks for helping, Paulo.

Seems that it is indeed that power monitoring then, since you've got the exact same plug as I have mentioned earlier. Can you also try @Koenkk 's solution that he posted earlier?

pauloon commented 1 year ago

> Seems that it is indeed that power monitoring then, since you've got the exact same plug as I have mentioned earlier. Can you also try @Koenkk 's solution that he posted earlier?

Yours is a "plug 3" and mine is a "plug 1". Data is all different. Mine pushes energy monitoring through reporting, yours uses polling.

XanderTenBoden commented 1 year ago

@pauloon yes, the difference is the firmware of the plug, otherwise they are the same AFAIK. I don't think Koen's solution will make a difference in your case though, because his script only affects "plug 3" 🤔

XanderTenBoden commented 1 year ago

> @XanderTenBoden
>
> Could you check if the issue is fixed with the following external converter (energy measurement won't work anymore)

@Koenkk To be sure: do you mean the HA configuration.yaml, or the Z2M configuration.yaml? I guess the last one?

pauloon commented 1 year ago

> > @XanderTenBoden Could you check if the issue is fixed with the following external converter (energy measurement won't work anymore)
>
> @Koenkk To be sure: do you mean the HA configuration.yaml, or the Z2M configuration.yaml? I guess the last one?

Let me help you. It's the Z2M configuration.

And you put the ext_converter file inside the Z2M folder too. Hope it helps.

Koenkk commented 1 year ago

Z2M configuration.yaml indeed.

XanderTenBoden commented 1 year ago

@Koenkk About 15 minutes ago I changed the reporting of 8 of my plugs back to once a second, to confirm that it would indeed result in a big increase in memory consumption. If that happens again, I will load your file as suggested and see what happens then.

Koenkk commented 1 year ago

@XanderTenBoden ah, once a second can explain the issue: the polling happens faster than the network/device can handle, which causes the memory increase. Let me know if this fixes it, then I will push a fix.
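
One common way to keep a poller from outrunning a slow device is to skip a poll while the previous one is still pending. The sketch below illustrates that general technique only; it is not the fix that was later pushed, and `createGuardedPoller` and the fake 50 ms device are hypothetical.

```javascript
// Hypothetical poller: skips a tick when the previous poll has not finished,
// so pending requests cannot pile up if the device answers slowly.
function createGuardedPoller(pollFn) {
    let inFlight = false;
    let skipped = 0;
    return {
        async tick() {
            if (inFlight) {
                skipped++;
                return false; // previous poll still running, do nothing
            }
            inFlight = true;
            try {
                await pollFn();
            } finally {
                inFlight = false;
            }
            return true;
        },
        get skipped() {
            return skipped;
        },
    };
}

// Usage: a fake device that needs 50 ms to answer, polled twice in a row.
async function demo() {
    const slowDevice = () => new Promise((resolve) => setTimeout(resolve, 50));
    const poller = createGuardedPoller(slowDevice);
    const first = poller.tick(); // starts polling
    const second = await poller.tick(); // false: the first poll is still in flight
    await first;
    return {second, skipped: poller.skipped};
}

demo().then((result) => console.log(result)); // { second: false, skipped: 1 }
```

With a guard like this the number of outstanding requests per device stays at one, whatever the configured interval.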

XanderTenBoden commented 1 year ago

Even without the fix and with the plugs set to once every 10 seconds, there still seems to be a memory increase though. Just at a waaaay slower rate. This is what it looked like just before I changed them:

memory.png

XanderTenBoden commented 1 year ago

@Koenkk It results in an error when starting Z2M:

/app/dist/util/externally-loaded.js:13
    fingerprint: [].concat(...TS011Fplugs.map((manufacturerName) => {
                              ^
ReferenceError: TS011Fplugs is not defined
    at /app/dist/util/externally-loaded.js:13:31
    at Script.runInContext (node:vm:141:12)
    at Script.runInNewContext (node:vm:146:17)
    at Object.runInNewContext (node:vm:306:38)
    at loadModuleFromText (/app/lib/util/utils.ts:148:8)
    at loadModuleFromFile (/app/lib/util/utils.ts:155:12)
    at Object.getExternalConvertersDefinitions (/app/lib/util/utils.ts:165:25)
    at getExternalConvertersDefinitions.next (<anonymous>)
    at new ExternalConverters (/app/lib/extension/externalConverters.ts:12:20)
    at new Controller (/app/lib/controller.ts:84:58)

Koenkk commented 1 year ago

Updated https://github.com/Koenkk/zigbee2mqtt/issues/14853#issuecomment-1321196668

XanderTenBoden commented 1 year ago

> Updated #14853 (comment)

[20:38:14] INFO: Starting Zigbee2MQTT...
/app/dist/util/externally-loaded.js:33
    fromZigbee: [fz.on_off, fzLocal.electrical_measurement_skip_duplicate, fzLocal.metering_skip_duplicate, fz.ignore_basic_report,
                            ^
ReferenceError: fzLocal is not defined
    at /app/dist/util/externally-loaded.js:33:29
    at Script.runInContext (node:vm:141:12)
    at Script.runInNewContext (node:vm:146:17)
    at Object.runInNewContext (node:vm:306:38)
    at loadModuleFromText (/app/lib/util/utils.ts:148:8)
    at loadModuleFromFile (/app/lib/util/utils.ts:155:12)
    at Object.getExternalConvertersDefinitions (/app/lib/util/utils.ts:165:25)
    at getExternalConvertersDefinitions.next (<anonymous>)
    at new ExternalConverters (/app/lib/extension/externalConverters.ts:12:20)
    at new Controller (/app/lib/controller.ts:84:58)

pauloon commented 1 year ago

Just one comment.... I'm not sure this problem is connected to the "power consumption" because I noticed that in my test PC the power reporting of my plug is disabled, and memory keeps rising...

image

I guess that more data makes memory rise faster, so power consumption contributes, but is not directly connected to it.

pauloon commented 1 year ago

@Koenkk , Many times when I have these restarts due to the memory crash, some devices seem to "lose pairing" and need to be manually re-paired. Is this expected?

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@pauloon same here, mainly (battery powered) movement sensors stopped working after these crashes.

Koenkk commented 1 year ago

https://github.com/Koenkk/zigbee2mqtt/issues/14853#issuecomment-1321196668 should be good now.

> Many times when I have these restarts due to the memory crash, some devices seem to "lose pairing" and need to be manually re-paired. Is this expected?

Shouldn't be the case, but let's first fix the crash itself.

pauloon commented 1 year ago

@Koenkk , Did you have any success replicating the problem, so you can analyze it?

Please, let us know. I'm telling other friends to stand by, that this is being fixed. If you need help testing stuff, please tell me.

Thanks, Paulo.

Koenkk commented 1 year ago

@pauloon it doesn't happen in my setup, the only ones impacted seem to be you two (I haven't gotten any other reports)

pauloon commented 1 year ago

> @pauloon it doesn't happen in my setup, the only ones impacted seem to be you two (I haven't gotten any other reports)

But do you use specifically HassOS with the Z2M AddOn?

I participate in a group about HA with more than 2,000 people, and most of the ones that use HassOS and the Z2M Add-On are having this exact same problem. They hadn't noticed before because they are not very "technical", only have a few devices (so it takes a long time to crash) and use watchdog, so it restarts automatically. Many people just restart HA and life goes on until it crashes again a long time later. They are also not familiar with this GitHub and don't know where to look to evaluate this further. The ones that are most impacted are switching to ZHA due to this.

It could be an individual problem indeed, but from the moment I got another PC here at my home, installed a brand new HassOS and Z2M from zero, without any customization, and the problem also happens, it is pretty much an indication of a bug, correct?

If you would like to take a closer look at this install that I've set up, I can provide remote access so you can investigate. But, please, do not let this go. I really like Z2M and would love to have it working fine.

Please, let me know.

Thanks, Paulo.

pauloon commented 1 year ago

This is how it is affecting my HA use:

image

Koenkk commented 1 year ago

> But do you use specifically HassOS with the Z2M AddOn?

I use Z2M in docker since I run an unsupervised HA.

I'm wondering if maybe the logging causes this buildup. Can you try to set the log_level to error and see if it takes longer before the crash happens?

pauloon commented 1 year ago

> > But do you use specifically HassOS with the Z2M AddOn?
>
> I use Z2M in docker since I run an unsupervised HA.
>
> I'm wondering if maybe the logging causes this buildup. Can you try to set the log_level to error and see if it takes longer before the crash happens?

Just did that. Do I keep the zigbee_herdsman_debug on?

Thanks, Paulo.

XanderTenBoden commented 1 year ago

@Koenkk @pauloon sorry for the silence, I've had some very busy days with work and didn't have time to come back to this issue earlier.

Until now, I was still running Z2M without the changes @Koenkk provided, and with the plugs set to poll every 10 seconds instead of every second. This stopped Z2M from crashing for me. However, there is still an increase in RAM usage going on, just at a much slower pace:

Last 2 days RAM history

I have just added the changes @Koenkk provided and rebooted Z2M, which now seems to work without errors. I also changed a couple of plugs back to polling once a second. I will keep you guys updated about what happens now :-)

pauloon commented 1 year ago

> > But do you use specifically HassOS with the Z2M AddOn?
>
> I use Z2M in docker since I run an unsupervised HA.
>
> I'm wondering if maybe the logging causes this buildup. Can you try to set the log_level to error and see if it takes longer before the crash happens?

I changed the log and restarted it, but it looks like it did not change anything: image

Still "eating" a bunch of memory very fast...

Please, let me know. Paulo.

XanderTenBoden commented 1 year ago

@Koenkk I think it's safe to assume that your change indeed solves the issue. I rebooted Z2M yesterday when I posted my previous comment, and the RAM build-up is no longer happening now:

RAM usage graph 25-11

It also appears to be way less "spikey" now for some reason.

pauloon commented 1 year ago

Update, after turning off logs:

image

XanderTenBoden commented 1 year ago

7 hours later, the line is still just as horizontal. So I'm certain that you're heading in the right direction @Koenkk :-)

Koenkk commented 1 year ago

@XanderTenBoden I've pushed the fix, check if electrical measurements work and if there is no memory buildup with the latest dev.

Changes will be available in the dev branch in a few hours from now. (https://www.zigbee2mqtt.io/advanced/more/switch-to-dev-branch.html)

@pauloon there are more TuYa devices using this polling method, I've applied the fix for all now (maybe you have more TuYa devices using this, the converter I provided only fixes it for TS011F_plug_3)

pauloon commented 1 year ago

Dear @Koenkk ,

Super! If I install this, does it work the same as the link you provided? image I'm not familiar with Linux, so I'm not very comfortable using terminal commands.

Thanks, Paulo.

Koenkk commented 1 year ago

Yes that is the correct addon

pauloon commented 1 year ago

> Yes that is the correct addon

Dear @Koenkk ,

It looks like it was a success! image This is so great.

Question: even when installed as an add-on, does the Edge version not get updated automatically?

Thanks, Paulo.

pauloon commented 1 year ago

This is the other test PC:

image

XanderTenBoden commented 1 year ago

> @XanderTenBoden I've pushed the fix, check if electrical measurements work and if there is no memory buildup with the latest dev.
>
> Changes will be available in the dev branch in a few hours from now. (https://www.zigbee2mqtt.io/advanced/more/switch-to-dev-branch.html)

How do I switch between these 2 add-ons (normal vs edge) without having to set up all my devices again? Does this just work by installing the additional one, disabling the normal one and enabling the edge one?

pauloon commented 1 year ago

@XanderTenBoden ,

Yes, that works. That's how I did it here. But don't forget to also disable "Start on boot" for the normal one.

At first you get some "Bad gateway" errors when trying to access it, but after a few SHIFT + F5 or CTRL + F5 refreshes to clear the cache, it worked.

Please let us know. Paulo.

XanderTenBoden commented 1 year ago

@pauloon @Koenkk I've just switched to the edge repo. I will let you know what happens next :-)

Koenkk commented 1 year ago

Awesome, this fix will be included in the 1 December release.

pauloon commented 1 year ago

@Koenkk ...

Can you confirm if the "Edge" version is also updated automatically? Or just the regular one?

XanderTenBoden commented 1 year ago

@Koenkk it seems that the issue has been resolved. I've just checked RAM usage again and it has been more or less stable for the last 8 hours (it shows only a very slight increase of 20-25 MB RAM usage with 8 plugs polling every second).
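
For a rough sense of scale, the growth figure above can be turned into a time-to-limit estimate. This is only a back-of-the-envelope sketch: it assumes linear growth, takes the roughly 2044 MB heap size from the crash log in the first post as the ceiling, and uses a hypothetical 300 MB starting point. `hoursUntilLimit` is a made-up helper.

```javascript
// Hours until `limitMB` is reached, assuming memory grows linearly.
function hoursUntilLimit(currentMB, growthMBPerHour, limitMB) {
    if (growthMBPerHour <= 0) return Infinity; // flat or shrinking: never reached
    return Math.max(0, (limitMB - currentMB) / growthMBPerHour);
}

const growth = 25 / 8; // upper end of the reported range: 25 MB over 8 hours
console.log(hoursUntilLimit(300, growth, 2044).toFixed(0)); // "558" hours, about 23 days
```

Under those assumptions the remaining growth would take weeks rather than a day or two to matter, and it may simply level off.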

Koenkk commented 1 year ago

@pauloon edge does not update automatically, it is not versioned so you need to uninstall -> install to update.

Great that this has been solved!