Closed habitats-tech closed 1 year ago
We have many users running on Proxmox and in the same instance without reporting any issues like this. There's also nothing special about the way Frigate or the integration interacts with MQTT. Do you have any other services interacting with MQTT at the same time?
In any case, without logs it will be quite difficult to know which component actually has an issue or how to begin trying to solve it. If other proxmox users have any input that would be helpful as well 👍
Thanks for the quick response. The issue is that there is nothing in the logs except normal startup messages. Letting the VM sit idle (no activity) will eventually crash Proxmox. I will try to dig deeper and update, since there are no other reported issues. The Proxmox/Linux kernel does not report any errors either.
MQTT dedicated to Frigate. I have created a new VM with only MQTT and Frigate installed. Will test throughout this week and will hopefully identify the culprit.
I have been able to get closer to the issue. The issue is between MQTT broker (Mosquitto) and Frigate add-on. How I know this:
When this VM is active, roughly every 3 hours it freezes the entire Proxmox system (I need to power cycle to get the system back up). Nothing is recorded in the Proxmox, HA or Frigate logs that could point to the cause of the catastrophic failure.
The Proxmox system is on ZFS and the ZFS pool is clean.
There are two things I have identified. Any HAOS VM which runs both the Mosquitto and Frigate add-ons freezes roughly every 3 hrs.
Can you offer any pointers as to why this behaviour occurs? The HW is based on an AMD HX5900 and this is the only issue I have ever encountered on this test system.
I am carrying on with digging deeper, but any pointers are welcome.
I confirm deleting the MQTT broker (Mosquitto) from the HAOS instance where the Frigate add-on is installed fixes the issue (no Proxmox freezes). I will carry on testing and update.
I now have (working with no issues for the last 3.5 hrs):
Any attempt to have Mosquitto and Frigate add-ons on the same HAOS instance takes Proxmox down after about 3 hrs.
I have no idea how that would happen or what would lead to that. I'd be curious whether it happens with Frigate and Mosquitto just running in Docker, or in a Debian VM with Docker.
Like I said previously, there are lots of Proxmox and HA OS users and I haven't heard of this before. Without any information it's entirely guesswork as to why such a thing would happen.
I have created a Debian LXC running MQTT/Mosquitto and Docker/Frigate; the system (Proxmox) crashes. I am going to long-term test another scenario with two LXCs, one running MQTT/Mosquitto and the other Docker/Frigate, which seemed promising (it ran for 4 hours with no issues) until I created the combined one, which crashed the system within half an hour. I am not certain which of the two, or possibly both, is the culprit, so I need to test one at a time. Will update as soon as I have further info.
Any installation method of Mosquitto & Frigate results in eventual Proxmox crash. I think it is a system specific issue, which logs do not seem to capture.
Yeah, we have lots of Proxmox users and this is the first time this has been reported. It seems there must be some other factor that is unique.
Just providing an update. There is no issue with Mosquitto per se. The Proxmox freeze arises when Frigate is talking to Mosquitto. It does not matter if the two are running on the same CT/VM or a different CT/VM. I am now testing what happens if the two are on a different physical machine.
I had a Debian Frigate CT talking to a Windows Mosquitto broker on two different machines. Frigate crashed the Proxmox node it was on. So I am now confident the issue is with Frigate. What causes such behaviour is unknown, but I am now working with others who experience similar problems with Intel NUCs and other software.
I wonder if any Proxmox Frigate installation is running under an AMD platform.
I have now completed all testing on AMD 4900H architecture. A Proxmox freeze is a certainty only when Frigate (running under Debian 11 CT/VM or HAOS 8.4/2022.8+) is communicating with any Mosquitto MQTT broker (Debian 11 CT/VM, Windows 10/11, HAOS 8.4/2022.8+).
Frigate can coexist with the Mosquitto MQTT broker in the same OS instance, assuming it is not trying to communicate with the broker.
I guess something is causing a kernel panic which cannot even be captured through the logs. I am not certain if you can check the code for some kind of memory leak that builds over time, although sometimes (infrequently) the Proxmox freeze happens as soon as Frigate tries to communicate with a broker. Usually the freeze takes place a couple of hours after the start of communication between Frigate and Mosquitto.
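Since the logs capture nothing, one way to at least pin down when each freeze happens is a heartbeat file that is flushed to disk every few seconds. Below is a minimal Python sketch of that idea, not something taken from Frigate or Proxmox. The function names, the path `/var/tmp/heartbeat.txt` and the 5-second interval are all assumptions made for illustration.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, now=None):
    """Append a UTC timestamp to `path` and force it to disk."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to write through its cache
    return stamp


def last_heartbeat(path):
    """Return the newest parseable timestamp, skipping a torn final line."""
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    for line in reversed(lines):
        try:
            return datetime.fromisoformat(line)
        except ValueError:
            continue  # partially written line from the moment of the freeze
    return None


def run(path="/var/tmp/heartbeat.txt", interval=5.0):
    """Hypothetical main loop: one heartbeat every `interval` seconds."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

After a power cycle, `last_heartbeat(path)` gives the last moment the machine was known to be alive, to within the chosen interval, which can then be compared against camera or MQTT activity.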
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Issue persists (Sep 2022) but unfortunately no solution found so far. The same freezing issue is experienced with any NVR running under Proxmox, either as KVM or LXC. Tested NVRs are: AgentDVR, Frigate, ZoneMinder.
The problem seems familiar to me, but unfortunately I haven't been able to find a solution yet. At least now I know what the problem is.
I think the problem even exists on the Raspberry Pi, because I recently switched from a Raspberry Pi to an Intel NUC and my Home Assistant froze at irregular intervals, so that only a restart using the power button helped.
The problems were worst when everything was installed directly in Home Assistant, on both the Raspberry Pi and the NUC. Now I'm running Frigate in an LXC container and the problem only occurs about every 2 days; before, it only worked for a few hours at a time.
If I were to look for a system then I would say that it often happens when a person is detected in a camera and reported via MQTT.
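A hunch like this can be checked roughly by comparing freeze times against detection times. The following is a minimal Python sketch under the assumption that both lists of timestamps have already been collected by hand or from saved MQTT messages. The function name `gap_to_last_event` is made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def gap_to_last_event(freeze_times, event_times):
    """For each freeze, return the seconds elapsed since the most recent
    event at or before it, or None if no event preceded the freeze."""
    events = sorted(event_times)
    gaps = []
    for freeze in freeze_times:
        i = bisect_right(events, freeze)  # number of events <= freeze
        gaps.append((freeze - events[i - 1]).total_seconds() if i else None)
    return gaps
```

If most gaps come out at a few seconds, that would be consistent with freezes following a detection. Gaps spread over minutes or hours would argue against it.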
I seem to be encountering the same behavior running a Linux Mint VM on an AMD 2920x with GPU passthrough. It had been working relatively well for a while, but recently, as soon as Docker containers start, the entire Proxmox host becomes 100% unresponsive. pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-6-pve)
I have been running with GPU passthrough with hardware-accelerated versions of various containers, so I was only able to stop and remove the Frigate container once the GPU was removed from the VM config. Now that they have been removed and won't start on boot, I am adding the GPU back into the mix to see if that affects anything. Such a strange situation. I'll update if I find any new information.
Looks like adding the GPU back in immediately causes a hard crash of the Proxmox host. At this point I've got a new PSU inbound just to completely rule that possibility out. It had been running fine for several weeks, so a bad PSU is the only thing I can think of that might cause this sudden change in behavior.
I've attached the same GPU to another fresh Linux Mint VM with no hard crash now. I was also able to determine that the original VM, even with Frigate and Wyze Bridge turned off, now crashes the system right after UEFI boot.
Any update on this? I am experiencing the same issue under almost identical circumstances: Debian CT (although 12, not 11), Proxmox, HAOS VM, Frigate (integration and add-on), MQTT, and a ZFS pool.
It seems I am experiencing the same. I migrated an HA install from bare metal to Proxmox. Since then it has either completely frozen Proxmox or output random kernel dumps in syslog. I've tried just about every mitigation, from microcode, ACPI, C-states and i915 GPU tweaks to disabling EEE, before finding this thread. Uninstalling Frigate from HA makes the issue disappear.
Proxmox, ZFS (raid1), HA with MQTT and Frigate on an Intel N100 mini-pc. No USB or GPU passthrough.
Edit: also did a 24-hour memtest run, no errors reported.
Examples of crashes with dumps:
`Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.560605] BUG: unable to handle page fault for address: ffffffffbff8a2a0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.567515] #PF: supervisor instruction fetch in kernel mode
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.573198] #PF: error_code(0x0010) - not-present page
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.578355] PGD 2be039067 P4D 2be039067 PUD 2be03a063 PMD 0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.584032] Oops: 0010 [#1] PREEMPT SMP NOPTI
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.588411] CPU: 2 PID: 1859 Comm: vhost-1835 Tainted: P U O 6.5.11-8-pve #1
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.596687] Hardware name: Default string Default string/Default string, BIOS 5.27 09/28/2023
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.605231] RIP: 0010:0xffffffffbff8a2a0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.609217] Code: Unable to access opcode bytes at 0xffffffffbff8a276.
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.615757] RSP: 0018:ffffab5b47257cf0 EFLAGS: 00010282
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.621006] RAX: ffff972e946394b0 RBX: ffff972e946300c0 RCX: 0000000000000000
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.628154] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff972e946394b0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.635307] RBP: ffffab5b47257e68 R08: 0000000000000000 R09: 0000000000000000
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.642461] R10: 0000000000000000 R11: 0000000000000000 R12: ffff972e946300d0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.649608] R13: 0000000000000000 R14: ffff972e94630000 R15: 0000000000000000
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.656762] FS: 00007fef9815c4c0(0000) GS:ffff97359fb00000(0000) knlGS:0000000000000000
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.664869] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.670641] CR2: ffffffffbff8a276 CR3: 000000011fa98000 CR4: 0000000000752ee0
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.677789] PKRU: 55555554
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.680521] Call Trace:
Feb 11 14:19:09 10.88.89.252 kernel: [ 3613.682993]
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.462323] general protection fault, probably for non-canonical address 0xffff3993e44e4ab8: 0000 [#1] PREEMPT SMP NOPTI
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.473226] CPU: 2 PID: 1402 Comm: vhost-1377 Tainted: P U O 6.5.11-8-pve #1
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.481508] Hardware name: Default string Default string/Default string, BIOS 5.27 09/28/2023
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.490051] RIP: 0010:vhost_tx_batch.constprop.0+0x93/0x260 [vhost_net]
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.496699] Code: 47 20 48 8b 80 88 00 00 00 ff d0 0f 1f 00 85 c0 78 61 8b 8b bc 49 00 00 85 c9 75 39 c7 83 c0 49 00 00 00 00 00 00 48 8b 45 e0 <65> 48 2b 04 25 28 00 00 00 0f 85 ab 01 00 00 48 83 c4 20 5b 41 5c
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.515470] RSP: 0018:ffffb9b887fdbcf0 EFLAGS: 00010246
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.520716] RAX: b3cc1c041d2aa200 RBX: ffff9cc6449e4ab8 RCX: 0000000000000000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.527866] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.535019] RBP: ffffb9b887fdbd28 R08: 0000000000000000 R09: 0000000000000000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.542175] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cc6449e0000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.549330] R13: 0000000000000280 R14: ffff9cc64c369b40 R15: ffff9cc6449e0000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.556489] FS: 00007ff3a33274c0(0000) GS:ffff9ccd9fb00000(0000) knlGS:0000000000000000
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.564598] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.570364] CR2: 00007fa8309ba2c8 CR3: 000000010b620000 CR4: 0000000000752ee0
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.577516] PKRU: 55555554
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.580253] Call Trace:
Feb 10 10:57:30 10.88.89.252 kernel: [ 3139.582724]
Feb 9 17:37:31 10.88.89.252 kernel: [77589.722381] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Feb 9 17:37:31 10.88.89.252 kernel: [77589.728523] rcu: 0-...0: (18 ticks this GP) idle=04f4/1/0x4000000000000000 softirq=1170560/1170560 fqs=3021
Feb 9 17:37:31 10.88.89.252 kernel: [77589.738370] rcu: hardirqs softirqs csw/system
Feb 9 17:37:31 10.88.89.252 kernel: [77589.743964] rcu: number: 0 0 0
Feb 9 17:37:31 10.88.89.252 kernel: [77589.749562] rcu: cputime: 0 0 0 ==> 30020(ms)
Feb 9 17:37:31 10.88.89.252 kernel: [77589.756544] rcu: 2-...0: (25 ticks this GP) idle=9984/1/0x4000000000000000 softirq=1163170/1163171 fqs=3022
Feb 9 17:37:31 10.88.89.252 kernel: [77589.766385] rcu: hardirqs softirqs csw/system
Feb 9 17:37:31 10.88.89.252 kernel: [77589.771982] rcu: number: 0 0 0
Feb 9 17:37:31 10.88.89.252 kernel: [77589.777575] rcu: cputime: 0 0 0 ==> 30020(ms)
Feb 9 17:37:31 10.88.89.252 kernel: [77589.784557] rcu: (detected by 3, t=15009 jiffies, g=1935437, q=12163 ncpus=4)
Feb 9 17:37:31 10.88.89.252 kernel: [77589.791801] Sending NMI from CPU 3 to CPUs 0:
Feb 9 17:37:31 10.88.89.252 kernel: [77598.441949] watchdog: Watchdog detected hard LOCKUP on cpu 1
Feb 9 17:37:31 10.88.89.252 kernel: [77598.441951] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables 8021q garp mrp softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink intel_rapl_msr soundwire_cadence intel_rapl_common snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_hdmi snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core x86_pkg_temp_thermal intel_powerclamp snd_soc_acpi_intel_match coretemp snd_soc_acpi soundwire_generic_allocation kvm_intel soundwire_bus snd_soc_core kvm snd_compress ac97_bus snd_pcm_dmaengine irqbypass crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg snd_intel_sdw_acpi aesni_intel snd_hda_codec i915 crypto_simd snd_hda_core cryptd snd_hwdep cmdlinepart snd_pcm drm_buddy ttm mei_pxp mei_hdcp spi_nor drm_display_helper snd_timer cp210x ch341 cec rapl rc_core pcspkr
Feb 9 17:37:31 10.88.89.252 kernel: [77598.441991] intel_cstate snd mtd wmi_bmof mei_me drm_kms_helper soundcore usbserial i2c_algo_bit mei acpi_tad acpi_pad mac_hid vhost_net vhost vhost_iotlb tap efi_pstore drm dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb nvme i2c_i801 crc32_pclmul spi_intel_pci nvme_core spi_intel igc nvme_common i2c_smbus xhci_pci ahci xhci_pci_renesas libahci xhci_hcd video wmi
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442014] CPU: 1 PID: 16 Comm: rcu_preempt Tainted: P U O 6.5.11-8-pve #1
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442016] Hardware name: Default string Default string/Default string, BIOS 5.27 09/28/2023
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442017] RIP: 0010:native_queued_spin_lock_slowpath+0x7f/0x2d0
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442025] Code: 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 5f 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442026] RSP: 0018:ffffb910c0163da8 EFLAGS: 00000002
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442027] RAX: 0000000000000001 RBX: ffffffff84d647c0 RCX: 0000000000000000
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442028] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff84d647c0
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442029] RBP: ffffb910c0163dc8 R08: 0000000000000000 R09: 0000000000000000
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442030] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442031] R13: ffff99f5c0c499c0 R14: 0000000000000000 R15: 0000000000000000
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442032] FS: 0000000000000000(0000) GS:ffff99fd1fa80000(0000) knlGS:0000000000000000
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442034] CR2: 000055b4c8ec4d9c CR3: 00000004f9c34000 CR4: 0000000000752ee0
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442035] PKRU: 55555554
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442036] Call Trace:
Feb 9 17:37:31 10.88.89.252 kernel: [77598.442037]
`
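For anyone comparing dumps like the ones above, a minimal Python sketch that tallies the fault type, task name (`Comm:`) and faulting symbol (`RIP:`) may help spot a pattern. The labels in `SIGNATURES` and the function name `summarize` are my own. The patterns are taken only from the log text in this thread and will not cover other kernel messages.

```python
import re
from collections import Counter

# Fault signatures copied from the excerpts in this thread; labels are made up.
SIGNATURES = [
    ("page_fault", re.compile(r"BUG: unable to handle page fault")),
    ("general_protection_fault", re.compile(r"general protection fault")),
    ("rcu_stall", re.compile(r"rcu: INFO: rcu_preempt detected stalls")),
    ("hard_lockup", re.compile(r"watchdog: Watchdog detected hard LOCKUP")),
]

# Matches e.g. "CPU: 2 PID: 1859 Comm: vhost-1835 Tainted: ..."
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# Matches e.g. "RIP: 0010:vhost_tx_batch.constprop.0+0x93/0x260 [vhost_net]"
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(\S+)")


def summarize(lines):
    """Return (fault counts, task-name counts, list of RIP locations)."""
    events, tasks, rips = Counter(), Counter(), []
    for line in lines:
        for label, pattern in SIGNATURES:
            if pattern.search(line):
                events[label] += 1
        match = COMM_RE.search(line)
        if match:
            tasks[match.group(3)] += 1
        match = RIP_RE.search(line)
        if match:
            rips.append(match.group(1))
    return events, tasks, rips
```

Feeding it a saved syslog, e.g. `summarize(open("syslog.txt"))` (hypothetical filename), returns the three tallies. In the excerpts above, two of the three dumps name a `vhost-*` task and one faults in `vhost_tx_batch` in `vhost_net`, which is the kind of repetition this is meant to surface. It does not by itself identify a cause.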
Seems like an N100 issue or some kind of RAM issue. Try reverting to kernel 5.x, or install Proxmox 7.x.
Thanks for the response, but as memtest ran successfully and the system is 100% stable without Frigate, I will just migrate to something else. Just leaving my experiences for future reference.
I may be having an issue similar to the ones reported in this thread.
I had been running Frigate on Proxmox on an older Dell XPS Desktop with an Intel i7-7700 with a Nvidia P2000 GPU passed through. This setup had been running for months just fine.
I just built a new Proxmox host with a Ryzen 3900x (left over from upgrading my main desktop) on an ASRock B550M Pro4 board with the same P2000 GPU.
Ever since putting the new system together and setting up my Frigate VM it's been crashing/freezing at random intervals with nothing incriminating in the Proxmox log. Sometimes it runs for days and other times just hours. The VM running Frigate is Rocky Linux 9 with the Docker CE repo added and the Nvidia drivers installed. I brought the same VM over from the old setup.
Anyway I'm going to do some additional troubleshooting and see if I can get a stable setup. Unfortunately I can't run the old system in parallel as I had to gut most of the good parts from it to make the new one.
Has anybody found a solution? I think it's some sort of bug with video streaming in Proxmox with AMD processors. I am not using Frigate, but I have an Ubuntu VM running a Tvheadend server on Proxmox 8.1 with an AMD FX-8320E. Whenever I'm streaming from that VM, after 10/15/20 minutes the system freezes and there is no access to the VM or Proxmox.
Describe the problem you are having
I have been trying to get to the bottom of an issue where Proxmox randomly crashes. Following several days of troubleshooting, I have come to the realisation that if you run the Frigate HA add-on + Frigate HACS Integration + Frigate HA Integration on the same HAOS instance, after a while the entire Proxmox server becomes unresponsive and the only way out is to physically power cycle the machine Proxmox is installed on. I provide version numbers below, but to me it looks like some kind of kernel panic.
It seems I have overcome this issue by splitting Frigate across two HA instances. One instance runs MQTT and the Frigate add-on; the other HA instance runs Frigate Proxy, the Frigate HACS integration + card and the Frigate HA integration. I have yet to try on a pure Debian installation; however, I am confident this will also work.
To me it seems to be a fatal conflict between MQTT, the Frigate HA add-on, the Frigate HACS integration and the Frigate HA integration running on the same HA instance, bringing the Proxmox server down (no error messages are logged in Proxmox or HAOS).
I am testing on a simple Frigate installation using just one camera stream.
Everything below is on latest production (non-beta) versions as of 15 Aug 2022.
Frigate System: HW: AMD HX5900 - allocated to HAOS/Frigate VM: 8-cores/8GB RAM/64GB SSD DSK
Latest HAOS with following add-ons (all latest production versions):
Proxmox 7.2-7 no-subscription, up to date with latest packages (15 Aug 2022). I have yet to test on Proxmox with subscription. HAOS 8.4
HACS
I have tried to troubleshoot and it seems MQTT interacting with Frigate HA add-on and Frigate HACS integration is the culprit. I am still testing as we speak, and will update as soon as I have valid input to provide. Having split the add-on from the integration seems to have resolved the issue (no problem for several hours), but the acid test will be tomorrow; if no crash then the issue is in the interaction between the three components mentioned earlier (MQTT, add-on, HACS).
Version
DEBUG 0.10.1-83481AF
Frigate config file
Relevant log output
FFprobe output from your camera
Frigate stats
Operating system
Proxmox
Install method
HassOS Addon
Coral version
CPU (no coral)
Network connection
Wired
Camera make and model
Sonoff
Any other information that may be helpful
All logs Proxmox, HA, Frigate are clean. No errors.