blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License

Frigate continue working after detector not available #2904

Open bartekd123 opened 2 years ago

bartekd123 commented 2 years ago

I want Frigate to be able to continue working (for RTMP stream, loop recording, viewable in HomeAssistant, etc) if it is not able to find a detector

Is it possible to have the detector continue working if it is unable to find a CPU or Coral detector?

I use it on a synology, and things work ok for a while, but the USB sometimes get disconnected, and I have to manually go into virtual machine manager to re-connect it. Until I do that, since Frigate is unable to find a detector, it constantly restarts.

Also on reboots, I need to reconnect the USB within the VM Manager.

NickM-27 commented 2 years ago

It can work with detect disabled, but I'm not sure it makes sense to have it silently keep working if the TPU is unplugged, or at least not by default. The opposite of this is "frigate is recording but not detecting anything and I don't know why".
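(For context, this is roughly what running with detection disabled looks like in the config; a minimal fragment, with the camera name assumed. Recording and streaming keep working, only object detection is off:)

```yaml
cameras:
  front_door:        # camera name is an assumption for illustration
    detect:
      enabled: false # recording/streaming continue; no object detection
```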

bartekd123 commented 2 years ago

OK, thanks. I suppose if detect is enabled, then it will fail if the USB is no longer connected.

Is there any way to prevent it from restarting constantly if it can't connect to the TPU?

I can add CPU as a backup detector. Is there any way to make it only be used when frames can't be processed by the Coral TPU?
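(For context, multiple detectors can already be defined in the config; a minimal sketch based on the documented syntax. As I understand it, though, Frigate runs all listed detectors in parallel and load-balances across them; there is no primary/backup ordering, and a missing Coral still aborts startup:)

```yaml
detectors:
  coral:
    type: edgetpu
    device: usb   # the USB Coral passed through to the VM
  cpu1:
    type: cpu     # shares the detection load; not a fallback
```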

NickM-27 commented 2 years ago

> OK, thanks. I suppose if detect is enabled, then it will fail if the USB is no longer connected.
>
> Is there any way to prevent it from restarting constantly if it can't connect to the TPU?
>
> I can add CPU as a backup detector. Is there any way to make it only be used when frames can't be processed by the Coral TPU?

It's interesting, because I just tested it and on Unraid unplugging the USB shut the docker container down and it didn't restart. Is it possible there is a setting on your host that could disable the automatic restarting? Otherwise I agree that might be preferred, or at least configurable, behavior.

bartekd123 commented 2 years ago

It does constantly restart if it can't connect to the TPU. Here is my sequence of events.

I am not actually stopping the container.

What I would want is an option to leave the container online if the TPU is unavailable or no longer connected. I'm not sure if it makes sense, is worth the effort, or is even possible.

NickM-27 commented 2 years ago

Perhaps I wasn't clear, what I'm saying is in my case if the TPU goes offline the container automatically stops and does not keep restarting so I'm wondering why there is a difference in behavior between our systems.

bartekd123 commented 2 years ago

Ah, maybe it is stopping, and I have the restart policy on my container set to "unless-stopped". So Docker is starting it back up since I didn't stop it myself; it crashes, Frigate exits from within the container, and the cycle repeats.

Here is the log of what I see happening:

[2022-03-05 22:05:58] frigate.edgetpu                ERROR   : No EdgeTPU was detected. If you do not have a Coral device yet, you must configure CPU detectors.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/frigate/frigate/edgetpu.py", line 136, in run_detector
    object_detector = LocalObjectDetector(
  File "/opt/frigate/frigate/edgetpu.py", line 44, in __init__
    edge_tpu_delegate = load_delegate("libedgetpu.so.1.0", device_config)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1.0
[2022-03-05 22:06:02] frigate.watchdog               INFO    : Detection appears to have stopped. Exiting frigate...

Then it starts back up again because of my restart policy
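(A possible host-side workaround sketch, assuming the container is named `frigate` and that Frigate exits with a nonzero code on this failure: switch the restart policy from `unless-stopped` to a capped `on-failure` policy so Docker gives up after a few attempts instead of looping forever:)

```sh
# Retry at most 5 times after a failing exit, then leave the container
# stopped; "unless-stopped" restarts it indefinitely.
docker update --restart=on-failure:5 frigate
```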

blakeblackshear commented 2 years ago

The problem here is that falling back to CPU is a bad idea for many people, because they simply don't have the CPU resources to run detection and frigate will consume all available resources. There are some situations where this exit and restart of the container will pick up the Coral on the next startup. Frigate is not aware of the number of times it has exited and restarted due to this error. When do you stop trying to detect the Coral and instead fall back to CPU or disable detection?

bartekd123 commented 2 years ago

Yeah, I hear that. It's a bit tough with it not knowing whether it has been restarted. Would it make sense to not stop the entire container when it notices the Coral is not available, and instead restart the service within the container, say, 5 times, and if that still doesn't resolve it, turn off detection?

Or maybe provide a boolean option in the config so that if it doesn't detect a Coral at startup, it just turns off detection (or falls back to CPU). That way you would have to set it to true in order for it to act that way.
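(Purely hypothetical sketch of such an option; `required` is an invented name and does not exist in Frigate today:)

```yaml
detectors:
  coral:
    type: edgetpu
    device: usb
    required: false  # hypothetical: if the Coral is missing at startup,
                     # disable detection (or fall back) instead of exiting
```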

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kentkravitz commented 1 year ago

I feel like this should be fixed. If a TPU fails, the ability to automatically fall back to CPU, or to no detection at all, seems important. If I'm relying on camera footage being recorded, I don't want to find my Frigate instance rebooting for a whole day because a piece of hardware dangling off my machine crapped out.

Of course, some may want Frigate to fail without a TPU since detection would spike their CPU, but again I think giving the user the choice is the best of both worlds here.

Some quick feedback in the interface showing that detection has fallen back to CPU would be icing on the cake.

Whytey commented 1 year ago

@blakeblackshear, is it possible to reconsider the behaviour here and reopen the bug?

I would like to see Frigate be able to continue to function even if the Coral device disappears in some way.

Overnight, my neighbour's house was broken into. He reached out to me to check my recordings, but unfortunately Frigate had crashed yesterday evening because my Coral device locked up (not Frigate's problem, AFAIK). If this had only resulted in detection being disabled, I would still have had my 24x7 recordings to scrub through.

We could have an additional sensor for detector status exposed to Home Assistant, which could be leveraged in automations to raise notifications, etc.

(For reference, I run Frigate in docker inside an LXD/LXC container on Ubuntu, and it had been rock solid until my Coral locked up twice in the last month.)

blakeblackshear commented 1 year ago

With the changes coming in 0.13, we could consider storing some kind of record of the number of restarts due to a failed EdgeTPU, and falling back at some point. Previously there was no state to reference to know that it was in a reboot loop.
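(An illustrative sketch of what such a record could look like, not Frigate code; the state-file path and thresholds are assumptions:)

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("/config/.detector_failures.json")  # assumed location
MAX_FAILURES = 5          # allow a few restarts in case the Coral recovers
WINDOW_SECONDS = 15 * 60  # only count failures from the last 15 minutes

def record_failure_and_should_fallback() -> bool:
    """Record one failed detector startup; True once the loop limit is hit."""
    now = time.time()
    try:
        failures = json.loads(STATE_FILE.read_text())
    except (FileNotFoundError, ValueError):
        failures = []
    # Drop old entries so occasional hiccups never accumulate into a fallback.
    failures = [t for t in failures if now - t < WINDOW_SECONDS]
    failures.append(now)
    STATE_FILE.write_text(json.dumps(failures))
    return len(failures) >= MAX_FAILURES
```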

clearwave1 commented 9 months ago

Since you have closed my enhancement request (#8768), can you please provide an update here on when this will be addressed?

The constant restarting seems to be a defect, in my opinion. Even if the detector service stops working, my 7x24 recording could keep working without the TPU, but I lose that as well due to the restarting. I also cannot view the cameras while it is restarting.

clearwave1 commented 9 months ago

It would also be helpful for some sort of notification (e.g. email or text) to be sent when this error is discovered so we don't lose multiple hours of footage before knowing there is an issue.

clearwave1 commented 6 months ago

Please provide an update on when this will be fixed or enhanced.

This is still a significant issue for me and it happens regularly. We should be able to keep the cameras working (displaying and recording) even if the detector fails. We should also be able to provide a detector backup in case the primary detector fails.

The current state where the container just keeps trying to restart is not workable.

NickM-27 commented 6 months ago

It is not at all normal for a detector to fail as often as you are describing; I'd suggest looking into why that is occurring. There is no known timeline for when this will be implemented.

clearwave1 commented 6 months ago

I have a PCIe Coral TPU that Frigate stops seeing a few times per week, and I can only get it back by performing a hardware restart of the server (which runs many other applications). I have implemented a cron task to reboot the server every night at 3am, which has reduced the occurrences, but it still happens during the day once or twice per week.
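(For reference, the nightly-reboot workaround described above amounts to a root crontab entry along these lines:)

```sh
# m h dom mon dow  command: reboot every night at 3:00 AM
0 3 * * * /sbin/shutdown -r now
```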

It doesn't seem reasonable that the Frigate server restarts when the detector throws an exception. The camera display and recording functions should continue and a notification should be sent to the admin (me). The more flexible solution would be to allow a backup detector in the config.

NickM-27 commented 6 months ago

Right, and this feature is pinned and will be implemented at some point, but there hasn't been much need for this in general so it hasn't been requested by many users.

Backup detectors aren't easy because oftentimes different detectors don't have equivalent models that can be run so the detection would be confused by different scores and user filters would not apply the same way. Detect could be forcibly disabled and skipped but that is still suboptimal.

That behavior is a sign of issues and not something I've heard of many users seeing. I'd highly encourage looking further into why it is happening.

winstona commented 6 months ago

Adding a similar but slightly different use case/scenario for a CPU fallback option:

I have Frigate running in a Kubernetes cluster, usually on a node with a GPU. If that node dies and the pod/container is rescheduled to a different node, it won't have a GPU available and appears to hit a similar restart loop.

Having a fallback to CPU would make it possible to keep everything running on a different node until the pod can be rescheduled back onto a GPU node.
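(A scheduling sketch for this scenario, with an assumed node label `gpu-present`: the pod prefers GPU nodes but may still land on a CPU-only node, which is exactly where an in-Frigate CPU fallback would matter:)

```yaml
# Fragment of the pod spec; "gpu-present" is an assumed node label.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: gpu-present
              operator: In
              values: ["true"]
```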

winstona commented 6 months ago

I recently switched over from CPU to GPU detection and have run into this issue multiple times when the node fails, leaving Frigate in a restart loop for hours attempting to recover on a CPU-only node. (Node crashes/failures seem to be happening significantly more often since switching to GPU, but that could be for other reasons.)

I went ahead and took a stab at implementing something to handle CPU fallback, and so far it appears to be working well for my use case. It doesn't address all of the asks, but it's a start at least.

clearwave1 commented 6 months ago

This sounds interesting. Can you elaborate on what you have implemented? Is this a change in Frigate code or configuration or did you do something outside of Frigate (e.g. through Kubernetes)?

winstona commented 6 months ago

Sorry for not making it clearer, @clearwave1 - it is implemented with a few changes to the Frigate code, PR here: #10440. However, based on the feedback, it looks like it may need a few more changes to get worked into the upstream code, and it won't be available until the next major release.

I did initially start with attempting to shim in a workaround on startup, but it started to look like it would be just as much work as modifying the code, so I eventually opted for the code route.

clearwave1 commented 6 months ago

Thanks for clarifying. I had a look at the PR and this will definitely help my situation where my Coral TPU randomly stops working until I physically restart the server.

Hopefully it can be moved into at least a beta release soon. Thanks for your work on this.

distante commented 5 months ago

I would think that a reasonable fallback would be to just disable detection and show a warning on the Frigate page (or send an event to Home Assistant, for example).

This happened to me for the first time in more than a year of using the Coral TPU. The device is locked up and, of course, I am away from home for a week... so now I will not have any security footage, since Frigate keeps rebooting 😢

NickM-27 commented 5 months ago

That is probably a good first implementation until the details of a more complicated approach, one that allows detection to continue working, are sorted out.

clearwave1 commented 2 months ago

Can you please provide an update on when this will be fixed? This is a defect, not a feature request, and it needs to be fixed ASAP: it causes Frigate to go into a restart loop in which nothing is captured and the cameras cannot even be viewed.

Since the last comments above, Frigate has gone through multiple major releases. Please implement something (e.g. just disable the detector as suggested above and send an alert) to fix this soon.

NickM-27 commented 2 months ago

> Since the last comments above, Frigate has gone through multiple major releases

That is incorrect: the last comment was on March 30, and the last major release (0.13.0) was on January 30.

NickM-27 commented 2 months ago

Even the simpler approach mentioned above requires considerable planning and thought to make it work well. Frigate supports many different types of detectors (TPU, NPU, GPU, etc.) and all of these behave differently, have different forms of errors, etc. This means we need to understand each of these and how the detector would go in and out of an error state.

The next problem is communicating this to the user. Not only would there need to be a status item in the UI denoting that detection is blocked due to a detector error, but the detect MQTT and websocket switches would also need to be aware of this and refuse to enable detection even when the user requests it.

All of this requires planning and thought to work well and to avoid false positives that lock detection for incorrect reasons.
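(For reference, this is the per-camera detect switch in question; per the MQTT docs it can be toggled like this, with the camera name and broker host assumed:)

```sh
# Turn detection off for one camera via the documented topic
# frigate/<camera_name>/detect/set (payload ON/OFF).
mosquitto_pub -h mqtt-broker -t frigate/front_door/detect/set -m OFF
```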

clearwave1 commented 2 months ago

Thanks for your quick reply. I've been in software development for almost 40 years, so I appreciate the attention to detail. However, at times "analysis paralysis" creeps in and slows or stops progress; this is why the IT industry has embraced iterative development methodologies over the last decade or so.

In this case, I don't think you need to tackle all of the cases you mention above in the first iteration. Even just a notification that the service is not running properly would be an improvement. If you don't want to get too specific about the type of detector, a simple service restart counter would highlight that some problem is preventing full startup, even if it wasn't clear what the problem was. This would allow manual intervention and investigation in a timely manner (for anyone near the server).

For the user communication, a status item that HA (or others) could read even while the service is restarting would probably be enough. The health monitor seems to run the restart logic, so maybe that would be the logical place in the code, but you would know better than I.

Finally, I hope you would agree that the live view and recording services not running just because the detection service cannot run is a defect. I'm not sure whether this is a programming choice or a design limitation but hopefully it can be fixed soon.

Thanks for considering my points of view. :)

blakeblackshear commented 2 months ago

We have all been working in software engineering for decades too. Every user thinks their issue is the most critical deal breaker. With tens of thousands of users, we rarely hear that this specific issue is the show stopper you are making it out to be; they all have their own lists of show stoppers.

As a software developer with a long history in the industry, I am sure you can appreciate that software evolves over time and understanding the history is important. Frigate used to exclusively be an object detection processing pipeline. Without a detector, it did nothing. If I was starting over knowing everything I do now, different decisions would have been made earlier about the architecture.

Not saying this isn't a problem. Just that it isn't the most important problem when you consider feedback across the user base. It's always just a matter of prioritization.

There are already plenty of ways to be notified that Frigate isn't running properly via the Home Assistant integration, so all your short term fixes already exist.

I have Frigate systems that have been running for multiple years without a detector failing in a way that prevents Frigate from recovering on its own. This only applies to hardware failures of one very specific type.

clearwave1 commented 2 months ago

Thanks for your response and for the tip on the Home Assistant integration. I went hunting and found a Status entity that was hidden (it's disabled by default and not in the docs).

I enabled it and the current state is "running". Can you please provide what other state values could be expected in this entity?

If this has different states for running healthy, starting/restarting, down, etc., then you are correct that I can use these with an automation to at least be notified when there is an issue. It's not a fix but it is a helpful workaround for now.

Thanks.

blakeblackshear commented 2 months ago

When frigate is down, most of the sensors become unavailable, since frigate leverages an MQTT availability topic. When an individual camera is down, that camera's FPS will drop to 0.
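(The availability topic can also be watched directly on the broker; a sketch with an assumed broker host:)

```sh
# Frigate publishes "online"/"offline" on its availability topic.
mosquitto_sub -h mqtt-broker -t frigate/available -v
```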

B0ndo2 commented 1 month ago

Right, and this feature is pinned and will be implemented at some point, but there hasn't been much need for this in general so it hasn't been requested by many users.

Backup detectors aren't easy because oftentimes different detectors don't have equivalent models that can be run so the detection would be confused by different scores and user filters would not apply the same way. Detect could be forcibly disabled and skipped but that is still suboptimal.

That behavior is a sign of issues and not something I've heard of many users seeing. I'd highly encourage looking further into why it is happening.

I am adding my vote for this feature. TPUs seem to be unstable and can die without any apparent reason, and losing one function shouldn't take the whole system down. I opened a request today and it was closed as a duplicate of this.

B0ndo2 commented 1 month ago

> Thanks for your response and for the tip on the Home Assistant integration. I went hunting and found a Status entity that was hidden (it's disabled by default and not in the docs).
>
> I enabled it and the current state is "running". Can you please provide what other state values could be expected in this entity?
>
> If this has different states for running healthy, starting/restarting, down, etc., then you are correct that I can use these with an automation to at least be notified when there is an issue. It's not a fix but it is a helpful workaround for now.
>
> Thanks.

Can you share more about what you enabled?

clearwave1 commented 1 month ago

If you are running the Frigate HA integration, you will have a device for each camera and a device for Frigate itself. Click on the Frigate device and you should see a page with some entities under the Diagnostic heading. Under that heading there is also a link for "Entities not shown"; clicking it reveals additional entities that are disabled by default. One of them is "Status", which can be enabled by selecting it and then using the gear icon in the window that pops up. The Status entity seems to show "running" when things are good and "unavailable" when my TPU disappears, so I have an automation that notifies me when it becomes unavailable. It's not the correct solution, but it's better than losing many hours of video.
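(A sketch of the automation described above; the entity id is an assumption, so check the actual id in Home Assistant:)

```yaml
automation:
  - alias: "Notify when Frigate becomes unavailable"
    trigger:
      - platform: state
        entity_id: sensor.frigate_status  # assumed entity id
        to: "unavailable"
        for: "00:05:00"  # ignore brief blips during normal restarts
    action:
      - service: notify.notify
        data:
          message: "Frigate status is unavailable - check the Coral TPU."
```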

I still hope that the Frigate development team will fix this, even though not many have reported it. I agree that all of the services becoming completely unavailable (live video, capture, and detect) due to the TPU having an issue is a defect or a design limitation that should be fixed urgently, regardless of the number of users reporting it.