Seeing the plugin not recover when a 500 timeout occurs

groovybits commented 2 months ago

When a Druid SQL query issued from Grafana takes more than 5ms to return, the panel gets a 500 error and then stops working until I reload the whole webpage. I see other built-in Grafana plugins handling this kind of timeout and still working after the 500 error. Is something in the business-media plugin behaving differently? It seems to be triggered when other queries combine to slow Druid down so that the response takes longer than 5ms. Is there a way to increase that limit, or is it set inside Grafana?

groovybits commented 2 months ago

Here is a draft PR from a branch in my fork where I am trying to fix it. I have not tested it yet but am working on that. It builds at least, and I hope to test it today.

https://github.com/VolkovLabs/business-media/pull/144

mikhail-vl commented 2 months ago

@groovybits I discussed this issue with @vitPinchuk today and we were looking for a way to reproduce it. We have not seen such issues with WebSocket, MQTT (streaming), JSON, and SQL data sources so far.

We are not familiar with Druid, and it would help if you could provide a test case with a docker-compose configuration so we can reproduce it locally.

Have you tried increasing the retry and wait interval in the data source configuration to see if it solves the timeout issue?

All Grafana panels work with data sources the same way and should show errors when they receive them from the data source. An error disappears only after a dashboard refresh (manual or automatic), when the panel receives new data from the data source and re-renders, or with streaming, when data is constantly arriving and re-rendering the panels.

I looked at your PR. Wrapping the panel in a try/catch is not the correct way to handle such errors, because we won't receive new data to display until a refresh and panel re-render. We have to find the root cause of why it's happening and find a better approach to fix it.
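To illustrate the point about refresh and re-render, here is a minimal sketch of a panel that derives what it shows purely from the latest query result. The names `QueryState`, `QueryResult`, `PanelView`, and `viewFor` are hypothetical local stand-ins, not the real `@grafana/data` types and not the plugin's actual code. The assumption is that every refresh hands the panel a fresh result, so an error view is replaced as soon as a later query succeeds.

```typescript
// Local stand-ins for illustration only; NOT the real @grafana/data types.
type QueryState = 'Loading' | 'Done' | 'Error';

interface QueryResult {
  state: QueryState;
  errorMessage?: string;
  // Base64 image payloads returned by the query, one per row.
  images: string[];
}

type PanelView =
  | { kind: 'loading' }
  | { kind: 'error'; message: string }
  | { kind: 'empty' }
  | { kind: 'image'; base64: string };

// Derive the view from the latest query result only. Nothing is cached
// between calls, so an error view does not outlive the result that caused it.
function viewFor(result: QueryResult): PanelView {
  if (result.state === 'Loading') {
    return { kind: 'loading' };
  }
  if (result.state === 'Error') {
    return { kind: 'error', message: result.errorMessage ?? 'Query failed' };
  }
  const first = result.images[0];
  if (first === undefined) {
    return { kind: 'empty' };
  }
  return { kind: 'image', base64: first };
}

// Simulated sequence: a timeout, then a dashboard refresh that succeeds.
const timedOut: QueryResult = { state: 'Error', errorMessage: '500 timeout', images: [] };
const refreshed: QueryResult = { state: 'Done', images: ['aGVsbG8='] };
const views: PanelView[] = [timedOut, refreshed].map(viewFor);
```

In this sketch the panel recovers only because a second result arrives. If no refresh or stream delivers new data after the failure, the error view stays, which matches the behavior described above.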

groovybits commented 2 months ago

> @groovybits I discussed this issue with @vitPinchuk today and we were looking for a way to reproduce it. We have not seen such issues with WebSocket, MQTT (streaming), JSON, and SQL data sources so far.
>
> We are not familiar with Druid, and it would help if you could provide a test case with a docker-compose configuration so we can reproduce it locally.

I will attempt this. Currently I am using a basic Druid image in my docker-compose with a basic configuration.

I have 8 of the panels in one dashboard, each loading one image from Druid with a SQL query.

Often, if there are not 8 images within the last N minutes of the search time range, this breaks the images. The Druid data source seems to do this pretty consistently and easily. Setting Druid up manually is a pain, though, and you need to know all the steps, so it is hard to put together something that reproduces it without Druid being fully pre-configured for the data stream first.

> Have you tried increasing the retry and wait interval in the data source configuration to see if it solves the timeout issue?

Yes, I have the timeout set high, and it usually works now, whereas before it was happening constantly. Yet over 24 hours it still sometimes happens, and the images never load again after that until a manual reload of the webpage.

> I looked at your PR. Wrapping the panel in a try/catch is not the correct way to handle such errors, because we won't receive new data to display until a refresh and panel re-render. We have to find the root cause of why it's happening and find a better approach to fix it.

Oh, so it won't continue after this sort of catch and return of an empty null value?

That does sound painful to solve.

Is there some standard way Grafana's native plugins do this (they seem to)? Or is there some code before this point that could handle such an issue, or does it go back to the Druid data source?

Grafana's native plugins seem to always retry even after failures and never show a broken panel like this. Yet they have induced this image plugin to start showing broken panels when they temporarily break (or, behind the scenes, they cause the Druid instance to slow down and time out for the image panel, while they themselves recover from it).
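For reference, the "keep retrying after failures" behavior described here can be sketched as a generic retry loop with capped exponential backoff. This is only an illustration under assumptions: `backoffDelay`, `retryWithBackoff`, and their parameters are hypothetical names, and nothing here is confirmed to be how Grafana's native panels or the Druid data source actually recover.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Default wait implementation; injectable so callers can substitute their own.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` once and, on failure, up to `maxRetries` more times, waiting
// longer between attempts. The last error is rethrown when retries run out.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxRetries: number = 5,
  baseMs: number = 500,
  maxMs: number = 30000,
  wait: (ms: number) => Promise<void> = sleep
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries) {
        throw error;
      }
      await wait(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

A caller would wrap whatever function issues the query, for example `retryWithBackoff(() => runQuery())`, where `runQuery` is a placeholder for the real request. The cap keeps a long outage from producing ever-growing waits.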

groovybits commented 2 months ago

Also, I'm curious whether returning a static default image, or something else chosen as a fallback, could work here. If the network goes away and then comes back, how do we recover from any errors? Is the Druid data source possibly at fault for the recovery issue?

Right now I see Grafana handling this in native panels with the same Druid instance, but this panel does not seem to handle those issues and doesn't reload when the network is back up. Ideally it would never stop trying to get an image, whatever the situation. For our use case it would also be nice to display a default image when an image fails to load: the images are sequential frames from video, so a failed frame would ideally get a placeholder that indicates the frame failed to load, or an all-black image, etc.
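The fallback idea could look roughly like the sketch below. It assumes the query returns each frame as a base64 string and that the panel renders it through an `<img>` `src`. `PLACEHOLDER_SRC` and `imageSrcOrPlaceholder` are hypothetical names made up for this example and are not part of the plugin.

```typescript
// Hypothetical placeholder: a solid black 16:9 SVG encoded as a data URL.
const PLACEHOLDER_SRC: string =
  'data:image/svg+xml,' +
  encodeURIComponent(
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 9">' +
      '<rect width="16" height="9" fill="black"/></svg>'
  );

// Returns a value for an <img> src attribute: the frame as a data URL when a
// non-empty base64 payload is present, otherwise the placeholder.
function imageSrcOrPlaceholder(
  base64: string | null | undefined,
  mediaType: string = 'image/png',
  placeholder: string = PLACEHOLDER_SRC
): string {
  const payload = base64 ? base64.trim() : '';
  if (payload === '') {
    return placeholder;
  }
  return `data:${mediaType};base64,${payload}`;
}
```

With this shape a missing or empty frame still renders something, so a sequence of video frames keeps its layout and the failed frame is visibly marked. It does not by itself make the panel retry; that still depends on new data arriving.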

mikhail-vl commented 2 months ago

@groovybits I have an idea of what the issue can be, and we will investigate. Having a test case with Docker will help us test it.

@vitPinchuk Is it similar to https://github.com/VolkovLabs/business-forms/pull/490 ?

groovybits commented 2 months ago

Sounds great, I will try to put a simple one together when I get some time.

Note that I did get this change applied and tested locally. With it I can turn Wi-Fi off, turn it back on, and confirm the images load again after reconnecting. So that itself is great for us, but I see how it is not the fix for everyone, and I will try to get a test case together so the issue can be recreated. I suspect that simply bringing a Druid instance down and then back up would recreate the issue with any input image. It isn't that complex, and it looks like it is related to handling either getting nothing back or the connection to Druid breaking (it seems to be the connection breaking).

mikhail-vl commented 2 months ago

@groovybits Could you please test the CI artifact to confirm that the issue is fixed before we merge and release: https://github.com/VolkovLabs/business-media/actions/runs/10921077662

Seeing the plugin not recover when a 500 timeout occurs #143