eclipse-archived / smarthome

Eclipse SmartHome™ project
https://www.eclipse.org/smarthome/
Eclipse Public License 2.0
862 stars 783 forks source link

[onewire] smartmeter binding is blocked while onewire handles missing sensors #6759

Closed ggzengel closed 5 years ago

ggzengel commented 5 years ago

I use OH2.4.0. I have seen this first with binding from OH2.4.0 and OH2.5.0 snapshot. Now I see this with the last build from @J-N-K jar.

I don't know if KNX or other bindings are blocked too but smartmeter is the binding with the highest frequency and showed connection timeouts https://github.com/openhab/openhab2-addons/issues/4462. The frequency from smartmeter drops from 2 sec to more than 90 secs per value.

This happens if some sensors are missing because one of my owservers are going down. The old onewire binding from OH2.3 never had this problem.

While this error the CPU load of OH goes down to nearly zero.

I hope everybody is aware that a binding can affect other binding in such a dimension.

J-N-K commented 5 years ago

Simple: by adding 20+ so non-existing sensors you use up the thread-pool for configuring these sensors. This is clearly not a „normal“ situation but wrong configuration from your side.

ggzengel commented 5 years ago

No, they are there and I made an update at one of my owservers.

J-N-K commented 5 years ago

Without touching your configuration or OH at that time?

ggzengel commented 5 years ago

Perhabs I did I saved the config. I do a lot of configuration and testing here.

J-N-K commented 5 years ago

Same then: that‘s a problem of the textual configuration that cannot be solved within a binding. The thing is re-created every time you touch the configuration, then the thinghandler needs to update the thing with information from the physical device.

For PaperUI defined things this information is stored within the database and only requested once when the thing is created.

ggzengel commented 5 years ago

Some time after a power failure there a some systems not rebooting properly. Does this mean if some owserver is missing and openhab is rebooting than openhab will be stuck?

ggzengel commented 5 years ago

Can you ping the bridge (read "/") before you take down the bridge after reading 5 missing sensors. Every time you take down the bridge all good sensors go to offline which is not necessary.

J-N-K commented 5 years ago

At least for some time. I could add a re-try counter, so this stops after five failed attempts or so.

ggzengel commented 5 years ago

First I would like to know what locks up OH? It's not CPU and not memory. So it looks like a shared resources. But which?

I think it would be better if you don't take down the bridge and use a long timeout for not existent sensors. On weak buses there a sensors which didn't answer for hours.

J-N-K commented 5 years ago

There is a thread pool of fixes size (IIRC five or so) for thing handlers. That means that of all tasks scheduled for exeution by thing handlers (all thing handler) a maximum of five are executed at the same time. If more than five tasks are scheduled, they are queued until a thread is available.

Each sensor property update is scheduled as task. However, since reading from the OwserverConnection is synchronized, these tasks execute one after another and not in parallel (i.e. one task tries to read from the OwserverConnection, while the other wait for the first to complete, then the next one is executed). It takes some time for each read to time-out if the sensor is not available. Therefore this may block further execution of other tasks.

The only solution would be another thread pool but this is clearly against coding guidelines, part D.

ggzengel commented 5 years ago

Can you put a mutex around ow communications and return blocked threads with "not-ready" or so?

ggzengel commented 5 years ago

That's programming like old days for a non-preemptive OS.

J-N-K commented 5 years ago

I didn’t make the guidelines. Feel free to request changes.

ggzengel commented 5 years ago

@kaikreuzer How to work around blocking openhab in this case?

ggzengel commented 5 years ago

I don't want to imagine what happens if I migrate my 900 16bit DMX channels from knxdmxd to openhab.

J-N-K commented 5 years ago

Why?

ggzengel commented 5 years ago

If there are only 5 threads for managing all.

ggzengel commented 5 years ago

Do you know about EHS channels for getting queue length, load and other resources?

J-N-K commented 5 years ago

I have put a limit on re-tries, so the block will resolve itself. However, the result is that this will not recover by itself. And please keep in mind that this only happens if you

And it fixes itself after the owserver becomes available again.

Regarding the DMX: one thread is used per universe, knxdmxd uses one for the sender and one for the crossfade processor.

And: I‘m not sure if five is the correct number. I didn‘t look up in the code, that was just quoted from memory.

ggzengel commented 5 years ago

Thanks I will try. Did you ping the bridge before taking it down?

ggzengel commented 5 years ago

It needs very long to settle down after restarting and it looks like it's always waiting for the bridge to come up. In this time it stucks for more than 30 secs. Even onewire didn't log something.

16:25:08.915 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:25:08.916 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.5DBBBA030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:25:08.917 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_04 failed, retry 4/5: I/O error: exception while reading packet - Read timed out
16:25:08.917 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304
16:25:08.918 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType READ, size 45, controlFlags 0x00000104, payload '28.BAEC9B030000/type'
16:25:09.167 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000104, payload ''
16:25:12.171 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:25:12.172 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.BAEC9B030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:25:12.173 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_15 failed, retry 4/5: I/O error: exception while reading packet - Read timed out
16:25:12.173 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304
16:25:12.174 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType READ, size 45, controlFlags 0x00000104, payload '28.A777BC030000/type'
16:25:12.416 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000104, payload ''
16:25:15.421 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:25:15.422 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.A777BC030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:25:15.422 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_18 failed, retry 4/5: I/O error: exception while reading packet - Read timed out
16:25:15.423 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304
16:25:15.423 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType PRESENT, size 40, controlFlags 0x00000124, payload '26.D1B17F010000'
16:25:15.664 [TRACE] [.internal.owserver.OwserverConnection] - read: return code 0, size 32, controlFlags 0x00000124, payload '&ѱ????�'
16:25:15.664 [TRACE] [.internal.owserver.OwserverConnection] - presence 26.D1B17F010000 : ON
16:25:15.665 [TRACE] [.internal.owserver.OwserverConnection] - owServerConnection already open, skipping input buffer
16:25:15.665 [TRACE] [ndler.AdvancedMultisensorThingHandler] - refreshing analog onewire:bms:server1:Ow_BMS_KG_DG_HR
16:25:15.666 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType READ, size 45, controlFlags 0x00000104, payload '28.D377BC030000/type'
16:25:15.665 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - 'onewire:bms:server1:Ow_BMS_KG_DG_HR' changed from UNKNOWN to ONLINE
16:25:15.904 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000104, payload ''
16:25:58.963 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:25:58.964 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.D377BC030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:25:58.985 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_07 failed, retry 4/5: I/O error: exception while reading packet - Read timed out
16:25:58.986 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304
16:25:58.986 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType READ, size 53, controlFlags 0x00000104, payload '/26.D1B17F010000/temperature'
ggzengel commented 5 years ago

It will add non present sensors to refresh queue:

6:37:23.254 [TRACE] [.internal.handler.OwBaseBridgeHandler] - getting handler for Temperature sensor (70 to go)
16:37:23.255 [TRACE] [.internal.handler.OwBaseBridgeHandler] - Temperature sensor initialized, refreshing
16:37:23.255 [TRACE] [e.internal.handler.OwBaseThingHandler] - refreshing onewire:temperature:server1:Ow_T_KG_SP_PF_01
16:37:23.255 [TRACE] [.internal.owserver.OwserverConnection] - owServerConnection already open, skipping input buffer
16:37:23.257 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType PRESENT, size 40, controlFlags 0x00000124, payload '28.516BBA030000'
16:37:23.487 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000124, payload ''
16:37:26.489 [TRACE] [.internal.owserver.OwserverConnection] - timeout - setting error code to -1
16:37:26.489 [TRACE] [.internal.owserver.OwserverConnection] - closing connection because persistence was denied
16:37:26.490 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:37:26.491 [TRACE] [.internal.owserver.OwserverConnection] - presence 28.516BBA030000 : OFF
16:37:26.491 [TRACE] [e.internal.handler.OwBaseThingHandler] - sensor not present

16:39:11.651 [TRACE] [.internal.owserver.OwserverConnection] - owServerConnection already open, skipping input buffer
16:39:11.652 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType READ, size 45, controlFlags 0x00000104, payload '28.516BBA030000/type'
16:39:11.896 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000104, payload ''
16:39:14.900 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:39:14.901 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.516BBA030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:39:14.901 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 1/5: I/O error: exception while reading packet - Read timed out
16:39:14.902 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304

16:39:24.147 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 2/5: I/O error: exception while reading packet - Read timed out

16:39:32.439 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 3/5: I/O error: exception while reading packet - Read timed out

16:39:41.353 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 4/5: I/O error: exception while reading packet - Read timed out

16:39:50.204 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 5/5: I/O error: exception while reading packet - Read timed out

16:39:50.204 [DEBUG] [.internal.owserver.OwserverConnection] - failed requesting messageType READ, size 45, controlFlags 0x00000104, payload '28.516BBA030000/type'->return code -1, size 24, controlFlags 0x00000104, payload '' [I/O error: exception while reading packet - Read timed out]
16:39:50.204 [DEBUG] [e.internal.handler.OwBaseThingHandler] - updating thing properties for onewire:temperature:server1:Ow_T_KG_SP_PF_01 failed, retry 5/5: I/O error: exception while reading packet - Read timed out
16:39:50.204 [DEBUG] [.internal.owserver.OwserverConnection] - opened OwServerConnection to 127.0.0.1:4304
16:39:50.205 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - 'onewire:temperature:server1:Ow_T_KG_SP_PF_01' changed from INITIALIZING to OFFLINE (COMMUNICATION_ERROR): max. number of retries for thing update exceeded

16:40:13.238 [TRACE] [.internal.handler.OwBaseBridgeHandler] - getting handler for Temperature sensor (51 to go)
16:40:13.238 [TRACE] [.internal.handler.OwBaseBridgeHandler] - Temperature sensor initialized, refreshing
16:40:13.239 [TRACE] [e.internal.handler.OwBaseThingHandler] - refreshing onewire:temperature:server1:Ow_T_KG_SP_PF_01
16:40:13.239 [TRACE] [.internal.owserver.OwserverConnection] - owServerConnection already open, skipping input buffer
16:40:13.239 [TRACE] [.internal.owserver.OwserverConnection] - wrote: messageType PRESENT, size 40, controlFlags 0x00000124, payload '28.516BBA030000'
16:40:13.476 [TRACE] [.internal.owserver.OwserverConnection] - read: return code -1, size 24, controlFlags 0x00000124, payload ''
16:40:13.922 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_MAIN_SUM_O changed from 24761.165 to 24761.1653
16:40:13.924 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_MAIN_P changed from 693.4000000000001 to 692.1
16:40:14.649 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_WP_SUM changed from 51755.6281 to 51755.628800000004
16:40:14.650 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_WP_P changed from 1109 to 1118
16:40:15.478 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_MAIN_SUM_O changed from 24761.1653 to 24761.1656
16:40:15.480 [INFO ] [smarthome.event.ItemStateChangedEvent] - EHZ_MAIN_P changed from 692.1 to 691.1
16:40:16.058 [INFO ] [smarthome.event.ItemStateChangedEvent] - Move_EG_Buero changed from ON to OFF
16:40:16.479 [TRACE] [.internal.owserver.OwserverConnection] - timeout - setting error code to -1
16:40:16.480 [TRACE] [.internal.owserver.OwserverConnection] - closing connection because persistence was denied
16:40:16.480 [DEBUG] [.internal.owserver.OwserverConnection] - closed connection
16:40:16.481 [TRACE] [.internal.owserver.OwserverConnection] - presence 28.516BBA030000 : OFF
16:40:16.481 [TRACE] [e.internal.handler.OwBaseThingHandler] - sensor not present
16:40:16.482 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - 'onewire:temperature:server1:Ow_T_KG_SP_PF_01' changed from OFFLINE (COMMUNICATION_ERROR): max. number of retries for thing update exceeded to OFFLINE (COMMUNICATION_ERROR): slave missing
kaikreuzer commented 5 years ago

@kaikreuzer How to work around blocking openhab in this case?

First of all: The size is not hard-coded, it can be configured by setting

org.eclipse.smarthome.threadpool:thingHandler=<size>

in conf/services/runtime.cfg.

But clearly, it does not make much sense to have 100 threads running and waiting for connections/responses/timeouts from 1wire sensors, so increasing the thread pool size rather feels like a workaround than a proper solution.

In general, jobs that are executed in the thread pool should not block - at least not, if many of them are scheduled concurrently. Is there any way to refactor the binding to reactive behavior? I.e. to do asynchronous calls and have a single job accepting the responses and dispatching them?

J-N-K commented 5 years ago

I think I can refactor it to use the regular update-thread. That way only one thread will be blocked. Thanks for the suggestion, Kai.

kaikreuzer commented 5 years ago

Sounds like a idea to do so 👍.

ggzengel commented 5 years ago

@J-N-K I just had an idea. Why not opening multiple channels to one owserver? 1 init sensors 2 good sensors 3 missing sensors

J-N-K commented 5 years ago

Having more than one connection will make things unnecessary complex. The connection is only is reset if five consecutive reads fail, i.e. if five sensors are not present AND no read in between was successfull.

I have fixed the blocking here.