Introduce a "ESH start level" functionality

tomhoefer commented 8 years ago

Hi all,

in our project we have a lot of event subscribers and registry change listeners implemented which are called during startup / shutdown of ESH as a matter of course. In the shutdown phase these services update their model accordingly, which results in the problem that the model cannot be rebuilt after the next startup (because the subscribers / listeners have assumed that the item / thing / link has really been deleted). We should distinguish between event sending / listener notification for adding / removal of items / things / links during framework startup / shutdown phase and normal runtime.

For this reason I would like to disable in our project that events are sent / listeners are notified during startup / shutdown. I could imagine two ways to implement this:

1. Providing a RuntimeStateService that can be queried to find out whether the runtime is started. To begin with, the service would consist of a single operation, boolean isStarted(), and it would be injected as a dynamic dependency into the registries (things, items, links, rules). I would then skip sending events / notifications if the service is present and the runtime / framework is not fully started.
2. Each service that requires the runtime state has to implement a RuntimeStateListener interface which is tracked by a central RuntimeStateService to be provided by solutions based on ESH. As soon as the runtime state has changed to started, the service will inform all listeners about this. Once the runtime has left the started state again, all listeners are informed as well. To begin with, I will have the AbstractRegistry implement the new RuntimeStateListener interface (concrete registries can decide whether the runtime state listener is to be provided as a service).
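
The second option could look roughly like the following plain-Java sketch. It leaves out all OSGi wiring, and apart from the names RuntimeStateService and RuntimeStateListener every class, method and field here (including the Registry stand-in for AbstractRegistry) is an assumption made for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class RuntimeStateSketch {

    /** Hypothetical callback for services that care about the runtime state. */
    interface RuntimeStateListener {
        void runtimeStateChanged(boolean started);
    }

    /** Hypothetical central service that tracks listeners and informs them. */
    static class RuntimeStateService {
        private final List<RuntimeStateListener> listeners = new CopyOnWriteArrayList<>();
        private volatile boolean started;

        void addListener(RuntimeStateListener listener) {
            listeners.add(listener);
            listener.runtimeStateChanged(started); // bring late joiners up to date
        }

        void setStarted(boolean started) {
            this.started = started;
            for (RuntimeStateListener listener : listeners) {
                listener.runtimeStateChanged(started);
            }
        }
    }

    /** Stand-in for AbstractRegistry: records events only while the runtime is started. */
    static class Registry implements RuntimeStateListener {
        final List<String> sentEvents = new ArrayList<>();
        private volatile boolean started;

        @Override
        public void runtimeStateChanged(boolean started) {
            this.started = started;
        }

        void add(String id) {
            if (started) {
                sentEvents.add("added:" + id);
            }
        }

        void remove(String id) {
            if (started) {
                sentEvents.add("removed:" + id);
            }
        }
    }

    public static void main(String[] args) {
        RuntimeStateService service = new RuntimeStateService();
        Registry registry = new Registry();
        service.addListener(registry);

        registry.add("item1");      // startup phase: suppressed
        service.setStarted(true);
        registry.add("item2");      // normal runtime: recorded
        service.setStarted(false);
        registry.remove("item1");   // shutdown phase: suppressed

        System.out.println(registry.sentEvents); // prints [added:item2]
    }
}
```

Because the registry keeps its own copy of the state, it does not depend on the state service still being registered during shutdown, which is the weakness of option 1 mentioned below.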

I think that 2 is the only valid option to implement this requirement. In 1 the runtime state service could unregister too early so that the events are sent / listeners are notified again.

What do you think?

maggu2810 commented 8 years ago

How do you want to decide that the framework is started or shut down? If I update one Eclipse SmartHome bundle that triggers some restarts (deactivate, activate) of bundles and / or services. Which one is responsible for a whole "startup" / "shutdown" state?

Isn't it product specific which services need to be available to signal "all present" / "normal runtime"?

tomhoefer commented 8 years ago

In our product we can rely on our OSGi runtime implementation to tell us that the framework is started properly. For solutions running e.g. on Equinox I thought of using a framework listener and listening for FrameworkEvent.STARTED.

maggu2810 commented 8 years ago

I have assumed you are talking of the Startup and Shutdown phase of the Eclipse SmartHome framework. But you refer to the start and stop of the OSGi Framework. Correct?

So, there are some options, using a FrameworkListener, using a SynchronousBundleListener for bundle 0 to handle on the stopping event, ...

Is the intention that the Eclipse SmartHome framework does not fire any event as long as the OSGi Framework is not fully started (starting up or shutting down)? But I assume we also need to react on restarts of special bundles / services etc. If a bundle is updated or services are restarted, which Framework or Bundle Events are triggered? The OSGi framework is still "started" but ESH bundles could disappear (but I could be wrong, I never watched all the events).

Which one needs to be observed to differ between "normal runtime" and non-normal one?

tomhoefer commented 8 years ago

> But you refer to the start and stop of the OSGi Framework. Correct?

Yes

> Is the intention that the Eclipse SmartHome framework does not fire any event as long as the OSGi Framework is not fully started (starting up or shutting down)?

Yes

> If a bundle is updated or services are restarted, which Framework or Bundle Events are triggered?

I think this depends on the OSGi runtime used. In our project we don't want to be informed if entities are added or removed during startup and shutdown. We already have a dedicated state that declares the framework as started.

> Which one needs to be observed to differ between "normal runtime" and non-normal one?

Especially ItemAddedEvent, ItemRemovedEvent, ThingAddedEvent and ThingRemovedEvent

Can you give me an example for which bundle / service you think we need to react on its restart?

maggu2810 commented 8 years ago

> Can you give me an example for which bundle / service you think we need to react on its restart?

No, not ATM. I need to think about the whole topic in more detail.

You have written you are doing this already, so I assume you know that it is working and how it is working (the architecture). I don't. :wink: Give me some time.

tomhoefer commented 8 years ago

Haven't started with the implementation yet ;) But because it is urgent I think that I will provide a PR in the following week.

kaikreuzer commented 8 years ago

I agree that it isn't easy to say whether the system is up or not. What does it mean if the OSGi framework keeps running, but ALL ESH bundles are fully stopped and restarted? I would consider this as ESH NOT being up - hence the feature should not be about the OSGi framework, but about ESH itself.

"Up" means for me that certain services have started and are available. How can this be determined and others be notified about reaching (or leaving) this state?

I see several use cases of such a feature (from recent discussions):

- Avoid Item/Thing/etc added/removed events when the system is only started/stopped and hence only reconstructs the status quo from the last up-time. I have seen the log being cluttered on shutdown with 1000 "item removed" events myself, which clearly makes no sense. Usually, "item removed" should mean that it has been removed from the system and won't re-appear automatically again. This is the use case @tomhoefer describes above.
- We recently introduced the XML processing vetoing (https://github.com/eclipse/smarthome/pull/1856) - this also just tries to make sure that a certain state (XMLs loaded) has been reached before starting other services (the thing handlers). The tricky thing here might be that it is more fine-grained, as it blocks single bundles depending on more detailed processing information.
- Very frequently the right moment for the startup rules is discussed. So far, they are potentially triggered when not all items have been restored yet in the registry, which causes all kinds of problems. For this it would also be helpful to be notified about some "system up" state, so that the rules can be safely executed.

sjsf commented 8 years ago

IMHO, a single state will not fit all of our needs. As @kaikreuzer pointed out, we e.g. have services that require other services to be up and running and fully loaded (whatever that means). Then again, there might be other services which depend on the previous ones to be started. So we will end up having several different levels of "active", like e.g. the start levels for bundles in OSGi. Additionally, the definition of these levels is going to differ for every solution built on top of ESH.

Generally, the introduction of such a framework state in a dynamic system usually is a workaround to cover up for maybe-not-so-ideal design decisions in other places. I would suggest to first look into the individual use-cases and see if we somehow can fix the root causes.

Regarding the Item/Thing/etc added/removed events, the root cause is that we cannot distinguish whether they were loaded or newly created (or removed/unloaded respectively). I'd suggest that we fix this and also let listeners/subscribers decide what kind of event/notification they actually require by either introducing new event types (i.e. ThingLoadedEvent, etc...) plus RegistryLoadListener interface, or amending the existing events and RegistryChangeListener with the corresponding information.
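
To make the "amend the existing notification with the corresponding information" variant concrete, here is a minimal plain-Java sketch. The Cause enum, the listener signature and the class names are all hypothetical and are not part of the existing ESH API; the point is only that a subscriber can tell creation from loading and removal from unloading:

```java
import java.util.ArrayList;
import java.util.List;

public class RegistryCauseSketch {

    /** Hypothetical reason why an element entered or left the registry. */
    enum Cause { CREATED, LOADED, REMOVED, UNLOADED }

    /** Hypothetical listener that receives the cause alongside the element. */
    interface RegistryListener {
        void changed(String thingUid, Cause cause);
    }

    /** Minimal registry stand-in that forwards changes to its listeners. */
    static class ThingRegistry {
        private final List<RegistryListener> listeners = new ArrayList<>();

        void addListener(RegistryListener listener) {
            listeners.add(listener);
        }

        void fire(String thingUid, Cause cause) {
            for (RegistryListener listener : listeners) {
                listener.changed(thingUid, cause);
            }
        }
    }

    /** Subscriber that mirrors the model and deliberately ignores load/unload. */
    static class ModelMirror implements RegistryListener {
        final List<String> model = new ArrayList<>();

        @Override
        public void changed(String thingUid, Cause cause) {
            switch (cause) {
                case CREATED:
                    model.add(thingUid);
                    break;
                case REMOVED:
                    model.remove(thingUid);
                    break;
                default:
                    break; // LOADED / UNLOADED only reflect startup / shutdown
            }
        }
    }

    public static void main(String[] args) {
        ThingRegistry registry = new ThingRegistry();
        ModelMirror mirror = new ModelMirror();
        registry.addListener(mirror);

        registry.fire("thing:1", Cause.CREATED);
        registry.fire("thing:1", Cause.UNLOADED); // shutdown
        registry.fire("thing:1", Cause.LOADED);   // next startup
        System.out.println(mirror.model);         // prints [thing:1]

        registry.fire("thing:1", Cause.REMOVED);  // really deleted
        System.out.println(mirror.model);         // prints []
    }
}
```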

sjsf commented 7 years ago

Okay, it has been a while now... As we can see, there recently have been quite some topics which relate to this issue, therefore I'd like to get back to it now. I still think we should avoid using such a "startup level" construct wherever possible! But I have to admit that there are some use-cases which won't really work without it (e.g. related to the rule engine).

There recently was a blog post by @pkriens which addresses this very topic. And I think we could realize our requirements with exactly this idea, using the OSGi means for our purpose. The relevant services that we need to wait for (e.g. XML processing per binding, providers being up and running) would somehow denote that they are "finished" by registering a marker service into the SCR, carrying some defined properties.

Our "AggregateStateService" however must be configurable, as not all the services are available in every solution. Imagine there would be a solution without support for DSL based configuration, then it really does not make sense to wait for the GenericProviders to finish their loading. I'd suggest using config admin for that purpose.

As a first step, I would drop the BundleProcessorVetoManager and use such OSGi services to mark fully loaded bindings accordingly.

As a next step, I would create an AggregateStateService and make all relevant entities denote that they are finished loading. The idea would be that every service that somehow needs waiting (e.g. a SystemStartupTrigger) would create a dependency to such an aggregated state only, not to the services themselves. By that we would decouple the dependency from a concrete service into a configurable one with a semantic meaning. At the same time this allows us to define different levels of "readiness" of the system. Of course, we need to carefully define all the required properties and states, as they somehow become "API for solution providers", i.e. they must not be a big pain to maintain and should change as seldom as possible.

Does this make sense to you? Any thoughts on this?

pkriens commented 7 years ago

Aren't there any companies that can run this through OSGi? This is a very foundational service and it belongs somewhere low in the stack like Equinox or Apache Felix?

I could provide an initial implementation since I already have it running.

tomhoefer commented 7 years ago

Is it possible to describe what this would mean for a solution that bases on ESH?

sjsf commented 7 years ago

> Is it possible to describe what this would mean for a solution that bases on ESH?

As I don't have all the details figured out I'm not 100% sure yet. Ideally there would be no implications at all. I hope that in the startlevel configuration we can tie the services which need to be waited for to the existence of certain bundles, therefore providing a default configuration which should work ootb. Worst-case: a solution needs to maintain a configuration which lists all services which are required to be ready in order to reach a certain "startup level".

pkriens commented 7 years ago

I think you should not try to have a global perspective but think local. For each subsystem, what should be ready before that subsystem can start? These 'things' need to be made into services. However, these are all local decisions that only care about the local situation. It is the global view that is so often violated, resulting in bugs.

sjsf commented 7 years ago

I think we are on the same page here. I was referring to having one global configuration for all the local decisions. But just to be sure, let me sketch up a little example (please ignore syntax, naming etc. for the moment) to illustrate my understanding:

Let's assume we have a couple of services which announce their "readiness" by registering a ReadyMarker service, each carrying a certain identifier:

component ThingManager
    provides
        ALL_THINGHANDLERS_INITIALIZED

component GenericThingProvider
    provides 
        GENERIC_THING_PROVIDER_LOADED

component ManagedThingProvider
    provides 
        MANAGED_THING_PROVIDER_LOADED

Now we have a component which technically requires those three components above to be "ready". However, as we need it to be configurable, we don't want it to depend on these ReadyMarkers directly, but rather define an abstraction over them:

level THINGS_READY_TO_USE
    requires
        GENERIC_THING_PROVIDER_LOADED (if bundle `o.e.sh.model.things` is present)
        MANAGED_THING_PROVIDER_LOADED
        ALL_THINGHANDLERS_INITIALIZED

This will be the configurable input for the AggregateStateService, which listens to all ReadyMarker registrations. If the configured list of ReadyMarkers shows up, it will register some kind of LevelMarker (which technically could be the same ReadyMarker service, just with a property named differently), which our component is then allowed to depend on:

component StartupRuleTrigger 
    requires 
        level THINGS_READY_TO_USE

Does that make sense?
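
The aggregation described above can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: markers and levels are plain strings, a boolean query stands in for registering a LevelMarker service, and the method names (defineLevel, markReady, unmark, isReached) are invented for this example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class AggregateStateSketch {

    /** Hypothetical aggregator: a level is reached once all of its markers are present. */
    static class AggregateStateService {
        private final Map<String, Set<String>> levels = new LinkedHashMap<>();
        private final Set<String> presentMarkers = new HashSet<>();

        /** Configurable input: which markers a level requires. */
        void defineLevel(String level, Set<String> requiredMarkers) {
            levels.put(level, new HashSet<>(requiredMarkers));
        }

        /** Called when a component announces that it is ready. */
        void markReady(String marker) {
            presentMarkers.add(marker);
        }

        /** Called when a component goes away again. */
        void unmark(String marker) {
            presentMarkers.remove(marker);
        }

        boolean isReached(String level) {
            Set<String> required = levels.get(level);
            return required != null && presentMarkers.containsAll(required);
        }
    }

    public static void main(String[] args) {
        AggregateStateService aggregate = new AggregateStateService();

        // Assumption: a solution without the o.e.sh.model.things bundle,
        // so the generic provider is not part of the requirement.
        boolean modelThingsBundlePresent = false;
        Set<String> required = new HashSet<>(Arrays.asList(
                "MANAGED_THING_PROVIDER_LOADED", "ALL_THINGHANDLERS_INITIALIZED"));
        if (modelThingsBundlePresent) {
            required.add("GENERIC_THING_PROVIDER_LOADED");
        }
        aggregate.defineLevel("THINGS_READY_TO_USE", required);

        aggregate.markReady("MANAGED_THING_PROVIDER_LOADED");
        System.out.println(aggregate.isReached("THINGS_READY_TO_USE")); // prints false

        aggregate.markReady("ALL_THINGHANDLERS_INITIALIZED");
        System.out.println(aggregate.isReached("THINGS_READY_TO_USE")); // prints true
    }
}
```

A component such as the StartupRuleTrigger would then depend only on the level name, never on the individual markers.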

pkriens commented 7 years ago

Yes, this is exactly what the AggregateState service does ... However, some of these things look a bit like normal service dependencies? Which always have the preference.

jboeddeker commented 6 years ago

In my case, the Homematic-Channel-not-found-for-Datapoint problem is also related to this.

https://community.openhab.org/t/homematic-binding-channel-not-found-for-datapoint-errors-for-definitely-existing-channels/26209

As soon as I removed the rules, the issue no longer occurred. With the rules in place, I quite often have 1 to 3 devices with this problem.

maggu2810 commented 6 years ago

It does perhaps not fit fully into the ESH start level concept, but I just played around a little with an idea and created a very simple demo.

See: https://github.com/maggu2810/shk/commit/6cd9e182369159410d3906fee6865a08cb023186

There is a SystemStateInjector service that could be used to inject information about a system state. The provider interface in the demo uses a string key and an object value that is defined by the user, but there could surely be limitations if only special functions are provided.

There is a SystemStateProvider service that provides the system state information (set by the injector), which can be retrieved through a member function. The applied states are also provided as service properties, so you could use the DS service target filter to wait for a special state.

The ThingsReadyProvider should demonstrate a service that injects the information that the things are ready. It waits for a thing handler service and a thing registry service and then injects the information.

The RulesReadyProvider acts similar to the ThingsReadyProvider.

There is a NeedRules service that should be activated only once the rules are marked as ready. Using the field reference annotation, the only thing to add are these two lines:

@Reference(target = "(rules=ready)")
protected SystemStateProvider ssp;

For 4.2 we could e.g. annotate an empty set function.

The service NeedRulesAndThings needs ready rules and ready things, the lines to use are:

@Reference(target = "(&(rules=ready)(things=ready))")
protected SystemStateProvider ssp;

For demonstration there is also a SystemStateMonitor that logs the updated properties.

After installing the two bundles in e.g. Karaf, you could use the scr:disable and scr:enable commands to disable the thing handler, thing registry or rule registry service. Have a look at the log and see the result.
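
For readers without the demo at hand, the gating effect of such target filters can be imitated in plain Java. This is not the code from the linked commit and it does not use DS at all: the SystemState and Consumer classes are made up here, and a simple "all required key/value pairs are set" check stands in for the `(&(rules=ready)(things=ready))` filter:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SystemStateSketch {

    /** Hypothetical consumer that is active only while all required states are set. */
    static class Consumer {
        final String name;
        final Map<String, Object> required;
        boolean active;

        Consumer(String name, Map<String, Object> required) {
            this.name = name;
            this.required = required;
        }
    }

    /** Hypothetical combination of state injector and state provider. */
    static class SystemState {
        private final Map<String, Object> states = new HashMap<>();
        private final List<Consumer> consumers = new ArrayList<>();

        void register(Consumer consumer) {
            consumers.add(consumer);
            reevaluate();
        }

        void inject(String key, Object value) {
            states.put(key, value);
            reevaluate();
        }

        void clear(String key) {
            states.remove(key);
            reevaluate();
        }

        /** AND semantics: every required key/value pair must be present. */
        private void reevaluate() {
            for (Consumer consumer : consumers) {
                consumer.active = states.entrySet().containsAll(consumer.required.entrySet());
            }
        }
    }

    public static void main(String[] args) {
        SystemState state = new SystemState();

        Consumer needRules = new Consumer("NeedRules",
                Collections.<String, Object>singletonMap("rules", "ready"));
        Map<String, Object> both = new HashMap<>();
        both.put("rules", "ready");
        both.put("things", "ready");
        Consumer needRulesAndThings = new Consumer("NeedRulesAndThings", both);

        state.register(needRules);
        state.register(needRulesAndThings);

        state.inject("rules", "ready");
        System.out.println(needRules.active + " " + needRulesAndThings.active); // prints true false

        state.inject("things", "ready");
        System.out.println(needRules.active + " " + needRulesAndThings.active); // prints true true

        state.clear("things"); // comparable to disabling the thing registry via scr:disable
        System.out.println(needRules.active + " " + needRulesAndThings.active); // prints true false
    }
}
```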

sjsf commented 6 years ago

As far as I can tell at first glance, this is pretty much along the lines of the basic idea we already discussed here. In your example it is indeed visible how nicely it would work having higher-level abstractions of "somethings" that need to be ready. That's what I tried with the ReadyMarkers, but on a too fine-grained level, and I got hit badly by the performance issue we spoke about in the other PR(s); therefore, for now, I turned this part into the ReadyService implementation we currently have. I still like the OSGi service dependency pattern much better though.

Nevertheless, what currently bothers me most (and I didn't really find the time yet to get my head around it) is the variance that the different ESH based solutions introduce. Let me explain it with a concrete example: Think about stuff like "all things need to be present" (things=ready) - some solutions have a GenericThingProvider, others don't, and some even have additional ThingProviders. Ideally we would have a point in time where we can tell that all ThingProviders worth waiting for are known, so the ThingRegistry can start waiting for all of them becoming active and having finished loading their stuff, and only then set the things=ready marker.

This however is so completely un-OSGi that I really think we should rather look closer on each individual case why we need all this in order to handle the dynamics and question this whole approach again and again...

pkriens commented 6 years ago

@sjka I think I share your view. The danger is that you start thinking global and that always falls apart in a component model. In general, you need to handle the dependency on the requirer side that has the actual knowledge of what it needs. I.e. a rule that needs X should not be evaluated before X is present. This is much better than waiting to start the rule engine until all devices have started. You need to address these things where you have concrete information (like X.1) instead of trying to handle it globally. Hope this helps.

maggu2810 commented 6 years ago

I considered things, rules, ... ready when the framework stuff is ready (thing handlers could start doing their work, rules could be processed, etc.). Is waiting for "all things need to be present" e.g. to be ready to execute rules possible at all? Think about a binding / thing handler that is fully initialized itself, but needs an undefined time until it can detect its things (if they are online) and communicate with them. Should the whole rule trigger "system started" wait for an undefined time? If a rule needs to access those things, perhaps it should be triggered by "thing online" instead.

What are the main "wait conditions" we need at all -- and which part should wait?

sjsf commented 6 years ago

> What are the main "wait conditions" we need at all -- and which part should wait?

Looking at the tons of issues which are linked against this one, I'm about to say: pretty much everything 😉 But that's exactly what I'd like to avoid - as tempting as it is.

However, in the end I think it's mainly about the rule engine(s). The other cases need to be looked into more deeply, and hopefully can be solved locally.

In the rule engine(s), the major pain-point are the "system started" triggers - all other triggers won't be triggered or executed anyway, because the system simply is not "ready enough" to generate and/or receive such events (e.g. ItemStateChangeEvent), so no problem there.

The linked issues mostly refer to "items not present" because this is the most obvious error when the language model cannot infer item references - but as you pointed out, this won't be enough: Once the items are there, we will run into the next problem: the linked things (as well as the links themselves, obviously) also need to be there - otherwise the items can be nicely resolved but any sent command ends up in nirvana. Speaking about that, the corresponding ThingHandlers obviously also need to be finished initializing. If they end up being OFFLINE because they cannot reach their devices: tough luck, this might always happen.

In an ideal world, we could analyze the rule actions for the items which are referenced and wait for their things to become ONLINE/OFFLINE/UNKNOWN. This however seems pretty much impossible with more advanced, dynamic scripts where e.g. items are looked up dynamically from the ItemRegistry. And even if we overcome this problem by only considering hard-referenced items and build a 90% approximation, it might still be surprising to users if e.g. multiple items are changed in a rule but one will never become "useable" because the corresponding binding is missing. Why doesn't it execute it for the others? Can't the computer "know" that this binding is missing?

> Is waiting for "all things need to be present" e.g. to be ready to execute rules possible at all?

This indeed is the key question! If we build something that isn't capable of solving this, then we won't win anything and don't even need to start.

jboeddeker commented 6 years ago

> In the rule engine(s), the major pain-point are the "system started" triggers - all other triggers won't be triggered or executed anyway, because the system simply is not "ready enough" to generate and/or receive such events (e.g. ItemStateChangeEvent), so no problem there.

No, in my opinion it's not just "system started". More problems are created from the ItemStateChanged triggers triggered for example by the persistence engines. And some bindings take more time to initialize than others.

maggu2810 commented 6 years ago

> More problems are created from the ItemStateChanged triggers triggered for example by the persistence engines.

Can you add more details? A persistence service can access the item registry on service activation and persist all non UnDefType.NULL states (WRT the discussion who is allowed to set the NULL state but that is currently mostly used by the framework on item creation only) to its storage. After it has been activated, it could store every item state change to the storage, too.

jboeddeker commented 6 years ago

Sorry, I think that was unclear. It's not the persisting of items but the restoring (strategy = restoreOnStartup) which causes the ItemChanged trigger to be fired. In my case this was a major problem, which was mostly solved when I excluded the change from NULL from the trigger condition.

//Item someitem changed
Item someitem changed from X to Y or 
Item someitem changed from Y to X

This change removed many of the startup exceptions.

mherwege commented 6 years ago

I would add two more cases that could cause issues with rules when the system is started. I have seen all of these when starting openHAB. A few restarts usually get me over the problem, but that’s not very nice.

- cron-triggered rules, triggered when the system has not fully initialized all its items yet
- a mix of items defined in items files and through Paper UI: this can cause issues if one set is loaded and the other set is not loaded yet. The rule could be triggered on the item from the loaded set, but still fail because it does not find another item referenced in the rule body. If this happens, the rule engine may generate a syntax error and never run the rule again.

maggu2810 commented 6 years ago

Should a rule be triggered at all if

- items are used that are not available
- items are available but not linked
- items are available, linked, but thing has no handler assigned
- items are available, linked, handler assigned, but thing is offline
- ...

Isn't the rule engine a special use case? I don't think that could be solved with a global "system is started and rules could be executed" state at all. Isn't it something that could be known by the rule writer only: whether the items need to have linked channels (and so things) or not, whether the things already need a handler or not, whether the thing itself needs to be online or not, ...? Do you really think that every "user" wants the same stuff for the same use case (especially WRT whether thing communication should be established)?

adimova commented 6 years ago

> Should a rule be triggered at all if

I agree with @maggu2810, such rules should not become IDLE. The problem is that currently the ModuleHandlers - which have the needed information - have no way to inform the RuleEngine of their state, and the changes in their state. I've proposed a solution in my comment in #4468.

kaikreuzer commented 6 years ago

@maggu2810 for this issue here, we are only talking about services that need to be fully started in the first place as a pre-condition to consider any kind of rule execution. Whatever might happen during normal operation time (items not there, things offline, whatever) is not relevant for this issue here, but is indeed something that needs to be handled in the appropriate components.

lolodomo commented 6 years ago

Bump 6 months later. Is there really no solution we could implement? The different problems caused by rules being started much too early are the most important issue in openHAB. Hopefully, it is not a blocking issue. Is there no way to add a setting to delay the startup of the rule engine? With such a setting, I would delay the startup by 2 minutes and 99% of the problems would be solved.

maggu2810 commented 6 years ago

@lolodomo For ESH itself we need a clean solution.

For downstream projects or at least for your setup at home you can easily delay the startup of the automation part by adding a bundle that does nothing but delay the automation activation. I tested a simple demo here that delays the bundle start:

- Should be instantiated, opened and closed by a bundle activator: https://github.com/maggu2810/shk/blob/delayed-start/bundles/shk-addon-delayed-automation-start/src/main/java/de/maggu2810/shk/addon/das/impl/automationcore/DelayedAutomationStart.java
- Watch the bundle events: https://github.com/maggu2810/shk/blob/delayed-start/bundles/shk-addon-delayed-automation-start/src/main/java/de/maggu2810/shk/addon/das/impl/automationcore/BundleListenerImpl.java

You can improve it to start the delay as soon as e.g. smarthome core has been started, special services are available, ...

-- edit --

I improved the implementation to delay the activation of the automation bundle IF other service references are satisfied, and to stop the bundle if those references are not available anymore. See e.g. https://github.com/maggu2810/shk/blob/delayed-start/bundles/shk-addon-delayed-automation-start/src/main/java/de/maggu2810/shk/addon/das/impl/automationcore/CheckAutomationRequirements.java: if the thing registry and the item registry are available, the automation core bundle will be started with a delay of 15 seconds; otherwise the automation bundle is stopped.
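
The "start with a delay while requirements are satisfied, stop otherwise" behaviour can be sketched without any OSGi API. The following is not the code from the linked repository; DelayedStarter and its methods are invented for illustration, the two Runnables stand in for starting and stopping the automation bundle, and the delay is shortened so the example finishes quickly:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedStartSketch {

    /** Hypothetical helper: starts something after a delay, stops it immediately. */
    static class DelayedStarter {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final long delayMillis;
        private final Runnable start;
        private final Runnable stop;
        private int generation;   // bumped on every change; invalidates older timers
        private boolean running;

        DelayedStarter(long delayMillis, Runnable start, Runnable stop) {
            this.delayMillis = delayMillis;
            this.start = start;
            this.stop = stop;
        }

        /** Called whenever the required services appear or disappear. */
        synchronized void requirementsChanged(boolean satisfied) {
            final int token = ++generation;
            if (satisfied) {
                if (!running) {
                    scheduler.schedule(() -> startIfCurrent(token), delayMillis, TimeUnit.MILLISECONDS);
                }
            } else if (running) {
                running = false;
                stop.run();
            }
        }

        /** Runs on the scheduler thread; does nothing if the state changed meanwhile. */
        private synchronized void startIfCurrent(int token) {
            if (token == generation && !running) {
                running = true;
                start.run();
            }
        }

        synchronized boolean isRunning() {
            return running;
        }

        void shutdown() {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayedStarter starter = new DelayedStarter(100,
                () -> System.out.println("automation started"),
                () -> System.out.println("automation stopped"));

        starter.requirementsChanged(true);   // registries available: start after the delay
        Thread.sleep(300);
        starter.requirementsChanged(false);  // a registry went away: stop immediately
        starter.shutdown();
    }
}
```

Note that in this sketch every change of the requirements restarts the delay, which is one possible design choice among several.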

kaikreuzer commented 5 years ago

I just came across https://github.com/apache/felix/tree/trunk/systemready - this sounds like a very nice fit for our issue and is probably worth investigating further. @cschneider As you seem to be the main author of that project, please feel free to comment/advise here - if you do not think that it fits or that it is still at too early a stage, this would be helpful input as well 😎.

cschneider commented 5 years ago

Systemready is still in an early stage. We currently mainly use it to report ready and alive for kubernetes. There is also a similar concept in sling called health checks. Last Wednesday I talked with the creators of this and we found quite a few things that should be added to systemready.

The main missing thing we found is having tags for system checks. Each tag could then represent one of the subsystems you talked about. These tags might then replace the ready and alive types. Other things are executing each check separately and failing it if it takes too long or blocks. I will create some issues on systemready. Any help with that is welcome. So I think systemready should be usable soon.

Generally for determining readiness it is not good enough to look at framework started or the fact that all bundles are started. Especially with declarative services a service might appear completely asynchronous from the bundle start. So a list of required services is the only stable way. Unfortunately we are having quite some difficulties creating and managing such a list for AEM. I wonder if a special annotation could help with that (like adding a tag to a service) that is then reflected in the Manifest.

I am not sure though if I would use this for switching the internal eventing of ESH on/off. Maybe there is a different solution for this. How about having different events for a thing that really appears on the binding and a thing that is merely recreated because of a startup? In the same way, when shutting down it should be clear whether a thing is removed externally or just because of the shutdown.

eclipse-archived / smarthome

Introduce a "ESH start level" functionality #1896