OptimistikSAS / OIBus


In case of sync protocols such as OPCUA / Modbus, OIBus becomes a single point of failure. How to mitigate that? #171

Open marouanehassanioptimistik opened 5 years ago

kukukk commented 5 years ago

This may not be accurate, but this is how I see it.

A SPOF can be caused by two things:

Problem with the application: This can be caused by a crash, but an update procedure can also cause an outage. You have to decide what kind of uptime you want to guarantee, and whether an outage for an app update is acceptable or not. We could think about a solution to monitor the application and send a notification when it's down, so the IT administrator gets notified about the problem and can intervene as soon as possible. Maybe offer some way to integrate into company-wide monitoring applications. We could also think about a patching solution for upgrading the application, so it doesn't have to be stopped, uninstalled, installed, and started, which would reduce the outage.

Problem with the hardware running the application: To avoid this problem we should implement a redundancy solution, where 2 OIBus instances run on 2 different machines. They could continuously monitor each other, and when the primary instance goes down, the secondary could take over.
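
To make the idea more concrete, here is a rough sketch of what such mutual monitoring could look like. Everything below is hypothetical (the /health endpoint, the thresholds, the promotion logic); nothing like this exists in OIBus today:

```ts
// Hypothetical heartbeat monitor running on the backup instance: it polls a
// health endpoint assumed to be exposed by the primary and takes over data
// collection after several missed heartbeats. Nothing here exists in OIBus.
const PRIMARY_HEALTH_URL = 'http://primary-host:2223/health'; // assumed endpoint and port
const HEARTBEAT_INTERVAL_MS = 5000;
const MAX_MISSED_BEATS = 3;

let missedBeats = 0;
let isPrimary = false;

async function checkPrimary(): Promise<void> {
  try {
    const response = await fetch(PRIMARY_HEALTH_URL, { signal: AbortSignal.timeout(2000) });
    missedBeats = response.ok ? 0 : missedBeats + 1;
  } catch {
    missedBeats += 1;
  }

  if (!isPrimary && missedBeats >= MAX_MISSED_BEATS) {
    isPrimary = true;
    // At this point the backup would start its South connectors and notify the administrator.
    console.warn('Primary unreachable, backup instance taking over data collection');
  }
}

// The backup polls the primary every few seconds.
setInterval(checkPrimary, HEARTBEAT_INTERVAL_MS);
```

The same loop could also help with the first point: if neither instance answers, a company-wide monitoring system would know the application itself is down and could alert the IT administrator.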

jfhenon commented 4 years ago

OSIsoft doc: Hot, warm, and cold failover modes

The failover mode specifies how the backup interface instance handles connecting to a data source and adding points when failover occurs. The sooner the backup interface can take over data collection, the less data is lost. However, increasing the failover level also increases data source load and system resource usage.

To determine which mode to use, consider how much data you can afford to lose and how much workload your system can handle. Be prepared to experiment, and consult your data source documentation and vendor as needed.

UniInt provides three levels of failover: cold, warm and hot. Higher ("hotter") levels preserve more data in the event of failover, but impose increasing workload on the system.

Hot failover: Hot failover is the most resource-intensive mode. Both the primary and backup interface instances collect data. No data is lost during failover (unless both the primary and backup interface nodes fail together), but the data source carries a double workload.

Warm failover: In a warm failover configuration, the backup interface does not actively collect data. The backup interface loads the list of PI points and waits to collect data until the primary interface fails or stops collecting data for any reason. If the backup interface assumes the role of primary, it starts collecting data. Some data loss can occur in a warm failover configuration.

Cold failover: In cold failover, the backup instance does not connect with the data source or load the list of PI points until it becomes primary. This delay almost always causes some data loss but imposes no additional load on the data source. Cold failover is required in the following cases: a data source can support only one client, or you are using redundant data sources and the backup data source cannot accept connections.
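
To make these three modes concrete in an OIBus context, a backup instance could roughly behave as in the sketch below. All of the types and functions are hypothetical; none of them exist in OIBus:

```ts
// Hypothetical illustration of the three failover levels described above.
type FailoverMode = 'hot' | 'warm' | 'cold';

interface SouthConnector {
  connect(): Promise<void>;    // open the connection to the data source
  loadPoints(): Promise<void>; // load the list of points to collect
  startCollecting(): void;     // begin polling / subscribing for data
}

// What the backup does while the primary is still healthy.
async function prepareBackup(mode: FailoverMode, south: SouthConnector): Promise<void> {
  switch (mode) {
    case 'hot': // collects in parallel: no data loss, but double load on the source
      await south.connect();
      await south.loadPoints();
      south.startCollecting();
      break;
    case 'warm': // connected and ready, but not collecting yet: some data loss possible
      await south.connect();
      await south.loadPoints();
      break;
    case 'cold': // does nothing until promoted: no extra source load, most data loss
      break;
  }
}

// What the backup does when the primary fails and it is promoted.
async function promoteToPrimary(mode: FailoverMode, south: SouthConnector): Promise<void> {
  if (mode === 'cold') {
    await south.connect(); // cold backups only connect and load points now
    await south.loadPoints();
  }
  if (mode !== 'hot') {
    south.startCollecting(); // hot backups were already collecting
  }
}
```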

kukukk commented 4 years ago

Since we rewrote OPCUA, it is no longer a SPOF. However, we still have Modbus and MQTT.

These failover modes require the redundancy solution I mentioned in my second point above (for hardware failure).

Do you have any specific requirements from clients?

jfhenon commented 4 years ago

No, I have started thinking about this and collecting ideas, but we have not decided to work on it yet.

kukukk commented 4 years ago

If I remember correctly, the OPC HDA server at the client had a limitation on the number of clients (if you killed OIBus, you had to wait a few minutes to be able to connect to it again). So, cold failover could be a real use case.

The failover will require continuous interaction between the master and backup instances. I think we can implement this interaction in such a way that it supports all 3 failover types:

It may require a small refactor of some South implementations to properly follow the same flow: connect to the target server in connect() and get data in onScan().

It also requires synchronization between the master and backup. I'm thinking about the lastCompletedAt value, but there could be other information too.
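
Something like the rough sketch below is what I have in mind. The connect()/onScan() split follows the flow described above; the replicateToBackup() part is purely hypothetical and only shows the kind of state that would need to be shared:

```ts
// Sketch of a South connector that only connects in connect() and only reads
// data in onScan(), while replicating its progress (lastCompletedAt) to the
// backup instance. replicateToBackup() does not exist in OIBus.
class ExampleSouth {
  private lastCompletedAt = new Date(0);

  // Only establish the connection to the target server here, never fetch data.
  async connect(): Promise<void> {
    // protocol-specific connection code goes here
  }

  // Retrieve data on each scan, then record and replicate the progress so a
  // backup taking over can resume from the right point in time.
  async onScan(scanMode: string): Promise<void> {
    const now = new Date();
    // read values between this.lastCompletedAt and now, then forward them to North
    this.lastCompletedAt = now;
    await this.replicateToBackup({ scanMode, lastCompletedAt: this.lastCompletedAt });
  }

  // Hypothetical: push the minimal failover state to the backup instance,
  // for example over HTTP or through a shared store.
  private async replicateToBackup(state: { scanMode: string; lastCompletedAt: Date }): Promise<void> {
    // the exact transport is an open question
  }
}
```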

kukukk commented 4 years ago

Any decision regarding this issue?

jfhenon commented 4 years ago

Not yet. We are waiting for a customer case before engaging in this. In the meantime, we should add some additional tests to the backend.