ausocean / cloud

GNU General Public License v3.0
1 stars 1 forks source link

tv: delay not long enough between hardwareStop and hardwareStart requests #313

Open trekhopton opened 1 month ago

trekhopton commented 1 month ago

The broadcast state machines are not allowing enough time for the hardware to power-cycle before retrying after failing to go live. This results in ineffective attempts to reboot the hardware and restart the stream. As you can see in the below example, a camera is powered off at 11:11:11, then powered on at 11:11:22, only 11 seconds later. We need to increase the delay or react to a hardwareOff event to ensure that the hardware has properly been turned off.

DEFAULT 2024-09-23T01:40:44.154058Z 2024/09/23 01:40:44 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: startFailedEvent
DEFAULT 2024-09-23T01:40:44.154063Z 2024/09/23 01:40:44 (name: KI TEST BROADCAST, id: F96HVRNctJI) storing event after cancel: startFailedEvent
DEFAULT 2024-09-23T01:41:03.994061Z 2024/09/23 01:41:03 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) got broadcast state machine; initial state: main.directStarting, start: 2024-09-22 23:30:00 +0000 UTC, end: 2024-09-23 07:30:00 +0000 UTC, cfg: (config logging disabled) Events: [startFailedEvent], HardwareState: hardwareOn
DEFAULT 2024-09-23T01:41:03.994078Z 2024/09/23 01:41:03 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing stored event: startFailedEvent
DEFAULT 2024-09-23T01:41:03.994082Z 2024/09/23 01:41:03 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: startFailedEvent
DEFAULT 2024-09-23T01:41:03.994086Z 2024/09/23 01:41:03 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) handling start failed event
DEFAULT 2024-09-23T01:41:04.874486Z 2024/09/23 01:41:04 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) transitioning from main.directStarting to main.directIdle
DEFAULT 2024-09-23T01:41:04.874495Z 2024/09/23 01:41:04 (name: KI TEST BROADCAST, id: F96HVRNctJI) stopping
DEFAULT 2024-09-23T01:41:08.976262Z 2024/09/23 01:41:08 (name: KI TEST BROADCAST, id: F96HVRNctJI) ID: F96HVRNctJI, requesting transition to complete status...
DEFAULT 2024-09-23T01:41:09.582152Z 2024/09/23 01:41:09 (name: KI TEST BROADCAST, id: F96HVRNctJI) could not stop broadcast on direct idle entry: could not complete broadcast: complete broadcast error: googleapi: Error 403: Invalid transition, invalidTransition
DEFAULT 2024-09-23T01:41:09.582162Z 2024/09/23 01:41:09 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: hardwareStopRequestEvent
DEFAULT 2024-09-23T01:41:10.233654Z 2024/09/23 01:41:10 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) handling hardware stop request event
DEFAULT 2024-09-23T01:41:10.877744Z 2024/09/23 01:41:10 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) transitioning from main.hardwareOn to main.hardwareStopping
DEFAULT 2024-09-23T01:41:10.877755Z 2024/09/23 01:41:10 (name: KI TEST BROADCAST, id: F96HVRNctJI) checking 44179313ed47.Power2 variable exists
DEFAULT 2024-09-23T01:41:11.071323Z 2024/09/23 01:41:11 (name: KI TEST BROADCAST, id: F96HVRNctJI) 44179313ed47.Power2 variable exists, setting to value: false
DEFAULT 2024-09-23T01:41:11.527178Z 2024/09/23 01:41:11 (name: KI TEST BROADCAST, id: F96HVRNctJI) checking 000009082308.mode variable exists
DEFAULT 2024-09-23T01:41:11.731116Z 2024/09/23 01:41:11 (name: KI TEST BROADCAST, id: F96HVRNctJI) 000009082308.mode variable exists, setting to value: Paused
DEFAULT 2024-09-23T01:41:14.649115Z 2024/09/23 01:41:14 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: timeEvent
DEFAULT 2024-09-23T01:41:14.649136Z 2024/09/23 01:41:14 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) handling time event: 2024-09-23 01:41:14.650222247 +0000 UTC m=+265869.613902727
DEFAULT 2024-09-23T01:41:14.649140Z 2024/09/23 01:41:14 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: startEvent
DEFAULT 2024-09-23T01:41:14.649143Z 2024/09/23 01:41:14 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) handling start event
DEFAULT 2024-09-23T01:41:15.281221Z 2024/09/23 01:41:15 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) transitioning from main.directIdle to main.directStarting
DEFAULT 2024-09-23T01:41:19.719619Z 2024/09/23 01:41:19 (name: KI TEST BROADCAST, id: F96HVRNctJI) today's broadcast has status: ready
DEFAULT 2024-09-23T01:41:19.719629Z 2024/09/23 01:41:19 (name: KI TEST BROADCAST, id: F96HVRNctJI) broadcast already exists with broadcastID: F96HVRNctJI, streamID: tnxOgHYhiDCCB0bFP94KKg1727055254729594
DEFAULT 2024-09-23T01:41:20.366467Z 2024/09/23 01:41:20 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: hardwareStartRequestEvent
DEFAULT 2024-09-23T01:41:21.014238Z 2024/09/23 01:41:21 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) handling hardware start request event
DEFAULT 2024-09-23T01:41:21.645428Z 2024/09/23 01:41:21 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) transitioning from main.hardwareStopping to main.hardwareStarting
DEFAULT 2024-09-23T01:41:22.045972Z 2024/09/23 01:41:22 (name: KI TEST BROADCAST, id: F96HVRNctJI) checking 44179313ed47.Power2 variable exists
DEFAULT 2024-09-23T01:41:22.240767Z 2024/09/23 01:41:22 (name: KI TEST BROADCAST, id: F96HVRNctJI) 44179313ed47.Power2 variable exists, setting to value: true
DEFAULT 2024-09-23T01:41:22.694273Z 2024/09/23 01:41:22 (name: KI TEST BROADCAST, id: F96HVRNctJI) checking 000009082308.mode variable exists
DEFAULT 2024-09-23T01:41:22.888209Z 2024/09/23 01:41:22 (name: KI TEST BROADCAST, id: F96HVRNctJI) 000009082308.mode variable exists, setting to value: Normal
DEFAULT 2024-09-23T01:41:23.364256Z 2024/09/23 01:41:23 (name: KI TEST BROADCAST, id: F96HVRNctJI) checking 000009082308.RTMPURL variable exists
DEFAULT 2024-09-23T01:41:23.559519Z 2024/09/23 01:41:23 (name: KI TEST BROADCAST, id: F96HVRNctJI) 000009082308.RTMPURL variable exists, setting to value: rtmp://a.rtmp.youtube.com/live2/m64w-134w-abva-d4a9-4y74
DEFAULT 2024-09-23T01:41:26.557514Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) handling time event
DEFAULT 2024-09-23T01:41:26.557524Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) checking status of device with mac: 151528200
DEFAULT 2024-09-23T01:41:26.948882Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) status from DeviceIsUp check: true
DEFAULT 2024-09-23T01:41:26.948902Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) publishing event: hardwareStartedEvent
DEFAULT 2024-09-23T01:41:26.948905Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) (broadcast sm) handling hardware started event
DEFAULT 2024-09-23T01:41:26.949009Z 2024/09/23 01:41:26 (name: KI TEST BROADCAST, id: F96HVRNctJI) starting youtube broadcast object
DEFAULT 2024-09-23T01:41:27.582404Z 2024/09/23 01:41:27 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) handling hardware started event
DEFAULT 2024-09-23T01:41:28.207170Z 2024/09/23 01:41:28 (name: KI TEST BROADCAST, id: F96HVRNctJI) (hardware sm) transitioning from main.hardwareStarting to main.hardwareOn
DEFAULT 2024-09-23T01:41:28.807167Z 2024/09/23 01:41:28 (name: KI TEST BROADCAST, id: F96HVRNctJI) waiting for ready status...
saxon-milton commented 1 month ago

A simple solution (not sure if the best) would be to check the hardware state before we considering starting a broadcast e.g. if it's still hardwareStopping, don't start yet.

saxon-milton commented 3 weeks ago

I think the problem is that the hardware restart happens more as a side effect of the stop request from the failure and then the new start request from the next start event. The new start event handling is not aware of the previous failure, as such, this is not an explicit attempt to fix the start failure through a hardware restart.

I think the SM would be better if we had a failure state that is entered after this sort of failure occurs. It could then explicitly restart the hardware as a measure to fix things. We might need to think about this more though.

Some other solutions;

In the idle state, when we get a start request, we check that conditions make sense to actually start, like the hardware being off (which should really be the case for the idle state; which brings us to the next option).

Alternatively, we have a "directStopping" state which we stay in until we get a "hardwareStoppedEvent", for which we transition to "directIdle". This makes sense; it parallels the "directStarting" state. This would work for stopping in the case of failure, and then the hardware actually gets restarted.

trekhopton commented 3 weeks ago

Yeah, I agree it requires more thought to be a nice solution. I don't mind your idea of a directStopping state. I'll have some more thought about it but for now I think it's okay that hardwareStartRequests are ignored so that at least the hardware gets power cycled properly.

I don't really think it's a good idea to change the state of the hardware machine when it's in a transitional state like hardwareStopping, that's why I think it actually makes sense to ignore the request. Perhaps it could wait for it to turn off and then jump straight to hardwareStarting but that complicates things a bit.

saxon-milton commented 3 weeks ago

I guess I'm concerned about any side effects of ignoring the start request; do you understand what would happen ? Also, do all the unit tests still pass ?

Thinking about it now, I actually think having a directStopping state makes the most sense, and is a relatively simple fix right now. Even if we come up with some sort of direct broadcast failure state (similar to permanent broadcasts) I think directStopping would remain.

trekhopton commented 3 weeks ago

Yeah I don't mind the directStopping state but if we can do without it it may be simpler.

I think I do understand what happens when the hardware machine ignores the start request: The hsm will just continue stopping like normal and eventually transition to hwOff. The bsm will stay in directLiveStarting state until the state times out after 10 minutes, at which point it triggers a startFailedEvent and transitions to directIdle again. The timeEvent will then cause it to try starting again, hopefully this time with the hardware being off. Similar in vidforward and permanent starting states too.

One side effect that is not desirable is the first retry is almost always going to fail because it happens too quickly, which is why I think this needs a bit more thought. We shouldn't have to wait 10 minutes for the first successful retry. I'll write an issue to capture this.

trekhopton commented 3 weeks ago

The unit tests do still pass.

I realised it probably doesn't need another issue since it's still pretty much what I described in #313. I just modified the hardware machine first. The broadcast machine still sends a hardwareStartRequest too quickly.

Perhaps that was somewhat confusing since the issue sounds like I'm going to modify the broadcast machine to delay the startRequest but I ended up changing the hardwareMachine behaviour. Sorry about that. I still think the hardwareMachine behaviour was problematic in that it allowed a startRequest to interrupt the stopping state. I suppose I don't like that behaviour because I think if a stop request is received we should be able to guarantee that the hardware was stopped.

saxon-milton commented 3 weeks ago

I guess the question is, who is responsible for the timing of the start and stop requests? I think it's the BSM, not the HSM. Therefore, I don't think the HSM should ignore start requests in the hardwareStopping state; I don't think there's a technical reason why we have to. In which case, the most sensible thing IMO would be to have a directStopping state.

The reason why we have a directStarting state is because it's transitional i.e. we're waiting for things to happen to go to the live state. Similarly, stopping is transitional, we have to wait for the hardware to stop etc before we go to idle. If these transitions happened immediately then there'd no need, but they don't, so I think the most correct design is to have the transitional states for both starting and stopping.

Having said all this, given that you understand the side effects of this, and you think it will fix the retry mechanism in the meantime, i reckon it's OK for now, but I really think we should create a directStopping state and then stop ignorning start requests.