Open trekhopton opened 1 month ago
A simple solution (not sure if it's the best) would be to check the hardware state before we consider starting a broadcast, e.g. if it's still hardwareStopping, don't start yet.
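To make the suggestion concrete, here is a minimal sketch of such a guard in Go. The `hwState` type, its constants, and `canStartBroadcast` are hypothetical names for illustration only; they are not taken from the actual code base.

```go
package main

// hwState is a hypothetical enumeration of hardware states. The names echo
// those used in this thread, but the type itself is an assumption.
type hwState int

const (
	hardwareOff hwState = iota
	hardwareStarting
	hardwareOn
	hardwareStopping
)

// canStartBroadcast reports whether a broadcast start should be attempted.
// It only refuses while the hardware is still stopping, as suggested above.
func canStartBroadcast(hw hwState) bool {
	return hw != hardwareStopping
}
```

A caller would skip (or defer) the start while this returns false, and try again on a later event.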
I think the problem is that the hardware restart happens more as a side effect of the stop request from the failure followed by the new start request from the next start event. The new start event handling is not aware of the previous failure, so this is not an explicit attempt to fix the start failure through a hardware restart.
I think the SM would be better if we had a failure state that is entered after this sort of failure occurs. It could then explicitly restart the hardware as a measure to fix things. We might need to think about this more though.
Some other solutions:
In the idle state, when we get a start request, we check that conditions make sense to actually start, like the hardware being off (which should really be the case for the idle state; which brings us to the next option).
Alternatively, we have a "directStopping" state which we stay in until we get a "hardwareStoppedEvent", at which point we transition to "directIdle". This makes sense; it parallels the "directStarting" state. This would work for stopping in the case of failure, and then the hardware actually gets restarted.
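A rough sketch of what the "directStopping" option could look like as a transition function, in Go. Only directIdle, directStarting, directStopping, startFailedEvent and hardwareStoppedEvent are names from this discussion; `directLive`, `startEvent`, `startedEvent`, `stopEvent` and the `next` function are assumptions made up for the example, and the real state machine is certainly structured differently.

```go
package main

type state string
type event string

const (
	directIdle     state = "directIdle"
	directStarting state = "directStarting"
	directLive     state = "directLive" // assumed name for the live state
	directStopping state = "directStopping"
)

const (
	startEvent           event = "startEvent"   // assumed name
	startedEvent         event = "startedEvent" // assumed name
	stopEvent            event = "stopEvent"    // assumed name
	startFailedEvent     event = "startFailedEvent"
	hardwareStoppedEvent event = "hardwareStoppedEvent"
)

// next returns the state that follows s when e is received. Events with no
// defined transition leave the state unchanged.
func next(s state, e event) state {
	switch s {
	case directIdle:
		if e == startEvent {
			return directStarting
		}
	case directStarting:
		switch e {
		case startedEvent:
			return directLive
		case startFailedEvent:
			// A failed start goes through stopping rather than straight to idle.
			return directStopping
		}
	case directLive:
		if e == stopEvent {
			return directStopping
		}
	case directStopping:
		// Stay here until the hardware confirms it has stopped.
		if e == hardwareStoppedEvent {
			return directIdle
		}
	}
	return s
}
```

With this shape, a start event that arrives while the hardware is still powering down is simply not acted on until directIdle is reached, so the next start always begins with the hardware off.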
Yeah, I agree it requires more thought to be a nice solution. I don't mind your idea of a directStopping state. I'll give it some more thought, but for now I think it's okay that hardwareStartRequests are ignored so that at least the hardware gets power cycled properly.
I don't really think it's a good idea to change the state of the hardware machine when it's in a transitional state like hardwareStopping, that's why I think it actually makes sense to ignore the request. Perhaps it could wait for it to turn off and then jump straight to hardwareStarting but that complicates things a bit.
I guess I'm concerned about any side effects of ignoring the start request; do you understand what would happen? Also, do all the unit tests still pass?
Thinking about it now, I actually think having a directStopping state makes the most sense, and is a relatively simple fix right now. Even if we come up with some sort of direct broadcast failure state (similar to permanent broadcasts) I think directStopping would remain.
Yeah I don't mind the directStopping state but if we can do without it it may be simpler.
I think I do understand what happens when the hardware machine ignores the start request: the HSM will just continue stopping as normal and eventually transition to hwOff. The BSM will stay in the directLiveStarting state until the state times out after 10 minutes, at which point it triggers a startFailedEvent and transitions to directIdle again. The timeEvent will then cause it to try starting again, hopefully this time with the hardware off. The same applies in the vidforward and permanent starting states.
One undesirable side effect is that the first retry is almost always going to fail because it happens too quickly, which is why I think this needs a bit more thought. We shouldn't have to wait 10 minutes for the first successful retry. I'll write an issue to capture this.
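For reference, the hardware machine behaviour described above can be sketched roughly as follows in Go. This is an illustration under assumptions, not the actual implementation: the `handle` function, the `hwState`/`hwEvent` types, and the names `hardwareOn`, `hardwareStopRequest` and `hardwareStartedEvent` are invented for the example, and `hardwareOff` stands in for the off state referred to as hwOff.

```go
package main

type hwState string
type hwEvent string

const (
	hardwareOff      hwState = "hardwareOff" // stands in for hwOff
	hardwareStarting hwState = "hardwareStarting"
	hardwareOn       hwState = "hardwareOn" // assumed name
	hardwareStopping hwState = "hardwareStopping"
)

const (
	hardwareStartRequest hwEvent = "hardwareStartRequest"
	hardwareStopRequest  hwEvent = "hardwareStopRequest"  // assumed name
	hardwareStartedEvent hwEvent = "hardwareStartedEvent" // assumed name
	hardwareStoppedEvent hwEvent = "hardwareStoppedEvent"
)

// handle returns the hardware state that follows s when e is received.
// Unhandled events leave the state unchanged.
func handle(s hwState, e hwEvent) hwState {
	switch s {
	case hardwareOff:
		if e == hardwareStartRequest {
			return hardwareStarting
		}
	case hardwareStarting:
		switch e {
		case hardwareStartedEvent:
			return hardwareOn
		case hardwareStopRequest:
			return hardwareStopping
		}
	case hardwareOn:
		if e == hardwareStopRequest {
			return hardwareStopping
		}
	case hardwareStopping:
		// Start requests are deliberately ignored here so that a stop
		// always completes; only the stopped event moves us on.
		if e == hardwareStoppedEvent {
			return hardwareOff
		}
	}
	return s
}
```

The consequence discussed in this comment follows directly: a start request sent during hardwareStopping is dropped, so the broadcast machine has to time out and send another one once the hardware is off.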
The unit tests do still pass.
I realised it probably doesn't need another issue since it's still pretty much what I described in #313. I just modified the hardware machine first. The broadcast machine still sends a hardwareStartRequest too quickly.
Perhaps that was somewhat confusing since the issue sounds like I'm going to modify the broadcast machine to delay the startRequest but I ended up changing the hardwareMachine behaviour. Sorry about that. I still think the hardwareMachine behaviour was problematic in that it allowed a startRequest to interrupt the stopping state. I suppose I don't like that behaviour because I think if a stop request is received we should be able to guarantee that the hardware was stopped.
I guess the question is, who is responsible for the timing of the start and stop requests? I think it's the BSM, not the HSM. Therefore, I don't think the HSM should ignore start requests in the hardwareStopping state; I don't think there's a technical reason why we have to. In which case, the most sensible thing IMO would be to have a directStopping state.
The reason why we have a directStarting state is because it's transitional, i.e. we're waiting for things to happen before we go to the live state. Similarly, stopping is transitional: we have to wait for the hardware to stop etc. before we go to idle. If these transitions happened immediately then there'd be no need, but they don't, so I think the most correct design is to have the transitional states for both starting and stopping.
Having said all this, given that you understand the side effects of this, and you think it will fix the retry mechanism in the meantime, I reckon it's OK for now, but I really think we should create a directStopping state and then stop ignoring start requests.
The broadcast state machines are not allowing enough time for the hardware to power-cycle before retrying after failing to go live. This results in ineffective attempts to reboot the hardware and restart the stream. As you can see in the below example, a camera is powered off at 11:11:11, then powered on at 11:11:22, only 11 seconds later. We need to increase the delay or react to a hardwareOff event to ensure that the hardware has properly been turned off.