bitfocus / companion-module-thelightingcontroller

MIT License
4 stars 1 forks source link

Companion crashing LightingController #5

Closed tacvideo closed 1 year ago

tacvideo commented 1 year ago

Been having a lot of stability issues with LightingController for a while now, I've always just assumed it was LightingController (LC), even had them helping me out with stabilizing that environment. But I've just noticed over the past few weeks that if I turn off the Connection in Companion, LC doesn't crash.

It'll happen randomly, like middle of the night with nobody even in the building. But over the past few weeks I've disabled the connection when not using the system, then like this past Sunday morning, I enabled the connection and LC crashed 3 times in 30 minutes, then luckily was stable till end of service.

I have no idea where to look for logs (on the LC side) that might help pin point anything... I have been collecting the usual Apple Crash logs, but the LightingController guys (Chauvet) said they were not helpful.

We were on an old 2012 Mac Mini, but recently I put an M1 with latest OS in place to see if that made any difference - it didn't.

I'd really like to stay on LC, but at this point I'm under pressure to replace it which is a ton of work!

thargy commented 1 year ago

I am not associated with Chauvet, Showtec or the Lighting Controller in any way; I developed this module for use with the system my church uses, and shared it with the community. Sadly, the last time I looked at this was in March 2021.

However, have you seen this issue? The details are a little vague, but if I recall correctly the latest version of the Lighting Controller (aka QuickDMX) notifies you of all button changes, even if those buttons are on pages that are purportedly not visible to external apps. That was causing my code to trigger a refresh to update its db of available buttons, (as it had identified something had changed) which was causing the spamming of refreshes. I fixed that issue, by preventing the auto-refresh behaviour, and I notified LC to update their software, which I don't know if they did. However, the recommendation was to ensure you are exposing all your button pages to prevent the receipt of unknown button changes.

The other issue is that the WebAPI exposed by their software is synchronous and blocking (which is really poor design, it really should not interfere with the UI thread) and so lots of traffic can slow down the lighting controller, and cause it to block for periods (though it hasn't really crashed that much for me on PC and I have 10s of pages and 100s of buttons).

I recommend you ensure you have Companion is set to log debug messages and then see if there's anything being reported in the companion logs. If those logs are getting spammed it could be that you have way too much traffic going on and the WebAPI is blocking. The module does output plenty of useful debug logs which might help us figure out what's going on, so feel free to add some logs to the issue and I'll take a look.

tacvideo commented 1 year ago

Thanks for that detailed explanation @thargy. I have cleaned up all unused tabs and pages and made sure those I kept were "made visible", hopefully this does the trick as LC did crash during service yesterday and again after (I didn't have time to clean up before service), but the Debug log didn't show any signs of "spamming" - see 2022-11-20T20:04:33.001Z (we're on MST -7 hours) in attached logs...

TheLightingController-2022-11-20-130432.txt VideoMaster-iMac_companion_log_20221120-1306.csv

thargy commented 1 year ago

OK, so a quick review of your logs shows that you are refreshing the interface every second. That means that Companion is querying QuickDMX once a second to get a full list of the pages & buttons, which is the most intensive operation, e.g.:

2022-11-20T20:04:23.186Z | instance(DMX) | debug | API: No interface changes found.
2022-11-20T20:04:24.194Z | instance(DMX) | debug | API: No interface changes found.

Technically, QuickDMX should tell Companion when the interface changes, (via a 'INTERFACE_CHANGE' command) but it used to be unreliable, hence I added the ability to poll for changes. Further, I used to trigger a refresh whenever I received information about an unknown button/page (which was the functionality I removed in #2 due to that API bug that returned info about hidden pages). When I was initially writing the software I didn't know about the blocking, and I wasn't using lots of pages and buttons, so I didn't identify the problems that it could cause. The auto-refresh is handy when making lots of changes in QuickDMX that you want to appear in Companion quickly, and the auto-refresh + refresh on unknown buttons, made it super responsive during testing. However, once you've set up your buttons and pages in QuickDMX, it's unnecessary.

As mentioned above, the API is blocking, and returning the full interface definition takes QuickDMX some time. Hitting it every second when there are a lot of pages or buttons is likely to block the UI indefinitely, or at least enough for it to crash due to some other bug.

It is clear from your logs that it is QuickDMX that is crashing, rather than Companion, and I could see the example where my auto-reconnect code was working perfectly whilst QuickDMX was restarted (when a connection disconnect is detected it regularly polls until a connection is re-established). Note, this isn't a bug with Companion or my module, the API should run on a background, and you shouldn't be able to crash QuickDMX via that API, for obvious reasons, particularly when I only ever send valid commands to the API.

This is probably because you've set the refreshInterval setting to 1000 (i.e. 1s). That is the minimum value for that setting, and it is that high, specifically to prevent spamming QuickDMX. The default is 10000 (i.e. 10s), and this shouldn't cause problems. Further, you can blank the value which should stop it auto-refreshing altogether, in which case it will only update the pages/definitions when you restart either QuickDMX or Companion or when QuickDMX tells Companion via the 'INTERFACE_CHANGE' command.

Bear in mind, that you only need to refresh the interface when you make changes to buttons/pages in QuickDMX - which you probably don't do very often, so having a blank refreshInterval, or a much higher value (up to 10000000, ~2 hrs 46m) will reduce the load significantly and probably stop the crashes entirely.

Also, there's a 'Refresh Interval' action which you can bind to a button to manually trigger a refresh instead, which is a good idea when combined with no auto-refresh (blank refreshInterval). It is possible that the INTERFACE_CHANGE command is now working reliably, but I haven't tested that.

Once you blank the refresh interval, the above logs should no longer appear (unless you trigger a manual refresh); and I suspect you will have far fewer issues with QuickDMX.

If you have loads of pages, hiding them from the WebAPI will speed up interface refreshes, but it doesn't prevent unknown button changes being received. As I've removed the auto-refresh in that scenario, then that is no longer a problem; so you may find it useful to again hide pages you don't want exposed via the WebAPI.

TL;DR

I hope that helps!

tacvideo commented 1 year ago

Don't remember why I have refreshInterval, possibly from an old version of Companion. I do want to see the status of LC buttons (it's one reason I don't want to change lighting software) but I just cleared refreshInterval and see that I still see the status of lighting button changes, so thats awesome!

Hoping that solves the crashes!!

Thanks for your help!

thargy commented 1 year ago

I do want to see the status of LC buttons

Yes, changes are not based on refresh polling. The Refresh allows Companion to get an updated list of all available button, pages, faders, etc. and info about them (e.g colour, text, location, etc.), info that isn't passed every time a button is pressed/released, or a fader changes.

Actual button/fader changes are pushed from LC to Companion, as soon as they happen. However, for Companion to make use of those events it needs to have all the info from the full refresh - which is something it does whenever it connects, and, optionally, via polling.

Unless you add/remove/move/edit buttons, pages or faders you don't need to do a refresh. Once you set up your lights you rarely make changes that require a refresh; and if you do, you can add an action to a button to do a refresh manually anyway.

However, if you do move a button or page then button changes will be incorrectly mapped until you do a refresh. That's because, unlike button presses etc. LC doesn't reliably tell you about the change. There is a message that should tell you, and maybe it works these days, but when I built the module I wasn't getting the message in testing.

I hope that clears up the confusion.

I will close this issue for now. Feel free to let me know if you continue to have problems.

tacvideo commented 1 year ago

@thargy , Unfortunately after being stable all week cleaning up unused buttons/pages, and turning off the polling, we had another crash 3 minutes into the message - at 2022-11-27T17:38:41.846Z in attached logs - only thing the operator was doing was triggering slides in PP7 with "Next" and/or using USK on the ATEM to put the slides/scriptures up.

It was 3 minutes after the last DMX command was sent, a button in Companion that Transitions (from a Video/Picture) to a Live scene, which had been used at least twice before during the service. The logs appear "clean" so I'm at my wits end here...

VideoMaster-iMac_companion_log_20221127-1039.csv VideoMaster-iMac_companion-config_20221127-1040.zip

image
tacvideo commented 1 year ago

Is there anyway to see logs on the LightingController machine, like incoming API requests? It is a closed/secure network so there shouldn't be any nefarious traffic but who knows, we crashed last Sunday at pretty much the same point in the service - about 3 minutes into the message - this is a mighty big coincidence...

thargy commented 1 year ago

I've reviewed the companion logs, and it's good to see that it isn't polling, which is what we expect. Indeed, the last message sent from Companion was some minutes before (as you note); and the instance auto-recovered correctly when you restarted TLC. As such, the problem is unrelated to Companion or this module.

Is there anyway to see logs on the LightingController machine, like incoming API requests? It is a closed/secure network so there shouldn't be any nefarious traffic but who knows, we crashed last Sunday at pretty much the same point in the service - about 3 minutes into the message - this is a mighty big coincidence...

Pretty sure it'll be a coincidence, but if you're worried you can change the TCP in TLC (File->Preferences->Network), and you can change the External Control password in 'Live Settings'. You can then update both these settings on the module.

Companion communicates using TCP over the specified port (default 7348). You can get software to intercept those ports depending on your Operating System, and record what's happening, but again, I think it's pretty unlikely to be the problem.

As I mentioned previously, it's not the best designed software, and I too have experienced crashes (though maybe only once a year on Windows). In my experience, it's mostly been related to the USB drivers, or the device becoming unresponsive and the software then crashing. Ultimately, I wrote the module to do it's best to recover if TLC does crash on you, and I use the API in accordance with the spec (and what their own app does). All I can suggest is you contact their support again and see if you can get a crash log that they can use. As I mentioned, I'm not associated with them in any way, and they've never been particularly responsive to support queries and bugs I've raised.

The Companion team is the process of producing version 3 and aren't accepting new features for existing modules. As a result, I have started looking at migrating the module for version 3. It's a lot of work and, unfortunately, I am in the process of moving house so unlikely to get a huge amount of time to do it. Sadly, I don't think that will help this scenario as I don't believe the module is doing anything wrong or causing the crash, this appears to be a TLC bug.

All I can suggest is maybe try the latest beta version of TLC. 9.3.4 BETA. It was released on 10 November, and the comment says 'external con improvement - New method to manage external messages.' That sounds quite interesting and possibly relevant - see this comment. Failing that try running TLC on a separate machine (and potentially a different OS).

I have asked support if there are any material changes in the new beta here.

tacvideo commented 1 year ago

Thanks for your reply, I agree on all points. Not sure I mentioned this before but when it was crashing frequently I noticed that when I disabled the instance/connection in Companion the crashing stopped, however that was before I cleaned up the pages/buttons and turned off the polling etc...

I did already change the TLC remote password today and was thinking of changing the port but got distracted by a staff meeting, but if it is nefarious activity hitting the machine they most likely have a port scanner to pickup which ports to hit, so this seems less effective. I do wish Mac's network firewall was more configurable like Windows, but even though I'm a MCSE I prefer MacOS for this environment, the problem is I have been unable to replicate the crashing!

Anyway, that TLC Beta news is intriguing, I shall give that a try. I did post the problem to that same board a couple hours after your post, hopefully we get a response to either soon.

tacvideo commented 1 year ago

@thargy how do I use that Beta, do I just copy it to the same location as the regular App and launch it?

thargy commented 1 year ago

@thargy how do I use that Beta, do I just copy it to the same location as the regular App and launch it?

Probably. I suspect it's similar to these instructions.

Just make sure you've installed the latest full release first 👍🏻.

Also, it is a BETA meaning it's possible it has bugs, and I can't attest to its effectiveness. However, the response to my query means it's worth a try. Make sure you leave yourself time to test and revert the version if you have problems.

tacvideo commented 1 year ago

BTW they told me to try the new Beta as well... https://thelightingcontroller.com/viewtopic.php?f=47&t=7579

tacvideo commented 1 year ago

FYI 9.3.4 BETA is not stable, crashes more frequently.