Closed: jpkelly closed this issue 5 years ago.
This is right around the corner, but for now it will require that you clone the repository and follow the developer guide, rather than starting it via the normal prebuilt packages.
Did you make this, @haakonnessjoen?
There is an alpha version in the source repo: headless.js. But currently it isn't that straightforward for the end user, since you would need to install Node and run the update scripts before you can use it. So I think we need to do something clever, and do more testing, before we close this.
What would be the easiest way to import the configuration for the headless mode instance?
On a Raspberry Pi? Try cp ~/.config/companion/db ~/companion/ if I remember correctly.
The most user-friendly way would be to start it normally, export the full config as a file, and then run headless and import it again using the GUI.
Getting the headless mode to run requires:
After all that I got headless.js running. With OSC data updates coming in every 30 ms from clock-8001, the update rate on buttons with variables lags 1-4 seconds behind, and after about 1.5 minutes something just gives up and the Stream Deck goes all black.
In the all-black state, pressing a button on the Stream Deck causes it to return to the logo display after a small delay.
The Companion code continues to run and still prints OSC updates and button graphics generation messages to the console, but everything is just dead on the Stream Deck side.
Deleting all but the depili-clock-8001 module seems to improve the stability considerably: now at over 6 minutes and counting, whereas before it never stayed stable for 2 minutes. The updates still lag behind and arrive in clumps, with the display often skipping some of the per-second updates.
People have reported that installing the Raspbian Lite image makes USB stable with the Stream Deck using headless.js. Not sure about latency though.
It seems that the memory margins are really tight on the Pi. After the module deletion, running only one module and 3 pages of buttons, two of which update from variables (one showing seconds and one minutes), the free memory on the Pi hovers at around 65-70 MB running Raspbian Lite.
Edit: the Stream Deck updates stopped at 20 minutes instead of 2 minutes, but it is still not usable.
I don't like that when the updates cease and the Stream Deck goes black, no error whatsoever gets reported on the console, nor does Companion exit (preferably with a non-zero exit status), so any kind of stability monitoring or restart logic is impossible to implement.
I have been using Raspbian Lite from the start.
I was having problems with the StreamDeck going black/disconnecting. It could be reconnected by rescanning USB from the surfaces tab.
I believe this was due to an under-rated power supply for the Pi. The 3B+ draws more power than previous models, and added to the power draw of the Stream Deck, it is possible to overload smaller PSUs.
Check /var/log/kern.log for messages like kernel: [ 1612.002574] Under-voltage detected! (0x000d0005)
Apparently the stability problem was indeed power related; one of my trusty Anker cables seems to be failing.
So I will integrate a decent 5 V 6 A power supply into an enclosure with the Stream Deck and the Pi, and feed the power in via the GPIO pins.
The update rate of the Stream Deck buttons still leaves something to be desired. The RPi CPU is barely doing anything, so the bottleneck is somewhere else; the 1-minute load average is at most 0.52.
Memory usage also seems to be under control with only one Companion module included in the tree, leaving most of the memory free on Raspbian Lite. Free memory seems to average about 570 MB.
Testing with a 6 A 5 V power supply and a 4700 µF smoothing capacitor actually yields worse results: the Stream Deck goes away in under a minute :/ No under-voltage alerts in the log and no dmesg messages about USB devices disconnecting or re-enumerating. Weird...
I've had good success so far, switching from a 1A to 2A USB power supply. Haven't done any extensive soak testing yet, but it seems to have fixed the problems for me.
I will need to bring my laptop next time to troubleshoot further and verify that the unit works with the MacBook; I'm starting to think there might be a fault in the cable or something like that.
I tried feeding 5.2 V to the 5 V rail and the Stream Deck is still unstable. The web interface works just fine, and rescanning re-acquires the Stream Deck until it disappears again. (No kernel messages about renegotiating the USB link or anything like that.)
Continued testing with the latest head (9a404b03e3a15ea52c00c706511b30ba94916d4e): button updates are steady, but the Stream Deck still disconnects from Companion after a short while. I currently have two different Stream Decks to test with and both show the same behaviour.
Running with a Raspberry Pi 3B+ and Stretch Lite. Next up: testing with an RPi 3B.
And the same results with the RPi 3B :/
Depili - I'm trying this too, albeit using Irisdown Countdown Timer 2, which only provides feedback every second. My circumstances for disconnection are as follows:
Displaying a page of buttons with feedback variables in their text - in this case the countdown time remaining
I'm using a Pi 3B (non-plus). I get the following when the Stream Deck goes blank:
lib/device unloading for 0001:0014:00 +2m
lib/action adding action bitfocus-companion:instance_control +2m
lib/action adding action bitfocus-companion:set_page +1ms
lib/action adding action bitfocus-companion:inc_page +1ms
lib/action adding action bitfocus-companion:dec_page +0ms
lib/action adding action bitfocus-companion:button_pressrelease +1ms
lib/action adding action bitfocus-companion:button_press +0ms
lib/action adding action bitfocus-companion:button_release +1ms
lib/action adding action bitfocus-companion:panic +6ms
lib/usb/elgato elgato.prototype.clearDeck() +2m
Rescanning in the control panel finds it again.
Is this what you see? The same config, running on the same bleeding-edge build on Windows, does not exhibit this bug.
Pretty similar, though I don't have the exact logs. My gut tells me this might be a timing issue, where the graphics generation takes enough time that it blocks some USB transactions, which then end up timing out.
This is plausible since Node.js executes code in one event loop per process, and while one event is being processed all others are blocked until it finishes. The Pi being quite a bit slower means the event loop is blocked for longer periods of time, so the likelihood of timing collisions increases greatly.
The Companion code itself seems to be smart enough to only update the graphics when they actually change, not on each feedback update (if the data stays the same), which helps since my current implementation sends the OSC feedback on every frame.
I will need to test this with a setup that has no feedback at all, and thus no button graphics regeneration, and see if it alters the stability. I should also have access to a more powerful Banana Pi board soonish to test with.
If it is indeed caused by event loop timing issues, I don't know what could be done other than trying to separate the graphics generation and USB handling into different Node.js processes. I haven't even looked into the Companion codebase to find out what kind of multiprocessing, if any, it already does.
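To illustrate the event-loop concern described above, here is a minimal standalone Node.js sketch (not Companion code) showing how a long synchronous task delays a timer that should fire every 30 ms:

// block-demo.js: a timer that should fire every 30 ms gets delayed
// whenever a long synchronous task occupies the single event loop.
let last = Date.now();

setInterval(() => {
	const now = Date.now();
	console.log('timer fired after ' + (now - last) + ' ms'); // ~30 ms normally, much more under load
	last = now;
}, 30);

// Simulate synchronous work (e.g. image generation) blocking the loop.
setInterval(() => {
	const end = Date.now() + 500;
	while (Date.now() < end) { /* busy-wait for 500 ms */ }
}, 2000);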
With regards to your further testing comments: if the Stream Deck is showing a page without feedback (i.e. buttons are on Page 1, Stream Deck is showing Page 2) then it's stable. Likewise if the same buttons are shown without making use of the feedback text. (I had 3 buttons showing remaining hours, minutes and seconds of my countdown timer.) I also feel this is almost a separate issue and might want splitting off from #313.
I'm having the button blackout issue too. Raspberry Pi 3B+, fresh install of Lite, latest git head, official wall-wart PSU.
@canoemoose I agree with starting a new issue for this specific problem.
If I get time, I will get the latest commit installed on my RPi, and get to testing.
For now, I'll leave this here:
I have encountered this problem. RPi 3B+ running Debian Stretch.
It seems to be an issue with the redrawing.
If you SSH into it and run top, then spam the page change and action buttons, you will see a node process start eating up resources, then get killed when the kernel (I think) is getting resource-starved.
Companion is still running, and can still be accessed via the webUI.
I haven't been able to find the correct log/command to find WHY the process was killed.
And I haven't been able to trace it back to the actual JS files that are run in that process.
I'm not brilliant with Linux!
It seems that redrawing the Stream Deck takes a lot of CPU, or it is really inefficient on the RPi architecture.
Luckily it is in a different thread, so it doesn't tank the whole program when it is killed.
There might be something that can be configured in the OS to stop it from killing the process - again, I don't know Linux that well.
Or there might be a better way to redraw the streamdeck. But I don't know enough about this codebase to even start suggesting possibilities!
But when I get time, I will poke around and try to reproduce it in a way that I can then trace the issue.
Actually, it isn't the OS killing the process (you would see a log entry, and the Pi isn't running out of resources) but the fact that the USB communication and picture generation all run inside the single-threaded Node.js event loop.
If the image generation takes too much time (as happens regularly on the RPi), the USB communication handled by the same event loop times out and the connection to the Stream Deck drops at the level of the Node.js driver; the kernel itself doesn't see the Stream Deck dropping.
The real solution would be rewriting Companion so that picture generation, and preferably the Stream Deck handling, run in their own distinct processes and communicate over sockets. The current setup can't take advantage of more than one of the four cores an RPi has.
A band-aid solution might be replacing the pngjs library used by Companion with something that can use external threads or native utilities for the image generation, as doing heavy computation inside the Node.js event loop is just asking for problems. This would probably cause great headaches for cross-platform compatibility unless you package utilities like ImageMagick with the distribution binaries of Companion.
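As an illustration of the "move the heavy work off the main loop" idea, here is a minimal, hypothetical sketch using Node's built-in worker_threads module; this is not how Companion is actually structured, just a demonstration of offloading a CPU-heavy render step so the event loop stays responsive:

// offload-render.js: single-file worker_threads sketch (hypothetical, not Companion code).
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
	// Main thread: spawn a worker from this same file and hand it a render job.
	const worker = new Worker(__filename);
	worker.on('message', (image) => {
		// The rendered buffer arrives here; the event loop was never blocked.
		console.log('rendered', image.length, 'bytes');
		worker.terminate();
	});
	worker.postMessage({ key: 0, text: '00:42' });
} else {
	// Worker thread: do the CPU-heavy "rendering" (a stand-in busy loop) here.
	parentPort.on('message', (job) => {
		const end = Date.now() + 200;
		while (Date.now() < end) { /* simulate expensive image generation */ }
		parentPort.postMessage(Buffer.alloc(72 * 72 * 3)); // 72x72 RGB stand-in
	});
}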
So, perhaps the images could be generated and cached when a button is created?
Or - by image generation - do you mean placing the images in the correct place for each button?
@depili where are you getting your information from? For starters, each USB device is forked into a separate process. It is not all done in the same run loop. Next, nothing in the USB part is timing out upon sending data; there is no timeout. The deck can however time out and reboot if you push a button and don't read it from the device within a few seconds, but the HID driver in Node is constantly polling for key press data. As for image creation: images are only generated when they change. All buttons are cached at all times. Changing a page on a deck simply transfers cached image data from RAM (via IPC and then USB) to the device.
I have yet to find the real reason some Stream Decks disconnect from the Raspberry Pi. But as far as I know, this only happens on the RPi, and not on other similar embedded platforms (again: as far as I know).
But @depili, please keep a clear line between facts and suspicions (for lack of better words).
@haakonnessjoen you mention that this doesn't seem to be an issue, that you know of, with other embedded platforms. Do you have one in particular that performed well? I'm trying to create a fully-enclosed "shot box" for my church that we can use in our smaller spaces rather than having to buy $$$$ control equipment (i.e. ATEM consoles). I picked up a Pi because I'm familiar with the platform, but if there's another similar platform that you know works, I'd love to give it a try.
The Banana Pi M3 works somewhat, but that board has really bad WiFi and SD-card performance and only supports one version of Ubuntu with the Chinese kernel...
I'm basing my timeout hypothesis on the fact that the disconnects on the RPi only happen with buttons that have feedback active and update their graphics without user interaction, and that the deck disappears after a fairly consistent time without any user interaction or power issues. Also, all of the work is done by just one of the Pi's CPU cores, there is no memory pressure, and there are no voltage issues even when monitoring with a scope, so the issue really can't be anything other than something timing-related between the feedback-driven image generation and the USB communication.
The Pi is unique among all of the clones in that it has really bad memory I/O bandwidth, and most of the clones also have faster single-core execution.
The issue is really consistent to trigger with just one button that has a feedback variable updating it once per second.
All of the work is not done by one of the cores. That is just false.
@jarodwsams, I haven't had much time to test different devices, but Intel Sticks seem to work pretty well. Just make sure you have either 64-bit Windows or Linux installed on it. I think a lot of them come preinstalled with 32-bit Windows or something(?).
We are working on a program written in C (for speed) for the RPi (and other embedded devices), where the Pi only pushes USB image data to the device, sort of acting like a USB extender over the network for Companion. But we have to get Companion 2.0 out, and a bunch of other stuff too, before we can release this. It also needs to be upgraded to support the XL, since this project started a while ago.
If you're referring to Companion Satellite, Viker was telling me about that on Slack yesterday and I got very excited.
Looking further into the RPi issue: when the disconnect happens, something triggers the elgato_dm to remove the device, but the trigger doesn't come from the USB process error handler. And at the point where the removal is triggered, the communication to the Stream Deck still works, at least enough for the clear_deck command to succeed.
After hacking around it by commenting out the remove request handler binding in elgato_dm.js, the Stream Deck seems to be stable on the RPi. Additional grepping and hunting will be necessary to understand what triggers the removal request and why.
Great stuff! Sounds like you are getting close to figuring out the problem; it also sounds like I will be really «red-faced» when/if you find out, based on your current findings 😅
This is great work! I don't think I have anything useful here. Just trying to figure out how I can help!
If you are saying that the removal request is not coming from the USB handling process ( https://github.com/bitfocus/companion/blob/1929553e6447cd038f2c9329ad802481a61de3d5/lib/usb/elgato.js#L77 ), then I think it has to come from the draw function ( https://github.com/bitfocus/companion/blob/1929553e6447cd038f2c9329ad802481a61de3d5/lib/usb/elgato.js#L140 ).
Considering that this only wraps the draw in a try/catch, without any error logging, it's probably worth putting an error report in there and seeing what the fillImage method ( https://github.com/bitfocus/node-elgato-stream-deck/blob/25c9b17bc257c61580e27de5f977794850e7ff05/index.js#L180 ) is doing.
Unless I am misunderstanding how the error handling works!
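A minimal sketch of what that extra logging might look like, assuming the draw wrapper in lib/usb/elgato.js keeps roughly the shape shown later in this thread (simplified here):

elgato.prototype.draw = function (key, buffer) {
	var self = this;
	buffer = self.handleBuffer(buffer);
	try {
		self.streamDeck.fillImage(self.reverseButton(key), buffer);
	} catch (e) {
		// Log the exception before removing the device, so we can see whether
		// fillImage is throwing a recoverable error or a fatal one.
		self.log('Error drawing key ' + key + ':', e);
		self.system.emit('elgatodm_remove_device', self.devicepath);
		return false;
	}
	return true;
};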
@depili What line(s) did you comment out in elgato_dm.js? I'd like to replicate your band-aid.
Digging deeper, the trigger for the disconnect is an exception while drawing a tile on the deck: a non-fatal exception that is simply treated, without any log message, as a reason to disconnect the deck.
The exception trace on a write error looks like this:
at StreamDeck.write (/home/pi/src/companion/node_modules/elgato-stream-deck-clean/index.js:141:22)
at StreamDeck._writePage1 (/home/pi/src/companion/node_modules/elgato-stream-deck-clean/index.js:264:15)
at StreamDeck.fillImage (/home/pi/src/companion/node_modules/elgato-stream-deck-clean/index.js:202:8)
at elgato.draw (/home/pi/src/companion/lib/usb/elgato.js:140:19)
at process.
Adding simple retry logic to the draw command like this:
elgato.prototype.draw = function(key, buffer) {
	var self = this;
	var retry = true;
	buffer = self.handleBuffer(buffer);
	while (true) {
		try {
			self.streamDeck.fillImage(self.reverseButton(key), buffer);
			return true;
		} catch (e) {
			console.log(e);
			if (retry == false) {
				self.log('Error drawing:', e);
				self.system.emit('elgatodm_remove_device', self.devicepath);
				return false;
			}
			retry = false;
		}
	}
}
Seems to be enough to keep the connection stable. The draw errors are quite rare.
A quick look, adding debug logging to all of the elgato_dm functions, didn't find any timing correlation for the drawing failures. My guess would be that either something pre-installed on the Raspbian image is probing for certain USB HID devices, or something other than elgato_dm is doing USB scans that lock the devices briefly.
The failures seem to be random and quite rare, with the time between failed writes ranging from 3 seconds to 5 minutes, so trying to find what conflicts with the access will be really annoying unless something just jumps out.
The issue not manifesting on the Banana Pi at all, even after tens of hours, leads me to suspect that it is some other package accessing the USB bus.
Perhaps a more elegant solution would look like...
var self = this;
buffer = self.handleBuffer(buffer);

try {
	self.streamDeck.fillImage(self.reverseButton(key), buffer);
	return true;
} catch (e) {
	console.log(e);
}

// failed to redraw, try again
try {
	self.streamDeck.fillImage(self.reverseButton(key), buffer);
	return true;
} catch (e) {
	self.log('Error drawing:', e);
	self.system.emit('elgatodm_remove_device', self.devicepath);
	return false;
}
I would consider it a bit confusing to use a while(true) loop with return statements.
However, I will leave it up to the maintainers to actually code it!
That's awesome work! If this gets merged, @depili, how can I buy you a beer?
Great stuff guys!
It would be nice to see what the error actually is when this happens on the Raspberry Pi, and whether it is an exception that is different from an actual non-recoverable error. That way we could retry the drawing routine if it is in fact just a temporary problem, as it clearly is on the Raspberry Pi at times.
@Towerful Trying your flavor now. I'll be able to test it later this evening. Quick question, because I'm not a coder but can follow instructions and replicate:
Does this last return true; get left alone, or taken out? I'm assuming it gets taken out because it was added explicitly to the first try().
JavaScript unfortunately lacks a traditional retry keyword for try-catch, and the while loop is the "normal" way of implementing N retries; in this case just one extra retry seems to be enough.
My first guess was the USB HID hotkey daemon the RPi uses (triggerhappy), but that didn't seem to be it. Catching what the conflict is would then need luck with lsof polling, or another hunch about where to look.
The error probably comes from node-hid on the underlying layer, from https://github.com/node-hid/node-hid/blob/6f36b9fd5eae0ac05a4f0066730c33e2eab153fd/src/HID.cc#L182 (or the similar location in the bundled version instead of git head). As such, being certain whether this is a recoverable case is quite hard, but retries should be fast, and with a low number of permitted retries before giving up and erroring out I don't see why it would be a real problem.
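For illustration, the generic "retry up to N times" pattern mentioned above might look like this in JavaScript (drawWithRetries and maxRetries are hypothetical names, not Companion code):

function drawWithRetries(streamDeck, key, buffer, maxRetries) {
	// Attempt the synchronous draw up to maxRetries + 1 times in total.
	for (var attempt = 0; attempt <= maxRetries; attempt++) {
		try {
			streamDeck.fillImage(key, buffer);
			return true;
		} catch (e) {
			// Out of attempts: rethrow so the caller can remove the device.
			if (attempt === maxRetries) {
				throw e;
			}
		}
	}
}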
@jarodwsams, the final return that your cursor is on is irrelevant.
All code paths of the final try/catch statement have a return statement, so that final line will never be evaluated.
No harm in taking it out, no harm in leaving it in.
@depili fair enough regarding the retry count. The majority of the coding I've done has been for personal/internal stuff, so error handling has always been 'report it and abort it'. I've never really played with retrying stuff!
You are definitely getting deeper into linux/node than I can really help with.
But if we are retrying, I think we should at least add a bit of delay, so wait 10 or 20 ms and retry. Otherwise I would think it very probable that it would hit the same error again.
A delay might be reasonable, but it would then block; so far just retrying straight away seems to work. It all depends on the nature of the conflict: a delay might make it work better, or it might harm things, depending on the frequency of the other access...
We pretty much know that the conflict can repeat within 3 seconds (the shortest time between draw failures I have seen), so the conflicting access happens at least that often.
Since the deck is currently updating about once a second (depending on the RPi lag and load), I would say that the other access has a period of less than a second, so in this case having a delay in the retry might be harmful.
I am not talking about a blocking delay. You don't do blocking delays in Node. A setTimeout simply exits the function and runs the callback at a later time, hence not blocking anything but just postponing the last write.
Considering this isn't an async function, perhaps a setTimeout or a nextTick callback to try/retry the draw is in order (or would help alleviate the issue).
Exactly what I was suggesting, @Towerful. And this code runs in a separate process too, so even if you did block something, it would only block that one Stream Deck. But there is no blocking. :)
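A minimal sketch of that non-blocking retry idea, again assuming the same draw wrapper shape as in the snippets above (the single retry and the 20 ms delay are just example values):

elgato.prototype.draw = function (key, buffer) {
	var self = this;
	buffer = self.handleBuffer(buffer);

	var attemptDraw = function (retriesLeft) {
		try {
			self.streamDeck.fillImage(self.reverseButton(key), buffer);
		} catch (e) {
			if (retriesLeft > 0) {
				// Defer the retry with setTimeout so nothing blocks the event loop.
				setTimeout(function () { attemptDraw(retriesLeft - 1); }, 20);
			} else {
				self.log('Error drawing:', e);
				self.system.emit('elgatodm_remove_device', self.devicepath);
			}
		}
	};

	attemptDraw(1);
	// Note: with an asynchronous retry the caller no longer gets a synchronous
	// success/failure result for the final attempt.
	return true;
};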
The issue is that since we don't really know the frequency of the other blocking access (or whether it really is another access and not the node-hid library failing randomly; there have been a few issues like that on that project in the past), we can't know what the proper delay would be.
If the other access happens every second, then any delay of less than a second would be OK; if the other access happens every 20 ms and you use a 20 ms delay, then you increase the risk of the retry hitting the same issue by adding the delay.
The delay also decreases the response time, but considering the sluggish updates on an RPi, that doesn't seem like a concern.
It would require experimentation, but so far retrying once immediately hasn't failed a single time.
Starting Companion on boot of a Raspberry Pi without needing a display connected or logging in to a VNC or SSH session.
Preferably this would be enabled via a systemd service.
Some basic config options exposed on the command line, such as setting the network interface.
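For reference, a minimal sketch of what such a systemd unit might look like; the paths, user, and headless.js arguments below are assumptions for illustration, not documented Companion behaviour:

[Unit]
Description=Bitfocus Companion (headless)
Wants=network-online.target
After=network-online.target

[Service]
# Example values: adjust the user and path to wherever the repository is cloned.
User=pi
WorkingDirectory=/home/pi/companion
# The exact headless.js invocation (and whether it takes an interface argument)
# is an assumption here; check the developer guide for the real command.
ExecStart=/usr/bin/node headless.js eth0
# Restart automatically if Companion ever exits or crashes.
Restart=on-failure

[Install]
WantedBy=multi-user.target

The unit could then be enabled with systemctl enable --now companion.service (a hypothetical unit name).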