Ethos performance issues by using bigger model templates together with several luas or reading several sources in one lua

Hi,

Unfortunately this will be a long text; I hope it's not too boring.

Short version: an extensive model configuration produces such a processor load that UI "feedback", audio and lua widgets are significantly compromised. I assume that the high-priority "output calculation loops" cost so many processor steps that lower-priority tasks are slowed down very much.

regarding lua widgets, my tests showed that especially the source..:value() method consumes a lot of runtime, maybe there is room for optimization.

All in all, I found that running several widgets and source scripts is not the best idea in such an environment. The "performance scaling" of Ethos with the number of lua scripts is not linear, so it was better to put all "tasks" into one big script.

At least it would be good if:

- the source..:value() method could be optimized
- it were reviewed whether the "output calculation loops" within Ethos really need such a high priority. Is there really a differentiation between RF running in "standard" mode (which could accept a "slowed down" loop cycle time) and "fast" mode for racing, which needs a 4 ms update rate?
- this became an argument in the long run (maybe X24 development) to think about some even more potent hardware. However, I'm afraid this would mean a significant source code redesign, whether for dual-core STM32H7s, so one could perform output calculations on a dedicated core, or (probably even more work) to support H7 successor MCUs
- "situational awareness" were sharpened: it's my impression that further implementation of the sheer number of tiny "bells & whistles" issues will significantly lower the Ethos performance limits

thanks & regards Udo

Let's look at some measurements: attached you'll find two screenshots which indicate the performance of a simple lua widget on an extensive model template. The lua widget reads 8 values. In "simulating" mode, it displays the values from predefined variables (which is quick enough). In "no sim" mode, it reads analog values from inputs.

The cycle time (time between the calls of the wakeup handler) is averaged over 10 loops and displayed. This time corresponds to the fastest possible log write time !
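The averaging described above could be sketched roughly as follows. This is not the widget's actual code: `newCycleTimer` and the injected `now` argument are hypothetical names chosen so the sketch runs in plain Lua, and it assumes that on the radio `os.clock()` returns a time in seconds.

```lua
-- Hypothetical helper: average of the time between wakeup() calls,
-- recomputed after every WINDOW intervals.
-- `now` is injected so the sketch runs in plain Lua; on the radio one would
-- pass os.clock() (assumed here to return seconds).
local WINDOW = 10            -- number of loops to average over

local function newCycleTimer()
    local last = nil         -- timestamp of the previous call
    local sum, count = 0, 0  -- accumulated interval time and sample count
    local average = 0        -- last completed average, in seconds

    return function(now)
        if last then
            sum = sum + (now - last)
            count = count + 1
            if count == WINDOW then
                average = sum / WINDOW
                sum, count = 0, 0
            end
        end
        last = now
        return average
    end
end

-- Usage: simulate 11 wakeup calls that are 0.12 s apart (10 intervals).
local calcLoopTime = newCycleTimer()
local avg
for i = 0, 10 do
    avg = calcLoopTime(i * 0.12)
end
print(string.format("average cycle time: %.0f ms", avg * 1000))
```

Averaging over a fixed block of 10 intervals keeps the displayed number stable, at the cost of only updating it every tenth loop.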

[Screenshot: simulated valb]

[Screenshot: srcReadb]

You see an increase of approx. 120 ms from reading 8 source values. In the "real world" I have a widget which reads about 17 values (individual telemetry display and some status info), so I'm getting about 400 ms loop cycle time. At the start of my "research" I had split this widget into several ones and got about 800 ms! What worries me is the fact that nearly the same functionality was reached under oTx without performance issues.

Even when the rest of lua coding would've been optimized to the end, this "reading" contradicts everything else... So it seems to be a good point to optimize the source:value() method if possible.

performance.zip

and now the very long version, for these with deep interest:

During the last months I created a sort of "universal" glider template which should cover my needs for 90% of my models (gliders with up to 6 flaps, from sports models up to scale gliders). I use the template together with several lua scripts & widgets, so that I got the same and some more functionality than I had with my former system. All in all my experience is that I gained 20% more functionality but lost 40% performance, despite the higher-grade MCU. This is annoying, especially the slow UI, stuttering audio and very slow log write times (approx. 1 s, even with 200 ms configured).

I spent a lot of time to

- determine the root cause(s)
- optimize / shrink the template
- "concentrate" the lua widgets

In the end I was able to get significantly more performance, but didn't reach the "look & feel" I had before. There are mainly 2 causes:

- reading source values (telemetry / LSWs / analogs ...) within lua seems to be very resource intensive
- it seems that the rising complexity / functionality in Ethos results in problematic performance "scaling" when you use higher numbers of mixers, LSWs and SFs, which is necessary due to the "stacked mixer" agenda

The performance issue occurs in use cases where someone combines a "heavily loaded" model config with several lua scripts/widgets.

Some more detailed infos:

(I)

During development of some luas I noticed performance issues (stuttering sounds, slow GUI, slow log recording, slow touch response...). Further investigation showed three main factors; together they resulted in an extremely slow environment:

- a so-called "fat model" / template with a huge number of mixers, LSWs and SFs
- the number of telemetry / source value readouts from a lua widget
- the overall number of luas (two background/source luas and three widgets)

The most essential element is the size of the model template; this "defines" the overall behavior. The number of mixer lines, but also the number of LSWs and SFs, is important. The template was built with the idea of a "one for all" template, up to a 6-flap scale glider, supporting several functionalities the canned mixers won't give, and so, following the "stacked mixer" idea, a fat template was built.

I was able to reduce the number of mixers by waiving some "convenience". Together with condensing all widgets/scripts into one big lua, I was able to improve performance by approx. 25%; nevertheless it's not really satisfying.

Further investigation showed that almost 50% more "speed" could be reached when I simulated telemetry and other sources (LSWs, analogs, channels..) by a simple variable "readout" instead of calling the source class to read out real-world values. This would be my 1st-priority issue: if possible, please boost the source...:value() method.

(II) My personal interpretation: my guess is that the higher complexity / user-requested functionality of canned & free mixers, LSWs and SFs produces much more CPU load than under oTx. Despite the better chipset, the (maybe exponentially) higher number of "steps" to evaluate mixer loops etc. significantly reduces the overall performance once you have developed a more-than-usual template. Maybe, in the long run, if more and more "tiny functionality requests" are implemented, this scenario will reach standard model templates too. I don't think brute-force hardware updates could be a solution (they would be fine for customers, but a dual-core MCU or new-gen ARMs will cost a LOT of implementation work, I guess).

(III) Regarding the comparison to oTx: in a comparison with an oTx template of similar functionality,

the oTx template showed no performance issues, but, of course, would only fulfill approx. 90% of the Ethos functionality

main differences are

- Ethos / Tandem TXs support an internal gyro, which is used for some announcements and so on, which needs more LSWs and SFs
- Ethos supports many more channels, so the idea arose to "reserve" ch1-16 as a "patch field" and give the user easy freedom of channel assignment even with ACCST
- oTx supports only 9 GVARS; the current Ethos "VAR" mix variant is used to establish a group of x vars for easy trimming of setup values
- there is no dedicated "Input" handling in Ethos; to establish a "single point of truth" philosophy for inputs which are distributed over several mixers and to minimize mistakes in configs, some dedicated mixer lines are used as "centralized" inputs

(IV) Bad scaling with the number of luas: with a very complex model setting, the number of lua scripts/widgets scales badly and produces slow (very slow) cycle times of script executions / logging rate. Even 2..3 tiny scripts can produce situations where widgets are executed only about once per second; this gives a bad UI feeling if you implemented touch events in your scripts etc. One big script, whose runtime is a multiple of the summed runtime of the tiny scripts, shows much better cycle time (so I'm speaking of bad scaling). Even logs are written very slowly.

One question comes to mind: does Ethos support different "handling" of channel value evaluation when in race mode to fulfill the tighter timings, or does it always use the same "prioritization", so that these short latencies would in theory always be supported, but not needed by the RF part when not in race mode? This could lead to "overprovisioning" in channel value calculations when in standard mode, which would in turn lead to slower servicing of other functionality like UI / lua etc.

(a) Measurements: To get the "big picture" of the outcome of different "loads", i built something like a two dimensional "load matrix"

On the y-axis I have four model configs of different complexity: (1) one extremely heavily loaded, (2) the same without SFs, (3) the same as (1) but mixers only, (4) one absolutely "lean" without any mixers, SFs or LSWs.

These are tested against six different lua settings on the x-axis: a) two source scripts & two widgets (one small & one "heavy"), b) like a) without source scripts, c) two "small" widgets, d) one "xtreme" widget

measured are averaged cycle times between two executions of the same widget.

[Image: widgetScaling]

(b) interpretation of measurements:

- Regarding scripts: it is mainly the number (and not the "single load") of luas that causes bad cycle times. Even a few tiny luas can slow down the system (in the sense of UI response time, speech quality, log writing..) a lot more than a single one which produces several times more "load" than the small ones in sum. This effect occurs together with "demanding" model templates.
- Regarding model templates: NOT the complexity/number of mixers is the main driver; it seems there is no "real main driver" for the load. Mixers, LSWs and SFs seem to cause CPU load in the same manner!

my conclusions: (at least for me)

I'll try to "optimize" without compromising my agenda (flexibility, one for all, using a standardized template without too much rework such as deleting unnecessary items or emulating input mixers):
- by combining my several widgets into a "big, fat one",
- and by trying to free up enough resources through deleting unused LSWs and SFs after a specific model config, in the hope that I'll get a smooth UI.
Only as the last step would I "cut down" mixers and go for "lean code"; let's see if it works.

Ethos offers a huge max number of Mixers / LSW'S /SF'S etc.. of course, every single one costs hw-ressources

These days the "typical" model config won't let an average user run into bottleneck situations, because he won't exceed these "limit numbers". But even without letting the number of mixers etc. grow, there are a lot of (I'll call them "specialized") open issues whose implementation will slow down the evaluation of every single mixer / LSW / SF.

In the future there may be the risk that the "overall" number of mixers the system is able to handle without performance issue will decrease over the time / over Ethos development.

Especially in case you want to run several scripts, you can run into this situation.
So (my point of view) one should be aware that every added functionality will cost performance on the long way.

Nevertheless, Ethos will surely be extended over the next years. Will the HW be capable of fulfilling the demands of the software?

As far as i know the TX's are based on stm H750 MCUs, so there is not much headroom for single core updates (maybe 12 percent due to higher clock speed)

Are dual-core MCUs an option (dividing prioritized & secondary tasks between two cores)? I'm afraid a big code rework would be the consequence, and in the worst case it ends up as two code bases to be maintained.

STM's marketing says that the M85-based N6 MCUs are the upcoming successors of the H7 series. They show better DMIPS / CoreMark performance, but these are completely new "beasts" with AI capabilities (which boost performance up to x00% vs. H7 in these applications), and code compatibility is an open question.

Maybe an idea for X24 Hardware.

You create a new Source each time you call this method, this is far from optimized!

televal     = system.getSource({category=CATGY, name=TeleName}):value()

Thanks for the hint, will make new measurements without generating new sources in every loop
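A minimal sketch of that change, assuming only the `system.getSource(...)` and `:value()` calls shown above: each source is created on first use and then reused on every later loop. The `sources` cache, `readValue`, the placeholder `CATGY` value, the sensor names and the stand-in `system` table are hypothetical and only there so the sketch also runs in plain Lua.

```lua
-- Stand-in for the Ethos `system` table so the sketch also runs in plain Lua.
-- On the radio the real global `system` is picked up instead.
local created = 0
local system = system or {
    getSource = function(spec)
        created = created + 1           -- count how often a source is built
        return { value = function(self) return 42 end }
    end,
}

local CATGY = 1      -- placeholder; the real category constant comes from Ethos
local sources = {}   -- cache: source name -> source object

-- Look the source up once, then only call :value() on later loops.
local function readValue(name)
    local src = sources[name]
    if src == nil then
        src = system.getSource({category = CATGY, name = name})
        sources[name] = src
    end
    if src == nil then
        return nil                      -- source not available
    end
    return src:value()
end

-- Usage: 100 simulated loops reading two (hypothetical) sensor names.
for _ = 1, 100 do
    readValue("RSSI")
    readValue("Altitude")
end
print("sources created:", created)
```

With the stand-in, the counter shows that only one source per name is created no matter how many loops run, which is the point of the hint.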

By calling lcd.invalidate() there, you force the LCD to be refreshed at each loop. This is a very big optimization compared to OpenTX that you are killing by doing so!

local function wakeup(widget)
    widget.cycletme = calcLoopTime()
    lcd.invalidate()
end

Evaluation of LSW, FSW, Mixes is done within the "mixer" task. It has a higher priority and it costs almost nothing compared to what is done in the "main" task. This is true even if you have a lot of mixes running in the same time.

Refresh the LCD at each cycle (without checking if it is needed), create / destroy sources, write logs on SD, dump screenshots, etc. will take the time of the "main" task and if your scripts are not optimized, they will completely kill the performances of the radio.

Hello Bertrand,

thank you very much for your quick and valuable support, I appreciate it very much!

(1) Source...:value(): I changed my test widget so that each source is created once. The performance is now the same as in "simulated" mode, voila:

[Screenshot: opti sml]

So I'm going to apply this to my main widget, I think it will give a significant boost because I usually read a lot of source values per loop.

(2) lcd.invalidate()

Here I made several assumptions, and I guess you will convince me otherwise.

So, from the point of view of a hobbyist/autodidact "programmer":

a) in a "typical" use case I read at least 17 source values per loop. the "cycle time" between two loops is about 700ms by now, so about half of the values have changed between two calls. i would estimate that i need to update 25% of the display at each loop, so i chose to update the whole screen at each loop.

My (probably wrong) thought was that the time to estimate which area to update might take longer than the extra time to update the whole screen.

b) as far as i understand, lcd.invalidate("area") "triggers" the paint handler almost immediately and updates only the specified area. i don't know exactly how it works, my guess is that ALL "display" methods are executed in the paint handler and will fill a "fast" cache memory, only the "marked" memory area is pushed into the graphics controller using a "slower" bus. this way the data volume is massively shrunk >> more performance.

This would be a very good optimization if you only need to update one coherent area. In my use case, several SCATTERED areas need to be updated in a loop, and therefore I feel that the invalidate method with a limited area would not fit my application.

c) see two screenshots from my current "lua project". It's a full screen widget, supporting several different topbars

userInp sml

The user can select multiple apps in a widget that supports two "main areas", each area can hold up to 3 freely selectable apps.(i call them sub-widgets) Which "apps" should be used (max three in a "frame") can be configured individually. Some "apps" can offer multiple pages, like the "telemetry app" the selection (next/previous page; next/previous app) is done via touch events

As far as I know, the handlers (paint / wakeup / event) are "independent" routines that can be called with different frequency from Ethos OS. Unfortunately, I never checked if there is a fixed order....

I never found more detailed information about dependencies or "rules" between the different handlers. My guess was that I would run into persistence problems if the wakeup handler determined a "refresh range" and called the paint handler, but before the paint handler started updating the estimated refresh range, the event handler detects a user input and triggers a change of the active application/page, so the display handler tries to paint a completely different "application" but only refreshes part of that area.... (hope you understand what I mean)

all in all i came to the conclusion that a full refresh every half second or so would be the way to avoid all these possible problems.

I'm sure you can tell me where I went wrong and how I can do better (-;

My main "problem" right now is how to interpret your comment in the right way: "Refresh the LCD at each cycle ... create / destroy sources, write logs ... they will completely kill the performances of the radio."

Right now I initiate the complete refresh every 700 ms, which is the "settled" cycle time due to the fat template in combination with the insufficient coding.

Let's say there are only 6..7 changed areas (two timers, VFR, RSSI, altitude, GPS, distance). This would result in 7 lcd.invalidate(area) commands over a "scattered" field. This will trigger 7 paint handler calls in a row, each of which executes all "paint" methods to fill a complete screen but updates only the requested area, and in sum it will still be much more performant than triggering the complete update every half second.

Right? Would this be the "blueprint" for how to deal with display refreshes in order to use as few resources as possible?

It would mean a major redesign of my widget, but it may be worth it.

I would split the code into 2 parts:

1) wakeup()

retrieve all variable parameters you need to draw, such as Consumption, Altitude, etc.
compare them with the latest parameters; if they differ, call the invalidate() function for the smallest possible part. Ethos keeps the whole LCD in RAM and doesn't refresh everything, which is a huge optimization compared to OpenTX. It refreshes the smallest rectangle that encloses all rectangles which need to be refreshed.
store those variable parameters

2) paint()

don't check everything again, just draw the latest parameters stored

Also it's possible to do the job described above in wakeup() only every 200ms or so, it should be enough!
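The split described above might look roughly like this. It is a sketch under assumptions, not a reference implementation: it assumes `lcd.invalidate` accepts an area as `(x, y, w, h)` and that `lcd.drawText(x, y, text)` is available in paint(); the `fields` layout, the sensor names and the extra `now` / `read` arguments to wakeup() are hypothetical and exist only so the sketch can be exercised in plain Lua with stand-ins for `lcd`.

```lua
-- Stand-ins so the sketch runs in plain Lua; on the radio the real `lcd` is used.
local invalidated, drawn = {}, {}
local lcd = lcd or {
    invalidate = function(x, y, w, h) invalidated[#invalidated + 1] = {x, y, w, h} end,
    drawText = function(x, y, text) drawn[#drawn + 1] = text end,
}

local INTERVAL = 0.2  -- seconds between checks ("every 200ms or so")

-- Hypothetical layout: one rectangle per displayed value.
local fields = {
    { name = "Altitude", x = 10, y = 10, w = 120, h = 30 },
    { name = "RSSI",     x = 10, y = 50, w = 120, h = 30 },
}

-- 1) wakeup(): read, compare, invalidate only what changed, store.
-- `now` defaults to os.clock(); `read` is a hypothetical function that
-- returns the current value for a name (e.g. from cached sources).
local function wakeup(widget, now, read)
    now = now or os.clock()
    if widget.lastCheck and now - widget.lastCheck < INTERVAL then
        return                              -- too early, skip this loop
    end
    widget.lastCheck = now
    for _, f in ipairs(fields) do
        local v = read(f.name)
        if v ~= widget.values[f.name] then
            widget.values[f.name] = v       -- store the latest parameter
            lcd.invalidate(f.x, f.y, f.w, f.h)
        end
    end
end

-- 2) paint(): no checks, just draw the stored values.
local function paint(widget)
    for _, f in ipairs(fields) do
        lcd.drawText(f.x, f.y, tostring(widget.values[f.name]))
    end
end

-- Usage with simulated data and timestamps.
local widget = { values = {} }
local data = { Altitude = 100, RSSI = 80 }
local function read(name) return data[name] end

wakeup(widget, 0.0, read)   -- first pass: both fields are new
data.Altitude = 101
wakeup(widget, 0.1, read)   -- skipped: less than 200 ms since the last check
wakeup(widget, 0.3, read)   -- only Altitude changed
paint(widget)
print("invalidate calls:", #invalidated)
```

Keeping all reads and comparisons in wakeup() means paint() stays cheap even if Ethos calls it several times for scattered areas.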

700ms is very slow, it should not take that long! Take care not to use the SD too much.

FrSkyRC / ETHOS-Feedback-Community

Ethos performance issues by using bigger model templates together with several luas or reading several sources in one lua #2830