
take content-type into account when choosing an upscaling sampling filter #4326

Open totaam opened 1 month ago

totaam commented 1 month ago

Without opengl, when converting yuv video via libyuv: https://github.com/Xpra-org/xpra/blob/a8bb7c1e52ea9faa05f21f9e34e451a9ff8ded37/xpra/codecs/libyuv/converter.pyx#L183-L190 Speed is largely irrelevant client-side, so perhaps just use the same sampling as was used server-side (it should already be exposed?)

With opengl:

Lanchon commented 4 weeks ago

i happen to have a background in signal processing theory, so i could help you with some of these issues, if i first understand the problem setup.

as i understand it:

1. you get a capture on the server (window or desktop)
2. you may want to scale it (is it always the same size or downscaled? can it be upscaled at this stage, and why? can the scaling factor differ between the two axes?)
3. you encode it (steps 2 and 3 may happen together as they may involve color space conversion)
4. transport it
5. decode it on the client
6. resize it
7. display it

which steps have GPU/CPU options? which steps are computed together under each processing unit? (eg, decode and resize)

i suppose windows and desktops are handled differently, as i've seen scaling factors that only apply to desktops.

could you describe the current/planned settings affecting scaling in 2) and those in 6) for windows? same for desktops?

is shadowing handled the same as "regular" desktops?

Without opengl when converting yuv video via libyuv

what is the speed parameter there?

Lanchon commented 4 weeks ago

Without opengl when converting yuv video via libyuv

a bit on filtering...

upscaling

proper upscaling requires at least bilinear filtering. some upscaling factors, such as exact integer factors, seem to not necessarily require bilinear filtering because order 0 (nearest neighbor) filtering actually produces a similar result to bilinear filtering for those factors, IF specific phase values (in space, i mean X and Y) are chosen for the bilinear filter. but the results are not the same, as bilinear involves a decrease in high spatial (again, X and Y) frequencies that order 0 does not provide. but both would reject spatial frequency aliases equally well, which is more important visually.

bilinear filtering really turns into a no-op at 1:1 scaling, IF specific spatial phase values are chosen for the bilinear filter. these phases are reasonable (zero for each axis), so if the bilinear filter is implemented so that the phases tend to zero as the scaling tends to 1 (which is a reasonable thing to do), then the filtering can be omitted for the 1:1 case without affecting the output in any way.

in general, if a smooth continuous appearance across different scaling factors is desired, filters should not be omitted for specific scaling factors (such as integers). this even applies to 1:1, unless phases have been taken into account. i understand that many filter implementations might not allow you to set up phases, but this doesn't make them magically disappear; they are there anyway. btw, manipulation of these phases in graphics is sometimes called subpixel rendering.
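to make the phase point concrete, here is a minimal 1D sketch (not xpra code; the function name and convention are mine) of bilinear resampling with an explicit phase term: at 1:1 with zero phase it degenerates into a plain copy, which is why the filter can be skipped there without changing the output.

```python
import numpy as np

def bilinear_1d(src: np.ndarray, dst_len: int, phase: float = 0.0) -> np.ndarray:
    """Resample src to dst_len samples with 2-tap linear interpolation.
    phase shifts the sampling grid, expressed in source pixels."""
    scale = len(src) / dst_len
    out = np.empty(dst_len, dtype=float)
    for i in range(dst_len):
        x = i * scale + phase                          # source coordinate of output sample i
        x0 = min(max(int(np.floor(x)), 0), len(src) - 1)
        x1 = min(x0 + 1, len(src) - 1)
        frac = x - np.floor(x)
        out[i] = (1.0 - frac) * src[x0] + frac * src[x1]
    return out

src = np.array([10.0, 20.0, 30.0, 40.0])
assert np.allclose(bilinear_1d(src, 4, phase=0.0), src)  # 1:1 + zero phase: a no-op
print(bilinear_1d(src, 8))                               # 2x upscale: interpolated values
```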

downscaling

bilinear is absolutely not enough for proper downscaling. it only works well if the image to be downscaled has low energy in the high spatial frequencies (ie: it is "smooth", out of focus, or you could say it's already "pre-filtered"), and only if the downscaling factor is not big. otherwise strong aliasing occurs.

for example: take an image made of alternating 1 pixel wide black and white lines. this image has large energy in the high frequencies in one spatial direction. now downscale it to 49% using bilinear. what you should see: a 50% gray image (ignoring gamma). what you get: an image slowly cycling from black to white to black again, forming a cycling gradient in the direction perpendicular to the original lines, and with a spatial frequency related to 2% of sampling frequency (ie, 50 pixels).
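a self-contained 1D version of this example (just an illustration, not xpra code): naive 2-tap bilinear sampling of a 0/255 line pattern at 49% does not settle at ~50% gray, it drifts slowly between dark and light.

```python
import numpy as np

src = np.tile([0.0, 255.0], 200)          # alternating 1px black / white "lines"
dst_len = int(len(src) * 0.49)
scale = len(src) / dst_len
out = np.empty(dst_len)
for i in range(dst_len):
    x = i * scale
    x0 = int(x)
    x1 = min(x0 + 1, len(src) - 1)
    frac = x - x0
    out[i] = (1 - frac) * src[x0] + frac * src[x1]   # naive 2-tap bilinear sample

print(out[:10].round())   # ~0, 10, 21, 31, ... slowly climbing towards 255 and back: a beat pattern
print(out.mean())         # the global average is still ~127, but locally it is far from gray
```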

note that computer images and text tend to have high power in the high spatial frequencies, so this problem is exacerbated.

edit: this is why bilinear alone cannot be used for 3D graphics. scale down a texture enough and you get mostly noise. so texture Level-Of-Detail variants are precalculated (ie, smaller prefiltered versions of each texture) and then bilinear is applied to them. but you don't want to abruptly jump from one LOD to the next when getting closer to a texture, so you use trilinear, interpolating between the outputs of 2 bilinear filters on the 2 closest LODs for that scale. (which also doesn't work well because the scaling factors could be widely different in 2 of the axes of a texture projection, so you want anisotropic filtering instead.)

the point being that these shortcuts don't apply to xpra, as the "texture" is constantly changing, so you cannot precalculate LODs. in the extreme, imagine that you want to downscale a desktop to a single pixel. which pixel do you choose? or which 4 pixels do you run a bilinear kernel over? the answer is: you need to average (or weight) all pixels of the desktop to downscale correctly. this is very onerous for calculating a single pixel, yes, but compared to encoding the whole desktop as, say, a jpeg, it is very cheap.

this is all generic, but i can help you find specific solutions when i understand the needs, hence my earlier message.

totaam commented 4 weeks ago

It's always helpful to write things down; I think I know what the problem is.

you may want to scale it (is it always same size or downscale?

At the moment, always downscaling before compression to save CPU and bandwidth. The server-side downscaling can be turned off with xpra --env=XPRA_DOWNSCALE=0 ...

can it be upscaled at this stage, and why?

No.

can the scaling factor differ in both axes?

At the moment, no. The desktop-fullscreen switch (which makes desktop windows fullscreen on each of the client's monitors) may require this to change one day to accommodate fixed-size server windows without using padding around the edges.

  6. resize it
  7. display it

It's a bit more complicated than that, especially when using the desktop-scaling switch and / or with High DPI displays:


I think that the quality loss reported in https://github.com/Xpra-org/xpra/issues/4324#issuecomment-2288029609 is caused by the unnecessary scaling: we upscale to the back buffer, then downscale again to display it. I have created a new ticket for this: #4331. I am not sure when I will get around to it, as xpra is mostly used to upscale things, not downscale them (since downscaling tends to make things unreadable, even with the best scaling algorithms).

Lanchon commented 3 weeks ago

(in my previous msg, where it said 98% it should have said 49%. i've edited.)

side note: downscaling is important for shadowing. you have laptops with ever-increasing DPI ratios that you want to be able to see on older or more mainstream screens. you also have newer 16:10 laptop screens that you might want to view on regular 16:9 displays. or you might have the exact same screen on both sides, but still not want to remote full screen. downscaling only makes text unreadable if it was already at the limit of small size for that screen. that is seldom the case for modern screens.


so ideally you'd scale only once. to control bandwidth you might want to scale twice on certain specific setups (but mostly not), once on each side. scaling more than once on each side should really be avoided.

you can do perfect scaling with math, but it is A BUNCH of processing, and nobody wants to pay that price. it's O((xy)^2); that is, to calculate each pixel you need to visit all pixels. with perfect scaling you can serially scale an image and expect the same result as if the minimal set of 1 or 2 scalings to which the series could be simplified were applied. but in reality we use very imprecise algorithms that under certain conditions work well enough for our eyes, even when we use the "good ones". serially applying them really destroys quality. you really should seek to do exactly one resize for most circumstances.
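a hedged sketch of what that looks like in 1D (names mine, not xpra code): ideal band-limited (sinc) resampling, where every output sample is a weighted sum over all input samples. this is where the quadratic cost comes from, and it is also what removes the aliasing that bilinear lets through.

```python
import numpy as np

def sinc_resample_1d(src: np.ndarray, dst_len: int) -> np.ndarray:
    n = len(src)
    scale = n / dst_len                     # > 1 means downscaling
    cutoff = min(1.0, 1.0 / scale)          # shrink the passband when downscaling
    k = np.arange(n)
    out = np.empty(dst_len)
    for i in range(dst_len):                # every output visits all n inputs: O(n * dst_len)
        x = i * scale
        out[i] = np.sum(src * cutoff * np.sinc(cutoff * (x - k)))
    return out

stripes = np.tile([0.0, 255.0], 50)
print(sinc_resample_1d(stripes, 49).round())  # ~127 almost everywhere (the ends ring a bit)
```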

as i understand it, in the client you are coalescing a bunch of partial screen updates into a buffer the size of the original server window, even if they are transmitted scaled down (they are never transmitted scaled up, as per your description).

i imagine it goes something like this: say the server is scaling down by 50%. it needs to send a 3x3 update, which maps to a 1.5x1.5 pixel image over the wire. you ceil (the sibling of the floor function) those values to 2x2, ask the encoder to scale 3x3 to 2x2 and encode (which effectively makes it use 67% scaling) and send that over the wire. on the client you decode the data to a 2x2 image, scale that to 3x3 (150%, not 200%) and paint it on the backing buffer.
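the arithmetic of that example, as i understand the current behaviour (illustrative only, not actual xpra code):

```python
import math

def wire_size(update_w: int, update_h: int, server_scale: float):
    ww = math.ceil(update_w * server_scale)
    wh = math.ceil(update_h * server_scale)
    return ww, wh, ww / update_w, wh / update_h   # wire dims + effective per-axis scale

print(wire_size(3, 3, 0.5))
# -> (2, 2, 0.666..., 0.666...): encoded at ~67%, and the client scales 2x2 back up to 3x3 (150%)
```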

i get the motivation for having the back buffer at 100% scale: you don't want to have the back buffer at 50%, receive an update of 2x2 pixels at 67% scale, and try to apply that to 1.5x1.5 pixels using some kind of subpixel rendering. still, doing the serial scalings just works very poorly; it should be avoided. i also get that the codebase has a history, and that you'd do a lot of things differently with hindsight, so i'll try to center my writing on stuff that can actually be done relatively easily as things stand.

there's no way you can have a good quality 70% scaled-down desktop if you scale it down, then up, then down again. also, with several code paths for each scaling, this turns into a testing nightmare. keep in mind that you just said you might not support downscaling at all: the solution i'll propose is less efficient but works well, which is much better than no solution at all.

so right now you have S = desktop-scaling:

but with the current codebase, making a 70% desktop work well requires SS = 1 and CS = 0.7 (server-side and client-side scaling, respectively). it is less efficient, but it works well.

treating a 40% desktop this way, though, makes it extremely inefficient. for 40%, an acceptable solution might be SS = 0.5 and CS = 0.8: the scale down to 50%, then back up to 100%, then down again to 80% will not really distort the image much, since the first 2 scalings have simple 1:2 ratios, so the aliased spatial frequencies will mostly be masked. this also applies to a plain 50% desktop scaling.

so you might want something like this:

with these settings, if someone requests a 51% desktop scaling you'll end up with SS = 1 and CS = 0.51, which is what we wanted... but inefficient. you might prefer SS = 0.5 and CS = 1.02 for this case. a 60% desktop might be the same. you can account for this with something called transmission-scaling or wire-scaling or encode-scaling or whatever; call it WS for now:

WS should generally be <= 1. the default could be 1, or something very close to 1 to accommodate rounding errors. if WS = 0.833, then a 60% desktop will cause a 50% server scaledown and a 120% client scaleup.

WS is interesting in other cases, eg: imagine you have a 4K client and server. you want 1:1 scaling, but you don't want those huge buffer updates over the network. then you can just set WS = 0.5 and you'll get a 50% reduction of update sizes (a 75% reduction in area) over the network.
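one way to read the WS idea, in code (my reconstruction, not anything that exists in xpra): pick SS as the smallest inverse power of 2 that is still >= D * WS, then let the client make up the difference with CS = D / SS.

```python
def pick_scales(desktop_scale: float, wire_scale: float = 1.0):
    target = desktop_scale * wire_scale
    ss = 1.0
    while ss / 2 >= target:        # walk down 1, 1/2, 1/4, ... while still >= target
        ss /= 2
    cs = desktop_scale / ss        # the client makes up whatever the server did not do
    return ss, cs

print(pick_scales(0.70))           # (1.0, 0.7)  -> correct but inefficient
print(pick_scales(0.60, 0.833))    # (0.5, 1.2)  -> 50% server scaledown, 120% client scaleup
print(pick_scales(0.51, 0.833))    # (0.5, 1.02)
print(pick_scales(0.40))           # (0.5, 0.8)
```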


once the above is done, server scaling would be powers of 2 only. with this setup, it is much easier to later change the client so that it does not have an upscaled back buffer. you'd do it like this:

if an update involves pixels from range X1 to X2 inclusive of some window or the desktop of horizontal size XS (coords from 0 to XS - 1 inclusive):

where CX1 and CX2 are the corrected coords where the update should be obtained from. the CX1 to CX2 range will always contain the original X1 to X2 range, and will always select a server pixel range that will neatly correspond to an integer pixel range in the downscaled client back buffer.

there is a problem though: if the server window or desktop is not a multiple of F in size, then CX2 can potentially exceed the size of the server window (sampling content outside that window) or of the desktop (sampling out-of-range pixels). this can be avoided by discarding the rightmost columns of pixels of the server window that would not be thick enough to map to a complete pixel column in the client. in essence, avoid subpixel rendering in the client by discarding the single rightmost client column if that column is only fractionally covered by the rightmost part of the server image. we do it like this:

note that with this change, sometimes CX2 can be less than CX1. this means that the complete update lies in the rightmost discarded server columns, so the update event must be ignored and discarded in the server.
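a hypothetical helper showing my reading of the rule described above (not xpra code), where F is the inverse of the server scaling factor (2, 4, ...) and XS the server width:

```python
def snap_update_range(x1: int, x2: int, xs: int, f: int):
    """Snap an update range [x1, x2] (inclusive) to F-aligned server columns."""
    cx1 = (x1 // f) * f                      # snap the left edge down to a multiple of F
    cx2 = -(-(x2 + 1) // f) * f - 1          # snap the right edge up to a multiple of F, minus 1
    last_full = (xs // f) * f - 1            # last column still mapping to a full client column
    cx2 = min(cx2, last_full)                # discard the fractional rightmost columns
    if cx2 < cx1:
        return None                          # update lies entirely in the discarded strip: drop it
    return cx1, cx2

print(snap_update_range(3, 3, 501, 2))       # (2, 3): widened to an F-aligned range
print(snap_update_range(500, 500, 501, 2))   # None: the lone 501st column never reaches the client
```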

these changes, which allow having a scaled down client back buffer, are not something that i'm suggesting you do now. but you can do them in the future with relative ease, once server scaling is constrained to inverses of powers of 2 (which i believe you should do now). a scaled down client back buffer can add a lot of efficiency client-side, especially for cases where a normal client is displaying a high DPI server.

side note: server scaling factors other than powers of 2 could be implemented, such as fractions of small integers like 2/3. the upside is more efficient transmission of updates. the downsides are: 1) more of the rightmost edge of the server image (and bottom edge of course) would have to be discarded in the general case, or else subpixel rendering must be addressed, which increases complexity. 2) each transmitted update will grow in area WRT the area that was really updated, making the partial update less efficient. (both downsides degenerate to pathological levels when irreducible fractions of large integers such as 2017/3000 are used as scaling factors.) i don't think that you'll want to develop anything beyond factor of 2 server scalings, but i can help you if the time comes.


i'd suggest that in all new developments you strive to treat X and Y completely independently, or you might never get to it later. all scalings can be treated like that. you don't need to modify the UI, the UI can be changed later. so prefer to write f(double scale_x, double scale_y) {...} even if you'll be calling it for now like f(scale, scale).

regarding scaling algorithms and which filters to use, i'll write later.

totaam commented 3 weeks ago

as i understand it, in the client you are coalescing a bunch of partial screen updates into a buffer the size of the original server window, even if they are transmitted scaled down (they are never transmitted scaled up, as per your description).

Exactly.

you ceil (the sibling of the floor function) those values to 2x2

No, we never scale such small sizes. We are normally dealing with VGA or higher when downscaling, and usually this applies to "video"-like source content. When that happens, we round the source dimensions down to the nearest multiple of 2 and send the missing line and / or row using plain rgb. ie: 501x256 at 1:2 scaling sends 250x128 h264 plus 1x256 rgb. Re-assembled into a single update client side.
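The arithmetic of that example, spelled out (a sketch of the sizes involved, not the actual code path):

```python
w, h, scale = 501, 256, 0.5
even_w, even_h = (w // 2) * 2, (h // 2) * 2              # 500 x 256: even sizes for YUV420
video_size = (int(even_w * scale), int(even_h * scale))  # 250 x 128 sent as h264
rgb_strip = (w - even_w, h)                              # 1 x 256 sent as plain rgb
print(video_size, rgb_strip)                             # (250, 128) (1, 256)
```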

WS is interesting in other cases, eg: imagine you have 4K client and server. you want 1:1 scaling, but you don't want those huge buffer updates over the network. then you can just set WS = 0.5 and you'll get a 50% reduction of update sizes (75% reduction in area) over the network.

That's already the case. I should have made it clearer. The desktop-scaling switch applies to the whole session, but the server is free to use more aggressive scaling as needed - per window. The quality, speed and video-scaling options are used to control how much automatic scaling is applied. Then there are a bunch of heuristics for figuring out what type of content we are dealing with (ie: never downscale text windows, even during bursts of updates; rough framerate; video encoder warm-up cost; etc)

once server scaling is constrained to inverses of powers of 2 (which i believe you should do now)

I'm not sure about this one. Powers of 2 are big steps, the automatic scaling uses smaller steps and this prevents flip-flopping between scaling ratios: switching to a higher scaling causes the CPU usage and bandwidth to drop, which causes the automatic quality setting to go back up, which causes the engine to switch to a lower scaling, rinse, repeat. We have counter-measures for that, but these are likely not sufficient for powers-of-2.

Lanchon commented 3 weeks ago

thanks for all the explanations.

i'm afraid i can't provide much of value because i don't know how the system works.

No, we never scale such small sizes.

i imagined; that was an extreme example with simple numbers. but still... say i'm using 70% desktop scaling and a character is drawn on the desktop. you are sending stuff scaled to 70% over the wire in this case, right? how do you send this character then?

ie: 501x256 at 1:2 scaling sends 250x128 h264 plus 1x256 rgb. Re-assembled into a single update client side.

assuming your libs allow it, you'll get better results in appearance and bandwidth by scaling 501x256 to 251x128 (lowest of sizes where scaling >= 0.5, for each axis) resulting in 1.996...:1 x 2:1 scaling. (cutting videos or images in pieces will definitely induce seams.) when you receive the 251x128 image, you stretch it back to 501x256. you need adequate scaling filters on both sides.
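in numbers (again just a sketch): pick the smallest per-axis size that keeps the scaling factor >= 0.5, instead of splitting off an rgb strip.

```python
import math

w, h = 501, 256
dst = (math.ceil(w * 0.5), math.ceil(h * 0.5))   # (251, 128)
print(dst, (w / dst[0], h / dst[1]))             # per-axis factors: ~1.996:1 and exactly 2:1
```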

The desktop-scaling switch applies to the whole session, but the server is free to use more aggressive scaling as needed - per window. The quality, speed and video-scaling options are used to control how much automatic scaling is applied.

i thought scaling was applied statically as requested, and bandwidth was regulated by frame rate and quality settings of codecs.

i assume that xpra generally gets small window content update events, except in shadow mode where it probably just regularly takes full screenshots.

for the shadow case, IMHO no amount of heuristics will work better than a correctly configured video codec that was developed to screencast computer displays as well as regular videos. things like scrolling require motion estimation, which you can't do on the CPU efficiently. the right coder emits I frames very seldom or simply just once, P frames almost always, and never B frames. an encoder can decode its own output, calculate the error, and feed the error back to its input for compensation on the next P frame. this way, if decoders on both ends work exactly the same on the same data, regular I frames are never needed.

a codec will outperform heuristics, including downscaling the input to lower the bit rate (and inconsistencies will be seen when the heuristic kicks in). the codec should perform much better than downscaling just by lowering its quality. imagine trying to compress audio by lowering its sample frequency: yes, you compress it, but the quality cost is enormous compared to, say, plain old mp3 encoding.

the only metric not outperformed would be computational cost. but i believe the user should be able to select settings according to their desired performance instead of trying to auto adapt computational cost. (IMHO ok to adapt to network conditions, not ok to try to adapt to server and client computational loads.) a user having a big GPU may not want to eat up 150W of power to maximize the quality of a remote desktop doing extreme motion estimation to detect large scrolls. or maybe they want to devote the GPU to remote desktop while the CPU is used as lightly as possible. a user may prefer to stress the server more than the client because they are on battery. etc...

for the window case it's less clear, because xpra receives metadata about update events. a good codec will still result in better efficiency when fed the full window, but the processing cost of the encoder having to detect what xpra receives as metadata for free can be significant, especially for cases where a lot of windows are being remoted. ideally the metadata would inform the motion estimator of the encoder, saving a lot of processing; but you won't implement that. maybe at some point two modes of operation can exist, the current one and a simpler one that just video-encodes the window. in time, more and more machines will have video compression hardware, and not doing that might hinder performance not only in quality terms but in energy consumption too.

heuristics [...] ie: never downscale text windows - even during bursts of updates

Powers of 2 are big steps, the automatic scaling uses smaller steps and this prevents flip-flopping between scaling ratios

well if i ask for a 70% desktop scaling, given that you've told me that downscaling happens on the server by default, AFAICT you do downscale text.

so i'm guessing there are manual scaling settings and on top of those some additional dynamic scaling based on heuristics? so then this auto scaling is independent of manual scaling and it does not affect text? great! then you can lock the manual scaling to powers of 2 as discussed earlier and leave the auto scaling doing whatever it does on top of it.

i don't know how this auto scaling works or where exactly in the pipeline it lies, but i imagine it would interfere with the objective of having a downscaled back buffer in the client. well, the downscaled buffer is a later problem: you can start by locking manual server scaling to powers of 2, because otherwise scaling down then up then down again by arbitrary factors kills the quality of everything, including text, even if text is not affected by the auto scaling heuristics.

BTW, if you can't have downscaled client back buffers because of the complex heuristics, which means that in HDPI cases you may consistently have 4 times the area in all buffers, then that is another argument in favor of not using CPU heuristics and instead using a correct video codec for the complete window, for efficiency reasons (if you have the encoder hardware in the server). (i'm sure downscaled back buffers can be done, but you might not want to revisit the old complex code; hence a video codec-only mode makes sense in the future.)


how to scale

for now you only pointed to one library, libyuv: https://github.com/lemenkov/libyuv/blob/main/docs/filtering.md

it has modes: https://github.com/lemenkov/libyuv/blob/679e851f653866a49e21f69fe8380bd20123f0ee/include/libyuv/scale.h#L21-L27

this means that libyuv can't scale down by factors near 1:1 in a reasonable way. unfortunately these scaledown factors are the ones xpra would use, say in the range 0.25 to 1.

so how to implement?

there should only be 2 user-selected modes (as far as libyuv goes): HQ and LQ. (IMHO, HQ should be the default and the selection should never be automatic: i don't want a horrible screen all of a sudden, i'd always prefer slower updates... unless i explicitly say so.)

but i wouldn't: bilinear is bad for downscaling, i would choose s_cutoff around 1/sqrt(2). scale values of 1/2 and 2/3 are special in that they produce the same output for bilinear and box (well, it actually depends on spatial phases, but mostly). so those values would make sense as cutoffs, as they provide some semblance of continuity across scaling factors. so i'd choose s_cutoff = 2/3.
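as a sketch of that selection (my own helper names; "BOX", "BILINEAR" and "NONE" stand in for libyuv's kFilterBox, kFilterBilinear and kFilterNone):

```python
S_CUTOFF = 2.0 / 3.0   # below this downscale factor, bilinear aliases too much

def pick_filter(scale_x: float, scale_y: float, high_quality: bool = True) -> str:
    scale = min(scale_x, scale_y)
    if scale_x == scale_y == 1.0:
        return "NONE"                    # 1:1 needs no filtering at all
    if not high_quality:
        return "BILINEAR"                # LQ mode: cheap, tolerable for mild factors
    return "BILINEAR" if scale >= S_CUTOFF else "BOX"

print(pick_filter(0.7, 0.7))   # BILINEAR: above the 2/3 cutoff
print(pick_filter(0.4, 0.4))   # BOX: proper averaging for strong downscales
```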

but...

look at this: https://github.com/lemenkov/libyuv/blob/679e851f653866a49e21f69fe8380bd20123f0ee/source/scale_common.cc#L1837-L1879

idk, i haven't followed the code. but it might be the case that those restrictions are applied on top of your filter choices (effectively prohibiting you from using, say, box for factors above 0.5). in this case, your filter selection reduces to:

but if this were the case, one has to ask... why even expose all this in libyuv's API instead of just a binary speed/quality switch?

regarding choosing up/down scale filters based on content... what would be the rationale? if you want visual quality, then absolutely not. (*) and if you want to reduce computational load, again IMHO you shouldn't do this: between client and server you have 2 CPUs and 2 GPUs, maybe more, and they can all be very different; you'll never do this right. computational load should be adjusted statically with flags and sensible defaults (which IMHO should lean towards high quality, as hardware gets cheaper all the time).

(*) except for very specific cases such as pixel art scaled up by specific factors, or AI that looks into the content of the image, a good scaling algorithm will be good for everything: text, diagrams, photos, etc. see demos: upscale, downscale.

totaam commented 3 weeks ago

assuming your libs allow it, you'll get better results in appearance and bandwidth by scaling 501x256 to 251x128 (lowest of sizes where scaling >= 0.5, for each axis) resulting in 1.996...:1 x 2:1 scaling

They do not allow it for the simple reason that it's much cheaper to perform RGB-to-YUV conversion before doing the scaling so that you only deal with half the data. And YUV420 subsampling requires even sizes as input.
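The "half the data" arithmetic, assuming 24-bit RGB versus 8-bit YUV420 planes:

```python
w, h = 1920, 1080
rgb_bytes = w * h * 3               # 3 bytes per pixel
yuv420_bytes = w * h * 3 // 2       # full-size Y plane + quarter-size U and V planes
print(yuv420_bytes / rgb_bytes)     # 0.5: scaling after the conversion touches half the bytes
```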

i assume that xpra generally gets small window content update events

Not always: play a video in vlc and that window will be 100% video content. We detect that.

no amount of heuristics will work better than a correctly configured video codec

No, a software encoder (ie: x264) simply cannot deal with 4K input at a half-decent framerate, so you need to be able to choose what to sacrifice.

and never B frames

If you're encoding video content, b-frames help, but they're a real pain to deal with.

this way, if decoders on both ends work exactly the same on the same data, regular I frames are never needed.

We don't use I frames at all: "open-gop". This is meant to change with the quic transport: we will need the ability to insert key frames to deal with dropped UDP packets.

a codec will outperform heuristics including downscaling the input for lowering bit rate

FYI: driving the bit-rate is hard, and it often reacts too slowly compared to lowering the framerate.

but i believe the user should be able to select settings

Users want things to work, they don't want to understand the intricacies of codec compression. You would not believe the number of tickets and reports I have had where people make the wrong assumption about what a setting does or what impact it has on bandwidth / latency / quality / ..

especially for cases where a lot of windows are being remoted

Then there's also the case of having hardware limits on the number of simultaneous GPU encoding contexts you can have!

but you won't implement that.

Why not?

maybe at some point two modes of operation can exist, the current one and a simpler one that just video-encodes the window

Try --encoding=stream.

you can start by locking manual server scaling to powers of 2

Yes, I think that's a good compromise. The client should be able to handle whatever the server throws at it.

a good scaling algorithm will be good for everything

It's a shame that this page only shows upscaled text and not downscaled: https://en.wikipedia.org/wiki/Comparison_gallery_of_image_scaling_algorithms I was under the impression that using bilinear down and up was giving me a blurrier result than using "dumb" nearest then bilinear.